Tag Archives: Application Services*

How we ensure Cloudflare customers aren’t affected by Let’s Encrypt’s certificate chain change

2024-04-12 Dina Kozlov

Post Syndicated from Dina Kozlov original https://blog.cloudflare.com/shortening-lets-encrypt-change-of-trust-no-impact-to-cloudflare-customers

Let’s Encrypt, a publicly trusted certificate authority (CA) that Cloudflare uses to issue TLS certificates, has been relying on two distinct certificate chains. One is cross-signed with IdenTrust, a globally trusted CA that has been around since 2000, and the other is Let’s Encrypt’s own root CA, ISRG Root X1. Since Let’s Encrypt launched, ISRG Root X1 has been steadily gaining its own device compatibility.

On September 30, 2024, Let’s Encrypt’s certificate chain cross-signed with IdenTrust will expire. After the cross-sign expires, servers will no longer be able to serve certificates signed by the cross-signed chain. Instead, all Let’s Encrypt certificates will use the ISRG Root X1 CA.

Most devices and browser versions released after 2016 will not experience any issues as a result of the change since the ISRG Root X1 will already be installed in those clients’ trust stores. That’s because these modern browsers and operating systems were built to be agile and flexible, with upgradeable trust stores that can be updated to include new certificate authorities.

The change in the certificate chain will impact legacy devices and systems, such as devices running Android version 7.1.1 (released in 2016) or older, as those exclusively rely on the cross-signed chain and lack the ISRG X1 root in their trust store. These clients will encounter TLS errors or warnings when accessing domains secured by a Let’s Encrypt certificate. We took a look at the data ourselves and found that, of all Android requests, 2.96% of them come from devices that will be affected by the change. That’s a substantial portion of traffic that will lose access to the Internet. We’re committed to keeping those users online and will modify our certificate pipeline so that we can continue to serve users on older devices without requiring any manual modifications from our customers.

A better Internet, for everyone

In the past, we invested in efforts like “No Browsers Left Behind” to help ensure that we could continue to support clients as SHA-1 based algorithms were being deprecated. Now, we’re applying the same approach for the upcoming Let’s Encrypt change.

We have made the decision to remove Let’s Encrypt as a certificate authority from all flows where Cloudflare dictates the CA, impacting Universal SSL customers and those using SSL for SaaS with the “default CA” choice.

Starting in June 2024, one certificate lifecycle (90 days) before the cross-sign chain expires, we’ll begin migrating Let’s Encrypt certificates that are up for renewal to use a different CA, one that ensures compatibility with older devices affected by the change. That means that going forward, customers will only receive Let’s Encrypt certificates if they explicitly request Let’s Encrypt as the CA.

The change that Let’s Encrypt is making is a necessary one. For us to move forward in supporting new standards and protocols, we need to make the Public Key Infrastructure (PKI) ecosystem more agile. By retiring the cross-signed chain, Let’s Encrypt is pushing devices, browsers, and clients to support adaptable trust stores.

However, we’ve observed changes like this in the past and while they push the adoption of new standards, they disproportionately impact users in economically disadvantaged regions, where access to new technology is limited.

Our mission is to help build a better Internet and that means supporting users worldwide. We previously published a blog post about the Let’s Encrypt change, asking customers to switch their certificate authority if they expected any impact. However, determining the impact of the change is challenging. Error rates due to trust store incompatibility are primarily logged on clients, reducing the visibility that domain owners have. In addition, while there might be no requests incoming from incompatible devices today, it doesn’t guarantee uninterrupted access for a user tomorrow.

Cloudflare’s certificate pipeline has evolved over the years to be resilient and flexible, allowing us to seamlessly adapt to changes like this without any negative impact to our customers.

How Cloudflare has built a robust TLS certificate pipeline

Today, Cloudflare manages tens of millions of certificates on behalf of customers. For us, a successful pipeline means:

Customers can always obtain a TLS certificate for their domain
CA related issues have zero impact on our customer’s ability to obtain a certificate
The best security practices and modern standards are utilized
Optimizing for future scale
Supporting a wide range of clients and devices

Every year, we introduce new optimizations into our certificate pipeline to maintain the highest level of service. Here’s how we do it…

Ensuring customers can always obtain a TLS certificate for their domain

Since the launch of Universal SSL in 2014, Cloudflare has been responsible for issuing and serving a TLS certificate for every domain that’s protected by our network. That might seem trivial, but there are a few steps that have to successfully execute in order for a domain to receive a certificate:

Domain owners need to complete Domain Control Validation for every certificate issuance and renewal.
The certificate authority needs to verify the Domain Control Validation tokens to issue the certificate.
CAA records, which dictate which CAs can be used for a domain, need to be checked to ensure only authorized parties can issue the certificate.
The certificate authority must be available to issue the certificate.

Each of these steps requires coordination across a number of parties — domain owners, CDNs, and certificate authorities. At Cloudflare, we like to be in control when it comes to the success of our platform. That’s why we make it our job to ensure each of these steps can be successfully completed.

We ensure that every certificate issuance and renewal requires minimal effort from our customers. To get a certificate, a domain owner has to complete Domain Control Validation (DCV) to prove that it does in fact own the domain. Once the certificate request is initiated, the CA will return DCV tokens which the domain owner will need to place in a DNS record or an HTTP token. If you’re using Cloudflare as your DNS provider, Cloudflare completes DCV on your behalf by automatically placing the TXT token returned from the CA into your DNS records. Alternatively, if you use an external DNS provider, we offer the option to Delegate DCV to Cloudflare for automatic renewals without any customer intervention.

Once DCV tokens are placed, Certificate Authorities (CAs) verify them. CAs conduct this verification from multiple vantage points to prevent spoofing attempts. However, since these checks are done from multiple countries and ASNs (Autonomous Systems), they may trigger a Cloudflare WAF rule which can cause the DCV check to get blocked. We made sure to update our WAF and security engine to recognize that these requests are coming from a CA to ensure they’re never blocked so DCV can be successfully completed.

Some customers have CA preferences, due to internal requirements or compliance regulations. To prevent an unauthorized CA from issuing a certificate for a domain, the domain owner can create a Certification Authority Authorization (CAA) DNS record, specifying which CAs are allowed to issue a certificate for that domain. To ensure that customers can always obtain a certificate, we check the CAA records before requesting a certificate to know which CAs we should use. If the CAA records block all of the CAs that are available in Cloudflare’s pipeline and the customer has not uploaded a certificate from the CA of their choice, then we add CAA records on our customers’ behalf to ensure that they can get a certificate issued. Where we can, we optimize for preference. Otherwise, it’s our job to prevent an outage by ensuring that there’s always a TLS certificate available for the domain, even if it does not come from a preferred CA.

Today, Cloudflare is not a publicly trusted certificate authority, so we rely on the CAs that we use to be highly available. But, 100% uptime is an unrealistic expectation. Instead, our pipeline needs to be prepared in case our CAs become unavailable.

Ensuring that CA-related issues have zero impact on our customer’s ability to obtain a certificate

At Cloudflare, we like to think ahead, which means preventing incidents before they happen. It’s not uncommon for CAs to become unavailable — sometimes this happens because of an outage, but more commonly, CAs have maintenance periods every so often where they become unavailable for some period of time.

It’s our job to ensure CA redundancy, which is why we always have multiple CAs ready to issue a certificate, ensuring high availability at all times. If you’ve noticed different CAs issuing your Universal SSL certificates, that’s intentional. We evenly distribute the load across our CAs to avoid any single point of failure. Plus, we keep a close eye on latency and error rates to detect any issues and automatically switch to a different CA that’s available and performant. You may not know this, but one of our CAs has around 4 scheduled maintenance periods every month. When this happens, our automated systems kick in seamlessly, keeping everything running smoothly. This works so well that our internal teams don’t get paged anymore because everything just works.

Adopting best security practices and modern standards

Security has always been, and will continue to be, Cloudflare’s top priority, and so maintaining the highest security standards to safeguard our customer’s data and private keys is crucial.

Over the past decade, the CA/Browser Forum has advocated for reducing certificate lifetimes from 5 years to 90 days as the industry norm. This shift helps minimize the risk of a key compromise. When certificates are renewed every 90 days, their private keys remain valid for only that period, reducing the window of time that a bad actor can make use of the compromised material.

We fully embrace this change and have made 90 days the default certificate validity period. This enhances our security posture by ensuring regular key rotations, and has pushed us to develop tools like DCV Delegation that promote automation around frequent certificate renewals, without the added overhead. It’s what enables us to offer certificates with validity periods as low as two weeks, for customers that want to rotate their private keys at a high frequency without any concern that it will lead to certificate renewal failures.

Cloudflare has always been at the forefront of new protocols and standards. It’s no secret that when we support a new protocol, adoption skyrockets. This month, we will be adding ECDSA support for certificates issued from Google Trust Services. With ECDSA, you get the same level of security as RSA but with smaller keys. Smaller keys mean smaller certificates and less data passed around to establish a TLS connection, which results in quicker connections and faster loading times.

Optimizing for future scale

Today, Cloudflare issues almost 1 million certificates per day. With the recent shift towards shorter certificate lifetimes, we continue to improve our pipeline to be more robust. But even if our pipeline can handle the significant load, we still need to rely on our CAs to be able to scale with us. With every CA that we integrate, we instantly become one of their biggest consumers. We hold our CAs to high standards and push them to improve their infrastructure to scale. This doesn’t just benefit Cloudflare’s customers, but it helps the Internet by requiring CAs to handle higher volumes of issuance.

And now, with Let’s Encrypt shortening their chain of trust, we’re going to add an additional improvement to our pipeline — one that will ensure the best device compatibility for all.

Supporting all clients — legacy and modern

The upcoming Let’s Encrypt change will prevent legacy devices from making requests to domains or applications that are protected by a Let’s Encrypt certificate. We don’t want to cut off Internet access from any part of the world, which means that we’re going to continue to provide the best device compatibility to our customers, despite the change.

Because of all the recent enhancements, we are able to reduce our reliance on Let’s Encrypt without impacting the reliability or quality of service of our certificate pipeline. One certificate lifecycle (90 days) before the change, we are going to start shifting certificates to use a different CA, one that’s compatible with the devices that will be impacted. By doing this, we’ll mitigate any impact without any action required from our customers. The only customers that will continue to use Let’s Encrypt are ones that have specifically chosen Let’s Encrypt as the CA.

What to expect of the upcoming Let’s Encrypt change

Let’s Encrypt’s cross-signed chain will expire on September 30th, 2024. Although Let’s Encrypt plans to stop issuing certificates from this chain on June 6th, 2024, Cloudflare will continue to serve the cross-signed chain for all Let’s Encrypt certificates until September 9th, 2024.

90 days or one certificate lifecycle before the change, we are going to start shifting Let’s Encrypt certificates to use a different certificate authority. We’ll make this change for all products where Cloudflare is responsible for the CA selection, meaning this will be automatically done for customers using Universal SSL and SSL for SaaS with the “default CA” choice.

Any customers that have specifically chosen Let’s Encrypt as their CA will receive an email notification with a list of their Let’s Encrypt certificates and information on whether or not we’re seeing requests on those hostnames coming from legacy devices.

After September 9th, 2024, Cloudflare will serve all Let’s Encrypt certificates using the ISRG Root X1 chain. Here is what you should expect based on the certificate product that you’re using:

Universal SSL

With Universal SSL, Cloudflare chooses the CA that is used for the domain’s certificate. This gives us the power to choose the best certificate for our customers. If you are using Universal SSL, there are no changes for you to make to prepare for this change. Cloudflare will automatically shift your certificate to use a more compatible CA.

Advanced Certificates

With Advanced Certificate Manager, customers specifically choose which CA they want to use. If Let’s Encrypt was specifically chosen as the CA for a certificate, we will respect the choice, because customers may have specifically chosen this CA due to internal requirements, or because they have implemented certificate pinning, which we highly discourage.

If we see that a domain using an Advanced certificate issued from Let’s Encrypt will be impacted by the change, then we will send out email notifications to inform those customers which certificates are using Let’s Encrypt as their CA and whether or not those domains are receiving requests from clients that will be impacted by the change. Customers will be responsible for changing the CA to another provider, if they chose to do so.

SSL for SaaS

With SSL for SaaS, customers have two options: using a default CA, meaning Cloudflare will choose the issuing authority, or specifying which CA to use.

If you’re leaving the CA choice up to Cloudflare, then we will automatically use a CA with higher device compatibility.

If you’re specifying a certain CA for your custom hostnames, then we will respect that choice. We will send an email out to SaaS providers and platforms to inform them which custom hostnames are receiving requests from legacy devices. Customers will be responsible for changing the CA to another provider, if they chose to do so.

Custom Certificates

If you directly integrate with Let’s Encrypt and use Custom Certificates to upload your Let’s Encrypt certs to Cloudflare then your certificates will be bundled with the cross-signed chain, as long as you choose the bundle method “compatible” or “modern” and upload those certificates before September 9th, 2024. After September 9th, we will bundle all Let’s Encrypt certificates with the ISRG Root X1 chain. With the “user-defined” bundle method, we always serve the chain that’s uploaded to Cloudflare. If you upload Let’s Encrypt certificates using this method, you will need to ensure that certificates uploaded after September 30th, 2024, the date of the CA expiration, contain the right certificate chain.

In addition, if you control the clients that are connecting to your application, we recommend updating the trust store to include the ISRG Root X1. If you use certificate pinning, remove or update your pin. In general, we discourage all customers from pinning their certificates, as this usually leads to issues during certificate renewals or CA changes.

Conclusion

Internet standards will continue to evolve and improve. As we support and embrace those changes, we also need to recognize that it’s our responsibility to keep users online and to maintain Internet access in the parts of the world where new technology is not readily available. By using Cloudflare, you always have the option to choose the setup that’s best for your application.

For additional information regarding the change, please refer to our developer documentation.

Browser Rendering API GA, rolling out Cloudflare Snippets, SWR, and bringing Workers for Platforms to all users

2024-04-05 Tanushree Sharma

Post Syndicated from Tanushree Sharma original https://blog.cloudflare.com/browser-rendering-api-ga-rolling-out-cloudflare-snippets-swr-and-bringing-workers-for-platforms-to-our-paygo-plans

Browser Rendering API is now available to all paid Workers customers with improved session management

In May 2023, we announced the open beta program for the Browser Rendering API. Browser Rendering allows developers to programmatically control and interact with a headless browser instance and create automation flows for their applications and products.

At the same time, we launched a version of the Puppeteer library that works with Browser Rendering. With that, developers can use a familiar API on top of Cloudflare Workers to create all sorts of workflows, such as taking screenshots of pages or automatic software testing.

Today, we take Browser Rendering one step further, taking it out of beta and making it available to all paid Workers’ plans. Furthermore, we are enhancing our API and introducing a new feature that we’ve been discussing for a long time in the open beta community: session management.

Session Management

Session management allows developers to reuse previously opened browsers across Worker’s scripts. Reusing browser sessions has the advantage that you don’t need to instantiate a new browser for every request and every task, drastically increasing performance and lowering costs.

Before, to keep a browser instance alive and reuse it, you’d have to implement complex code using Durable Objects. Now, we’ve simplified that for you by keeping your browsers running in the background and extending the Puppeteer API with new session management methods that give you access to all of your running sessions, activity history, and active limits.

Here’s how you can list your active sessions:

const sessions = await puppeteer.sessions(env.RENDERING);
console.log(sessions);
[
   {
      "connectionId": "2a2246fa-e234-4dc1-8433-87e6cee80145",
      "connectionStartTime": 1711621704607,
      "sessionId": "478f4d7d-e943-40f6-a414-837d3736a1dc",
      "startTime": 1711621703708
   },
   {
      "sessionId": "565e05fb-4d2a-402b-869b-5b65b1381db7",
      "startTime": 1711621703808
   }
]

We have added a Worker script example on how to use session management to the Developer Documentation.

Analytics and logs

Observability is an essential part of any Cloudflare product. You can find detailed analytics and logs of your Browser Rendering usage in the dashboard under your account’s Worker & Pages section.

Browser Rendering is now available to all customers with a paid Workers plan. Each account is limited to running two new browsers per minute and two concurrent browsers at no cost during this period. Check our developers page to get started.

We are rolling out access to Cloudflare Snippets

Powerful, programmable, and free of charge, Snippets are the best way to perform complex HTTP request and response modifications on Cloudflare. What was once too advanced to achieve using Rules products is now possible with Snippets. Since the initial announcement during Developer Week 2022, the promise of extending out-of-the-box Rules functionality by writing simple JavaScript code is keeping the Cloudflare community excited.

During the first 3 months of 2024 alone, the amount of traffic going through Snippets increased over 7x, from an average of 2,200 requests per second in early January to more than 17,000 in March.

However, instead of opening the floodgates and letting millions of Cloudflare users in to test (and potentially break) Snippets in the most unexpected ways, we are going to pace ourselves and opt for a phased rollout, much like the newly released Gradual Rollouts for Workers.

In the next few weeks, 5% of Cloudflare users will start seeing “Snippets” under the Rules tab of the zone-level menu in their dashboard. If you happen to be part of the first 5%, snip into action and try out how fast and powerful Snippets are even for advanced use cases like dynamically changing the date in headers or A / B testing leveraging the `math.random` function. Whatever you use Snippets for, just keep one thing in mind: this is still an alpha, so please do not use Snippets for production traffic just yet.

Until then, keep your eyes out for the new Snippets tab in the Cloudflare dashboard and learn more how powerful and flexible Snippets are at the developer documentation in the meantime.

Coming soon: asynchronous revalidation with stale-while-revalidate

One of the features most requested by our customers is the asynchronous revalidation with stale-while-revalidate (SWR) cache directive, and we will be bringing this to you in the second half of 2024. This functionality will be available by design as part of our new CDN architecture that is being built using Rust with performance and memory safety at top of mind.

Currently, when a client requests a resource, such as a web page or an image, Cloudflare checks to see if the asset is in cache and provides a cached copy if available. If the file is not in the cache or has expired and become stale, Cloudflare connects to the origin server to check for a fresh version of the file and forwards this fresh version to the end user. This wait time adds latency to these requests and impacts performance.

Stale-while-revalidate is a cache directive that allows the expired or stale version of the asset to be served to the end user while simultaneously allowing Cloudflare to check the origin to see if there’s a fresher version of the resource available. If an updated version exists, the origin forwards it to Cloudflare, updating the cache in the process. This mechanism allows the client to receive a response quickly from the cache while ensuring that it always has access to the most up-to-date content. Stale-while-revalidate strikes a balance between serving content efficiently and ensuring its freshness, resulting in improved performance and a smoother user experience.

Customers who want to be part of our beta testers and “cache” in on the fun can register here, and we will let you know when the feature is ready for testing!

Coming on April 16, 2024: Workers for Platforms for our pay-as-you-go plan

Today, we’re excited to share that on April 16th, Workers for Platforms will be available to all developers through our new $25 pay-as-you-go plan!

Workers for Platforms is changing the way we build software – it gives you the ability to embed personalization and customization directly into your product. With Workers for Platforms, you can deploy custom code on behalf of your users or let your users directly deploy their own code to your platform, without you or your users having to manage any infrastructure. You can use Workers for Platforms with all the exciting announcements that have come out this Developer Week – it supports all the bindings that come with Workers (including Workers AI, D1 and Durable Objects) as well as Python Workers.

Here’s what some of our customers – ranging from enterprises to startups – are building on Workers for Platforms:

Shopify Oxygen is a hosting platform for their Remix-based eCommerce framework Hydrogen, and it’s built on Workers for Platforms! The Hydrogen/Oxygen combination gives Shopify merchants control over their buyer experience without the restrictions of generic storefront templates.
Grafbase is a data platform for developers to create a serverless GraphQL API that unifies data sources across a business under one endpoint. They use Workers for Platforms to give their developers the control and flexibility to deploy their own code written in JavaScript/TypeScript or WASM.
Triplit is an open-source database that syncs data between server and browser in real-time. It allows users to build low latency, real-time applications with features like relational querying, schema management and server-side storage built in. Their query and sync engine is built on top of Durable Objects, and they’re using Workers for Platforms to allow their customers to package custom Javascript alongside their Triplit DB instance.

Tools for observability and platform level controls

Workers for Platforms doesn’t just allow you to deploy Workers to your platform – we also know how important it is to have observability and control over your users’ Workers. We have a few solutions that help with this:

Custom Limits: Set CPU time or subrequest caps on your users’ Workers. Can be used to set limits in order to control your costs on Cloudflare and/or shape your own pricing and packaging model. For example, if you run a freemium model on your platform, you can lower the CPU time limit for customers on your free tier.
Tail Workers: Tail Worker events contain metadata about the Worker, console.log() messages, and capture any unhandled exceptions. They can be used to provide your developers with live logging in order to monitor for errors and troubleshoot in real time.
Outbound Workers: Get visibility into all outgoing requests from your users’ Workers. Outbound Workers sit between user Workers and the fetch() requests they make, so you get full visibility over the request before it’s sent out to the Internet.

Pricing

We wanted to make sure that Workers for Platforms was affordable for hobbyists, solo developers, and indie developers. Workers for Platforms is part of a new $25 pay-as-you-go plan, and it includes the following:

	Included Amounts
Requests	20 million requests/month +$0.30 per additional million
CPU time	60 million CPU milliseconds/month +$0.02 per additional million CPU milliseconds
Scripts	1000 scripts +0.02 per additional script/month

Workers for Platforms will be available to purchase on April 16, 2024!

The Workers for Platforms will be available to purchase under the Workers for Platforms tab on the Cloudflare Dashboard on April 16, 2024.

In the meantime, to learn more about Workers for Platforms, check out our starter project and developer documentation.

Introducing Requests for Information (RFIs) and Priority Intelligence Requirements (PIRs) for threat intelligence teams

2024-03-08 Javier Castro

Post Syndicated from Javier Castro original https://blog.cloudflare.com/threat-intel-rfi-pir

Cloudforce One is our threat operations and research team. Its primary objective: track and disrupt threat actors targeting Cloudflare and the customer systems we protect. Cloudforce One customers can engage directly with analysts on the team to help understand and stop the specific threats targeting them.

Today, we are releasing in general availability two new tools that will help Cloudforce One customers get the best value out of the service by helping us prioritize and organize the information that matters most to them: Requests for Information (RFIs) and Priority Intelligence Requirements (PIRs). We’d also like to review how we’ve used the Cloudflare Workers and Pages platform to build our internal pipeline to not only perform investigations on behalf of our customers, but conduct our own internal investigations of the threats and attackers we track.

What are Requests for Information (RFIs)?

RFIs are designed to streamline the process of accessing critical intelligence. They provide an avenue for users to submit specific queries and requests directly into Cloudforce One’s analysis queue. Essentially, they are a well-structured way for you to tell the team what to focus their research on to best support your security posture.

Each RFI filed is routed to an analyst and treated as a targeted call for information on specific threat elements. From malware analysis to DDoS attack analysis, we have a group of seasoned threat analysts who can provide deeper insight into a wide array of attacks. Those who have found RFIs invaluable typically belong to Security Operation Centers, Incident Response Teams, and Threat Research/Intelligence teams dedicated to supporting internal investigations within an organization. This approach proves instrumental in unveiling potential vulnerabilities and enhancing the understanding of the security posture, especially when confronting complex risks.

Creating an RFI is straightforward. Through the Security Center dashboard, users can create and track their RFIs:

Submission: Submit requests via Cloudforce One RFI Dashboard:
a. Threat: The threat or campaign you would like more information on
b. Priority: routine, high or urgent
c. Type: Binary Analysis, Indicator Analysis, Traffic Analysis, Threat Detection Signature, Passive DNS Resolution, DDoS Attack or Vulnerability
d. Output: Malware Analysis Report, Indicators of Compromise, or Threat Research Report
Tracking: Our Threat Research team begins work and the customer can track progress (open, in progress, pending, published, complete) via the RFI Dashboard. Automated alerts are sent to the customer with each status change.
Delivery: Customers can access/download the RFI response via the RFI Dashboard.

*Fabricated example of the detailed view of an RFI and communication with the Cloudflare Threat Research Team*

Once an RFI is submitted, teams can stay informed about the progress of their requests through automated alerts. These alerts, generated when a Cloudforce One analyst has completed the request, are delivered directly to the user’s email or to a team chat channel via a webhook.

What are Priority Intelligence Requirements (PIRs)?

Priority Intelligence Requirements (PIRs) are a structured approach to identifying intelligence gaps, formulating precise requirements, and organizing them into categories that align with Cloudforce One’s overarching goals. For example, you can create a PIR signaling to the Cloudforce One team what topic you would like more information on.

*PIR dashboard with fictitious examples of priority intelligence requirements*

PIRs help target your intelligence collection efforts toward the most relevant insights, enabling you to make informed decisions and strengthen your organization’s cybersecurity posture.

While PIRs currently offer a framework for prioritizing intelligence requirements, our vision extends beyond static requirements. Looking ahead, our plan is to evolve PIRs into dynamic tools that integrate real-time intelligence from Cloudforce One. Enriching PIRs by integrating them with real-time intelligence from Cloudforce One will provide immediate insights into your Cloudflare environment, facilitating a direct and meaningful connection between ongoing threat intelligence and your predefined intelligence needs.

What drives Cloudforce One?

Since our inception, Cloudforce One has been actively collaborating with our Security Incident Response Team (SIRT) and Trust and Safety (T&S) team, aiming to provide valuable insights into attacks targeting Cloudflare and counteract the misuse of Cloudflare services. Throughout these investigations, we recognized the need for a centralized platform to capture insights from Cloudflare’s unique perspective on the Internet, aggregate data, and correlate reports.

In the past, our approach would have involved deploying a frontend UI and backend API in a core data center, leveraging common services like Postgres, Redis, and a Ceph storage solution. This conventional route would have entailed managing Docker deployments, constantly upgrading hosts for vulnerabilities, and dealing with a complex environment where we must juggle secrets, external service configurations, and maintaining availability.

Instead, we welcomed being Customer Zero for Cloudflare and fully embraced Cloudflare’s Workers and Pages platforms to construct a powerful threat investigation tool, and since then, we haven’t looked back. For anyone that has used Workers in the past, much of what we have done is not revolutionary, but almost commonplace given the ease of configuring and implementing the features in Cloudflare Workers. We routinely store file data in R2, metadata in KV, and indexed data in D1. That being said, we do have a few non-standard deployments as well, further outlined below.

Altogether, our Threats Investigation architecture consists of five services, four of which are deployed at the edge with the other one deployed in our core data centers due to data dependency constraints.

RFIs & PIRs: This API manages our formal Cloudforce One requests and customer priorities submitted via the Cloudflare Dashboard.
Threats: Our UI, deployed via Pages, serves as the interface for interacting with all of our Cloudforce One services, Cloudflare internal services, and the RFIs and PIRs submitted by our customers.
Cases: A case management system that allows analysts to store notes, Indicators of Compromise (IOCs), malware samples, and data analytics related to an attack. The service provides live updates to all analysts viewing the case, facilitating real-time collaboration. Each case is a Durable Object that is connected to via a Websocket that stores “files” and “file content” in the Durable Object’s persistent storage. Metadata for the case is made searchable via D1.
Leads: A queue of informal internal and external requests that may be reviewed by Cloudforce One when doing threat hunting discovery. Lead content is stored into KV, while metadata and extracted IOCs are stored in D1.
Binary DB: A raw binary file warehouse for any file we come across during our investigation. Binary DB also serves as the repository for malware samples used in some of our machine learning training. Each file is stored in R2, with its associated metadata stored in KV.

*Cloudforce One Threat Investigation Architecture*

At the heart of our Threats ecosystem is our case management service built on Workers and Durable Objects. We were inspired to build this tool because we often had to jump into collaborative documents that were not designed to store forensic data, organize it, mark sections with Traffic Light Protocol (TLP) releasability codes, and relate analysis to existing RFIs or Leads.

Our concept of cases is straightforward — each case is a Durable Object that can accept HTTP REST API or WebSocket connections. Upon initiating a WebSocket connection, it is seamlessly incorporated into the Durable Object’s in-memory state, allowing us to instantly broadcast real-time events to all users engaged with the case. Each case comprises distinct folders, each housing a collection of files containing content, releasability information, and file metadata.

Practically, our Durable Object leverages its persistent storage with each storage key prefixed with the value type: “case”, “folder”, or “file” followed by the UUID assigned to the file. Each case value has metadata associated with the case and a list of folders that belong to the case. Each folder has the folder’s name and a list of files that belong to it.

Our internal Threats UI helps us tie together the service integrations with our threat hunting analysis. It is here we do our day-to-day work which allows us to bring our unique insights into Cloudflare attacks. Below is an example of our Case Management in action where we tracked the RedAlerts attack before we formalized our analysis into the blog.

What good is all of this if we can’t search it? The Workers AI team launched Vectorize and enabled inference on the edge, so we decided to go all in on Workers and began indexing all case files as they’re being edited so that they can be searched. As each case file is being updated in the Durable Object, the content of the file is pushed to Cloudflare Queues. This data is consumed by an indexing engine consumer that does two things: extracts and indexes indicators of compromise, and embeds the content into a vector and pushes it into Vectorize. Both of the search mechanisms also pass the reference case and file identifiers so that the case may be found upon searching.

Given how easy it is to set up Workers AI, we took the final step of implementing a full Retrieval Augmented Generation (RAG) AI to allow analysts to ask questions about our previous analysis. Each question undergoes the same process as the content that is indexed. We pull out any indicators of compromise and embed the question into a vector, so we can use both results to search our indexes and Vectorize respectively, and provide the most relevant results for the request. Lastly, we send the vector data to a text-generation model using Workers AI that then returns a response to our analysts.

Using RFIs and PIRs

Imagine submitting an RFI for “Passive DNS Resolution – IOCs” and receiving real-time updates directly within the PIR, guiding your next steps.

Our workflow ensures that the intelligence you need is not only obtained but also used optimally. This approach empowers your team to tailor your intelligence gathering, strengthening your cybersecurity strategy and security posture.

Our mission for Cloudforce One is to equip organizations with the tools they need to stay one step ahead in the rapidly changing world of cybersecurity. The addition of RFIs and PIRs marks another milestone in this journey, empowering users with enhanced threat intelligence capabilities.

Getting started

Cloudforce One customers can already see the PIR and RFI Dashboard in their Security Center, and they can also use the API if they prefer that option. Click to see more documentation about our RFI and our PIR APIs.

If you’re looking to try out the new RFI and PIR capabilities within the Security Center, contact your Cloudflare account team or fill out this form and someone will be in touch. Finally, if you’re interested in joining the Cloudflare team, check out our open job postings here.

Eliminate VPN vulnerabilities with Cloudflare One

2024-03-06 Dan Hall

Post Syndicated from Dan Hall original https://blog.cloudflare.com/eliminate-vpn-vulnerabilities-with-cloudflare-one

On January 19, 2024, the Cybersecurity & Infrastructure Security Agency (CISA) issued Emergency Directive 24-01: Mitigate Ivanti Connect Secure and Ivanti Policy Secure Vulnerabilities. CISA has the authority to issue emergency directives in response to a known or reasonably suspected information security threat, vulnerability, or incident. U.S. Federal agencies are required to comply with these directives.

Federal agencies were directed to apply a mitigation against two recently discovered vulnerabilities; the mitigation was to be applied within three days. Further monitoring by CISA revealed that threat actors were continuing to exploit the vulnerabilities and had developed some workarounds to earlier mitigations and detection methods. On January 31, CISA issued Supplemental Direction V1 to the Emergency Directive instructing agencies to immediately disconnect all instances of Ivanti Connect Secure and Ivanti Policy Secure products from agency networks and perform several actions before bringing the products back into service.

This blog post will explore the threat actor’s tactics, discuss the high-value nature of the targeted products, and show how Cloudflare’s Secure Access Service Edge (SASE) platform protects against such threats.

As a side note and showing the value of layered protections, Cloudflare’s WAF had proactively detected the Ivanti zero-day vulnerabilities and deployed emergency rules to protect Cloudflare customers.

Threat Actor Tactics

Forensic investigations (see the Volexity blog for an excellent write-up) indicate that the attacks began as early as December 2023. Piecing together the evidence shows that the threat actors chained two previously unknown vulnerabilities together to gain access to the Connect Secure and Policy Secure appliances and achieve unauthenticated remote code execution (RCE).

CVE-2023-46805 is an authentication bypass vulnerability in the products’ web components that allows a remote attacker to bypass control checks and gain access to restricted resources. CVE-2024-21887 is a command injection vulnerability in the products’ web components that allows an authenticated administrator to execute arbitrary commands on the appliance and send specially crafted requests. The remote attacker was able to bypass authentication and be seen as an “authenticated” administrator, and then take advantage of the ability to execute arbitrary commands on the appliance.

By exploiting these vulnerabilities, the threat actor had near total control of the appliance. Among other things, the attacker was able to:

Harvest credentials from users logging into the VPN service
Use these credentials to log into protected systems in search of even more credentials
Modify files to enable remote code execution
Deploy web shells to a number of web servers
Reverse tunnel from the appliance back to their command-and-control server (C2)
Avoid detection by disabling logging and clearing existing logs

Little Appliance, Big Risk

This is a serious incident that is exposing customers to significant risk. CISA is justified in issuing their directive, and Ivanti is working hard to mitigate the threat and develop patches for the software on their appliances. But it also serves as another indictment of the legacy “castle-and-moat” security paradigm. In that paradigm, remote users were outside the castle while protected applications and resources remained inside. The moat, consisting of a layer of security appliances, separated the two. The moat, in this case the Ivanti appliance, was responsible for authenticating and authorizing users, and then connecting them to protected applications and resources. Attackers and other bad actors were blocked at the moat.

This incident shows us what happens when a bad actor is able to take control of the moat itself, and the challenges customers face to recover control. Two typical characteristics of vendor-supplied appliances and the legacy security strategy highlight the risks:

Administrators have access to the internals of the appliance
Authenticated users indiscriminately have access to a wide range of applications and resources on the corporate network, increasing the risk of bad actor lateral movement

A better way: Cloudflare’s SASE platform

Cloudflare One is Cloudflare’s SSE and single-vendor SASE platform. While Cloudflare One spans broadly across security and networking services (and you can read about the latest additions here), I want to focus on the two points noted above.

First, Cloudflare One employs the principles of Zero Trust, including the principle of least privilege. As such, users that authenticate successfully only have access to the resources and applications necessary for their role. This principle also helps in the event of a compromised user account as the bad actor does not have indiscriminate network-level access. Rather, least privilege limits the range of lateral movement that a bad actor has, effectively reducing the blast radius.

Second, while customer administrators need to have access to configure their services and policies, Cloudflare One does not provide any external access to the system internals of Cloudflare’s platform. Without that access, a bad actor would not be able to launch the types of attacks executed when they had access to the internals of the Ivanti appliance.

It’s time to eliminate the legacy VPN

If your organization is impacted by the CISA directive, or you are just ready to modernize and want to augment or replace your current VPN solution, Cloudflare is here to help. Cloudflare’s Zero Trust Network Access (ZTNA) service, part of the Cloudflare One platform, is the fastest and safest way to connect any user to any application.

Contact us to get immediate onboarding help or to schedule an architecture workshop to help you augment or replace your Ivanti (or any) VPN solution.
Not quite ready for a live conversation? Read our learning path article on how to replace your VPN with Cloudflare or our SASE reference architecture for a view of how all of our SASE services and on-ramps work together.

Simplifying how enterprises connect to Cloudflare with Express Cloudflare Network Interconnect

2024-03-06 Ben Ritter

Post Syndicated from Ben Ritter original https://blog.cloudflare.com/announcing-express-cni

We’re excited to announce the largest update to Cloudflare Network Interconnect (CNI) since its launch, and because we’re making CNIs faster and easier to deploy, we’re calling this Express CNI. At the most basic level, CNI is a cable between a customer’s network router and Cloudflare, which facilitates the direct exchange of information between networks instead of via the Internet. CNIs are fast, secure, and reliable, and have connected customer networks directly to Cloudflare for years. We’ve been listening to how we can improve the CNI experience, and today we are sharing more information about how we’re making it faster and easier to order CNIs, and connect them to Magic Transit and Magic WAN.

Interconnection services and what to consider

Interconnection services provide a private connection that allows you to connect your networks to other networks like the Internet, cloud service providers, and other businesses directly. This private connection benefits from improved connectivity versus going over the Internet and reduced exposure to common threats like Distributed Denial of Service (DDoS) attacks.

Cost is an important consideration when evaluating any vendor for interconnection services. The cost of an interconnection is typically comprised of a fixed port fee, based on the capacity (speed) of the port, and the variable amount of data transferred. Some cloud providers also add complex inter-region bandwidth charges.

Other important considerations include the following:

How much capacity is needed?
Are there variable or fixed costs associated with the port?
Is the provider located in the same colocation facility as my business?
Are they able to scale with my network infrastructure?
Are you able to predict your costs without any unwanted surprises?
What additional products and services does the vendor offer?

Cloudflare does not charge a port fee for Cloudflare Network Interconnect, nor do we charge for inter-region bandwidth. Using CNI with products like Magic Transit and Magic WAN may even reduce bandwidth spending with Internet service providers. For example, you can deliver Magic Transit-cleaned traffic to your data center with a CNI instead of via your Internet connection, reducing the amount of bandwidth that you would pay an Internet service provider for.

To underscore the value of CNI, one vendor charges nearly \$20,000 a year for a 10 Gigabit per second (Gbps) direct connect port. The same 10 Gbps CNI on Cloudflare for one year is $0. Their cost also does not include any costs related to the amount of data transferred between different regions or geographies, or outside of their cloud. We have never charged for CNIs, and are committed to making it even easier for customers to connect to Cloudflare, and destinations beyond on the open Internet.

3 Minute Provisioning

Our first big announcement is a new, faster approach to CNI provisioning and deployment. Starting today, all Magic Transit and Magic WAN customers can order CNIs directly from their Cloudflare account. The entire process is about 3 clicks and takes less than 3 minutes (roughly the time to make coffee). We’re going to show you how simple it is to order a CNI.

The first step is to find out whether Cloudflare is in the same data center or colocation facility as your routers, servers, and network hardware. Let’s navigate to the new “Interconnects” section of the Cloudflare dashboard, and order a new Direct CNI.

Search for the city of your data center, and quickly find out if Cloudflare is in the same facility. I’m going to stand up a CNI to connect my example network located in Ashburn, VA.

It looks like Cloudflare is in the same facility as my network, so I’m going to select the location where I’d like to connect.

As of right now, my data center is only exchanging a few hundred Megabits per second of traffic on Magic Transit, so I’m going to select a 1 Gigabit per second interface, which is the smallest port speed available. I can also order a 10 Gbps link if I have more than 1 Gbps of traffic in a single location. Cloudflare also supports 100 Gbps CNIs, but if you have this much traffic to exchange with us, we recommend that you coordinate with your account team.

After selecting your preferred port speed, you can name your CNI, which will be referenceable later when you direct your Magic Transit or Magic WAN traffic to the interconnect. We are given the opportunity to verify that everything looks correct before confirming our CNI order.

Once we click the “Confirm Order” button, Cloudflare will provision an interface on our router for your CNI, and also assign IP addresses for you to configure on your router interface. Cloudflare will also issue you a Letter of Authorization (LOA) for you to order a cross connect with the local facility. Cloudflare will provision a port on our router for your CNI within 3 minutes of your order, and you will be able to ping across the CNI as soon as the interface line status comes up.

After downloading the Letter of Authorization (LOA) to order a cross connect, we’ll navigate back to our Interconnects area. Here we can see the point to point IP addressing, and the CNI name that is used in our Magic Transit or Magic WAN configuration. We can also redownload the LOA if needed.

Simplified Magic Transit and Magic WAN onboarding

Our second major announcement is that Express CNI dramatically simplifies how Magic Transit and Magic WAN customers connect to Cloudflare. Getting packets into Magic Transit or Magic WAN in the past with a CNI required customers to configure a GRE (Generic Routing Encapsulation) tunnel on their router. These configurations are complex, and not all routers and switches support these changes. Since both Magic Transit and Magic WAN protect networks, and operate at the network layer on packets, customers rightly asked us, “If I connect directly to Cloudflare with CNI, why do I also need a GRE tunnel for Magic Transit and Magic WAN?”

Starting today, GRE tunnels are no longer required with Express CNI. This means that Cloudflare supports standard 1500-byte packets on the CNI, and there’s no need for complex GRE or MSS adjustment configurations to get traffic into Magic Transit or Magic WAN. This significantly reduces the amount of configuration required on a router for Magic Transit and Magic WAN customers who can connect over Express CNI. If you’re not familiar with Magic Transit, the key takeaway is that we’ve reduced the complexity of changes you must make on your router to protect your network with Cloudflare.

What’s next for CNI?

We’re excited about how Express CNI simplifies connecting to Cloudflare’s network. Some customers connect to Cloudflare through our Interconnection Platform Partners, like Equinix and Megaport, and we plan to bring the Express CNI features to our partners too.

We have upgraded a number of our data centers to support Express CNI, and plan to upgrade many more over the next few months. We are rapidly expanding the number of global locations that support Express CNI as we install new network hardware. If you’re interested in connecting to Cloudflare with Express CNI, but are unable to find your data center, please let your account team know.

If you’re on an existing classic CNI today, and you don’t need Express CNI features, there is no obligation to migrate to Express CNI. Magic Transit and Magic WAN customers have been asking for BGP support to control how Cloudflare routes traffic back to their networks, and we expect to extend BGP support to Express CNI first, so keep an eye out for more Express CNI announcements later this year.

Get started with Express CNI today

As we’ve demonstrated above, Express CNI makes it fast and easy to connect your network to Cloudflare. If you’re a Magic Transit or Magic WAN customer, the new “Interconnects” area is now available on your Cloudflare dashboard. To deploy your first CNI, you can follow along with the screenshots above, or refer to our updated interconnects documentation.

Cloudflare launches AI Assistant for Security Analytics

2024-03-04 Jen Sells

Post Syndicated from Jen Sells original https://blog.cloudflare.com/security-analytics-ai-assistant

Imagine you are in the middle of an attack on your most crucial production application, and you need to understand what’s going on. How happy would you be if you could simply log into the Dashboard and type a question such as: “Compare attack traffic between US and UK” or “Compare rate limiting blocks for automated traffic with rate limiting blocks from human traffic” and see a time series chart appear on your screen without needing to select a complex set of filters?

Today, we are introducing an AI assistant to help you query your security event data, enabling you to more quickly discover anomalies and potential security attacks. You can now use plain language to interrogate Cloudflare analytics and let us do the magic.

What did we build?

One of the big challenges when analyzing a spike in traffic or any anomaly in your traffic is to create filters that isolate the root cause of an issue. This means knowing your way around often complex dashboards and tools, knowing where to click and what to filter on.

On top of this, any traditional security dashboard is limited to what you can achieve by the way data is stored, how databases are indexed, and by what fields are allowed when creating filters. With our Security Analytics view, for example, it was difficult to compare time series with different characteristics. For example, you couldn’t compare the traffic from IP address x.x.x.x with automated traffic from Germany without opening multiple tabs to Security Analytics and filtering separately. From an engineering perspective, it would be extremely hard to build a system that allows these types of unconstrained comparisons.

With the AI Assistant, we are removing this complexity by leveraging our Workers AI platform to build a tool that can help you query your HTTP request and security event data and generate time series charts based on a request formulated with natural language. Now the AI Assistant does the hard work of figuring out the necessary filters and additionally can plot multiple series of data on a single graph to aid in comparisons. This new tool opens up a new way of interrogating data and logs, unconstrained by the restrictions introduced by traditional dashboards.

Now it is easier than ever to get powerful insights about your application security by using plain language to interrogate your data and better understand how Cloudflare is protecting your business. The new AI Assistant is located in the Security Analytics dashboard and works seamlessly with the existing filters. The answers you need are just a question away.

What can you ask?

To demonstrate the capabilities of AI Assistant, we started by considering the questions that we ask ourselves every day when helping customers to deploy the best security solutions for their applications.

We’ve included some clickable examples in the dashboard to get you started.

You can use the AI Assistant to

Identify the source of a spike in attack traffic by asking: “Compare attack traffic between US and UK”
Identify root cause of 5xx errors by asking: “Compare origin and edge 5xx errors”
See which browsers are most commonly used by your users by asking:”Compare traffic across major web browsers”
For an ecommerce site, understand what percentage of users visit vs add items to their shopping cart by asking: “Compare traffic between /api/login and /api/basket”
Identify bot attacks against your ecommerce site by asking: “Show requests to /api/basket with a bot score less than 20”
Identify the HTTP versions used by clients by asking: “Compare traffic by each HTTP version”
Identify unwanted automated traffic to specific endpoints by asking: “Show POST requests to /admin with a Bot Score over 30”

You can start from these when exploring the AI Assistant.

How does it work?

Using Cloudflare’s powerful Workers AI global network inference platform, we were able to use one of the off-the-shelf large language models (LLMs) offered on the platform to convert customer queries into GraphQL filters. By teaching an AI model about the available filters we have on our Security Analytics GraphQL dataset, we can have the AI model turn a request such as “Compare attack traffic on /api and /admin endpoints” into a matching set of structured filters:

```
[
  {“name”: “Attack Traffic on /api”, “filters”: [{“key”: “clientRequestPath”, “operator”: “eq”, “value”: “/api”}, {“key”: “wafAttackScoreClass”, “operator”: “eq”, “value”: “attack”}]},
  {“name”: “Attack Traffic on /admin”, “filters”: [{“key”: “clientRequestPath”, “operator”: “eq”, “value”: “/admin”}, {“key”: “wafAttackScoreClass”, “operator”: “eq”, “value”: “attack”}]}
]
```

Then, using the filters provided by the AI model, we can make requests to our GraphQL APIs, gather the requisite data, and plot a data visualization to answer the customer query.

By using this method, we are able to keep customer information private and avoid exposing any security analytics data to the AI model itself, while still allowing humans to query their data with ease. This ensures that your queries will never be used to train the model. And because Workers AI hosts a local instance of the LLM on Cloudflare’s own network, your queries and resulting data never leave Cloudflare’s network.

Future Development

We are in the early stages of developing this capability and plan to rapidly extend the capabilities of the Security Analytics AI Assistant. Don’t be surprised if we cannot handle some of your requests at the beginning. At launch, we are able to support basic inquiries that can be plotted in a time series chart such as “show me” or “compare” for any currently filterable fields.

However, we realize there are a number of use cases that we haven’t even thought of, and we are excited to release the Beta version of AI Assistant to all Business and Enterprise customers to let you test the feature and see what you can do with it. We would love to hear your feedback and learn more about what you find useful and what you would like to see in it next. With future versions, you’ll be able to ask questions such as “Did I experience any attacks yesterday?” and use AI to automatically generate WAF rules for you to apply to mitigate them.

Beta availability

Starting today, AI Assistant is available for a select few users and rolling out to all Business and Enterprise customers throughout March. Look out for it and try for free and let us know what you think by using the Feedback link at the top of the Security Analytics page.

Final pricing will be determined prior to general availability.

Cloudflare announces Firewall for AI

2024-03-04 Daniele Molteni

Post Syndicated from Daniele Molteni original https://blog.cloudflare.com/firewall-for-ai

Today, Cloudflare is announcing the development of Firewall for AI, a protection layer that can be deployed in front of Large Language Models (LLMs) to identify abuses before they reach the models.

While AI models, and specifically LLMs, are surging, customers tell us that they are concerned about the best strategies to secure their own LLMs. Using LLMs as part of Internet-connected applications introduces new vulnerabilities that can be exploited by bad actors.

Some of the vulnerabilities affecting traditional web and API applications apply to the LLM world as well, including injections or data exfiltration. However, there is a new set of threats that are now relevant because of the way LLMs work. For example, researchers have recently discovered a vulnerability in an AI collaboration platform that allows them to hijack models and perform unauthorized actions.

Firewall for AI is an advanced Web Application Firewall (WAF) specifically tailored for applications using LLMs. It will comprise a set of tools that can be deployed in front of applications to detect vulnerabilities and provide visibility to model owners. The tool kit will include products that are already part of WAF, such as Rate Limiting and Sensitive Data Detection, and a new protection layer which is currently under development. This new validation analyzes the prompt submitted by the end user to identify attempts to exploit the model to extract data and other abuse attempts. Leveraging the size of Cloudflare network, Firewall for AI runs as close to the user as possible, allowing us to identify attacks early and protect both end user and models from abuses and attacks.

Before we dig into how Firewall for AI works and its full feature set, let’s first examine what makes LLMs unique, and the attack surfaces they introduce. We’ll use the OWASP Top 10 for LLMs as a reference.

Why are LLMs different from traditional applications?

When considering LLMs as Internet-connected applications, there are two main differences compared with more traditional web apps.

First, the way users interact with the product. Traditional apps are deterministic in nature. Think about a bank application — it’s defined by a set of operations (check my balance, make a transfer, etc.). The security of the business operation (and data) can be obtained by controlling the fine set of operations accepted by these endpoints: “GET /balance” or “POST /transfer”.

LLM operations are non-deterministic by design. To start with, LLM interactions are based on natural language, which makes identifying problematic requests harder than matching attack signatures. Additionally, unless a response is cached, LLMs typically provide a different response every time — even if the same input prompt is repeated. This makes limiting the way a user interacts with the application much more difficult. This poses a threat to the user as well, in terms of being exposed to misinformation that weakens the trust in the model.

Second, a big difference is how the application control plane interacts with the data. In traditional applications, the control plane (code) is well separated from the data plane (database). The defined operations are the only way to interact with the underlying data (e.g. show me the history of my payment transactions). This allows security practitioners to focus on adding checks and guardrails to the control plane and thus protecting the database indirectly.

LLMs are different in that the training data becomes part of the model itself through the training process, making it extremely difficult to control how that data is shared as a result of a user prompt. Some architectural solutions are being explored, such as separating LLMs into different levels and segregating data. However, no silver bullet has yet been found.

From a security perspective, these differences allow attackers to craft new attack vectors that can target LLMs and fly under the radar of existing security tools designed for traditional web applications.

OWASP LLM Vulnerabilities

The OWASP foundation released a list of the top 10 classes of vulnerabilities for LLMs, providing a useful framework for thinking about how to secure language models. Some of the threats are reminiscent of the OWASP top 10 for web applications, while others are specific to language models.

Similar to web applications, some of these vulnerabilities can be best addressed when the LLM application is designed, developed, and trained. For example, Training Data Poisoning can be carried out by introducing vulnerabilities in the training data set used to train new models. Poisoned information is then presented to the user when the model is live. Supply Chain Vulnerabilities and Insecure Plugin Design are vulnerabilities introduced in components added to the model, like third-party software packages. Finally, managing authorization and permissions is crucial when dealing with Excessive Agency, where unconstrained models can perform unauthorized actions within the broader application or infrastructure.

Conversely, Prompt Injection, Model Denial of Service, and Sensitive Information Disclosure can be mitigated by adopting a proxy security solution like Cloudflare Firewall for AI. In the following sections, we will give more details about these vulnerabilities and discuss how Cloudflare is optimally positioned to mitigate them.

LLM deployments

Language model risks also depend on the deployment model. Currently, we see three main deployment approaches: internal, public, and product LLMs. In all three scenarios, you need to protect models from abuses, protect any proprietary data stored in the model, and protect the end user from misinformation or from exposure to inappropriate content.

Internal LLMs: Companies develop LLMs to support the workforce in their daily tasks. These are considered corporate assets and shouldn’t be accessed by non-employees. Examples include an AI co-pilot trained on sales data and customer interactions used to generate tailored proposals, or an LLM trained on an internal knowledge base that can be queried by engineers.
Public LLMs: These are LLMs that can be accessed outside the boundaries of a corporation. Often these solutions have free versions that anyone can use and they are often trained on general or public knowledge. Examples include GPT from OpenAI or Claude from Anthropic.
Product LLM: From a corporate perspective, LLMs can be part of a product or service offered to their customers. These are usually self-hosted, tailored solutions that can be made available as a tool to interact with the company resources. Examples include customer support chatbots or Cloudflare AI Assistant.

From a risk perspective, the difference between Product and Public LLMs is about who carries the impact of successful attacks. Public LLMs are considered a threat to data because data that ends up in the model can be accessed by virtually anyone. This is one of the reasons many corporations advise their employees not to use confidential information in prompts for publicly available services. Product LLMs can be considered a threat to companies and their intellectual property if models had access to proprietary information during training (by design or by accident).

Firewall for AI

Cloudflare Firewall for AI will be deployed like a traditional WAF, where every API request with an LLM prompt is scanned for patterns and signatures of possible attacks.

Firewall for AI can be deployed in front of models hosted on the Cloudflare Workers AI platform or models hosted on any other third party infrastructure. It can also be used alongside Cloudflare AI Gateway, and customers will be able to control and set up Firewall for AI using the WAF control plane.

*Firewall for AI works like a traditional web application firewall. It is deployed in front of an LLM application and scans every request to identify attack signatures*

Prevent volumetric attacks

One of the threats listed by OWASP is Model Denial of Service. Similar to traditional applications, a DoS attack is carried out by consuming an exceptionally high amount of resources, resulting in reduced service quality or potentially increasing the costs of running the model. Given the amount of resources LLMs require to run, and the unpredictability of user input, this type of attack can be detrimental.

This risk can be mitigated by adopting rate limiting policies that control the rate of requests from individual sessions, therefore limiting the context window. By proxying your model through Cloudflare today, you get DDoS protection out of the box. You can also use Rate Limiting and Advanced Rate Limiting to manage the rate of requests allowed to reach your model by setting a maximum rate of request performed by an individual IP address or API key during a session.

Identify sensitive information with Sensitive Data Detection

There are two use cases for sensitive data, depending on whether you own the model and data, or you want to prevent users from sending data into public LLMs.

As defined by OWASP, Sensitive Information Disclosure happens when LLMs inadvertently reveal confidential data in the responses, leading to unauthorized data access, privacy violations, and security breaches. One way to prevent this is to add strict prompt validations. Another approach is to identify when personally identifiable information (PII) leaves the model. This is relevant, for example, when a model was trained with a company knowledge base that may include sensitive information, such asPII (like social security number), proprietary code, or algorithms.

Customers using LLM models behind Cloudflare WAF can employ the Sensitive Data Detection (SDD) WAF managed ruleset to identify certain PII being returned by the model in the response. Customers can review the SDD matches on WAF Security Events. Today, SDD is offered as a set of managed rules designed to scan for financial information (such as credit card numbers) as well as secrets (API keys). As part of the roadmap, we plan to allow customers to create their own custom fingerprints.

The other use case is intended to prevent users from sharing PII or other sensitive information with external LLM providers, such as OpenAI or Anthropic. To protect from this scenario, we plan to expand SDD to scan the request prompt and integrate its output with AI Gateway where, alongside the prompt’s history, we detect if certain sensitive data has been included in the request. We will start by using the existing SDD rules, and we plan to allow customers to write their own custom signatures. Relatedly, obfuscation is another feature we hear a lot of customers talk about. Once available, the expanded SDD will allow customers to obfuscate certain sensitive data in a prompt before it reaches the model. SDD on the request phase is being developed.

Preventing model abuses

Model abuse is a broader category of abuse. It includes approaches like “prompt injection” or submitting requests that generate hallucinations or lead to responses that are inaccurate, offensive, inappropriate, or simply off-topic.

Prompt Injection is an attempt to manipulate a language model through specially crafted inputs, causing unintended responses by the LLM. The results of an injection can vary, from extracting sensitive information to influencing decision-making by mimicking normal interactions with the model. A classic example of prompt injection is manipulating a CV to affect the output of resume screening tools.

A common use case we hear from customers of our AI Gateway is that they want to avoid their application generating toxic, offensive, or problematic language. The risks of not controlling the outcome of the model include reputational damage and harming the end user by providing an unreliable response.

These types of abuse can be managed by adding an additional layer of protection that sits in front of the model. This layer can be trained to block injection attempts or block prompts that fall into categories that are inappropriate.

Prompt and response validation

Firewall for AI will run a series of detections designed to identify prompt injection attempts and other abuses, such as making sure the topic stays within the boundaries defined by the model owner. Like other existing WAF features, Firewall for AI will automatically look for prompts embedded in HTTP requests or allow customers to create rules based on where in the JSON body of the request the prompt can be found.

Once enabled, the Firewall will analyze every prompt and provide a score based on the likelihood that it’s malicious. It will also tag the prompt based on predefined categories. The score ranges from 1 to 99 which indicates the likelihood of a prompt injection, with 1 being the most likely.

Customers will be able to create WAF rules to block or handle requests with a particular score in one or both of these dimensions. You’ll be able to combine this score with other existing signals (like bot score or attack score) to determine whether the request should reach the model or should be blocked. For example, it could be combined with a bot score to identify if the request was malicious and generated by an automated source.

*Detecting prompt injections and prompt abuse is part of the scope of Firewall for AI. Early iteration of the product design*

Besides the score, we will assign tags to each prompt that can be used when creating rules to prevent prompts belonging to any of these categories from reaching their model. For example, customers will be able to create rules to block specific topics. This includes prompts using words categorized as offensive, or linked to religion, sexual content, or politics, for example.

How can I use Firewall for AI? Who gets this?

Enterprise customers on the Application Security Advanced offering can immediately start using Advanced Rate Limiting and Sensitive Data Detection (on the response phase). Both products can be found in the WAF section of the Cloudflare dashboard. Firewall for AI’s prompt validation feature is currently under development and a beta version will be released in the coming months to all Workers AI users. Sign up to join the waiting list and get notified when the feature becomes available.

Conclusion

Cloudflare is one of the first security providers launching a set of tools to secure AI applications. Using Firewall for AI, customers can control what prompts and requests reach their language models, reducing the risk of abuses and data exfiltration. Stay tuned to learn more about how AI application security is evolving.

AWS Step Functions Workflow Studio is now available in AWS Application Composer

2023-11-28 Donnie Prakoso

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/aws-step-functions-workflow-studio-is-now-available-in-aws-application-composer/

Today, we’re announcing that AWS Step Functions Workflow Studio is now available in AWS Application Composer. This new integration brings together the development of workflows and application resources into a unified visual infrastructure as code (IaC) builder.

Now, you can have a seamless transition between authoring workflows with AWS Step Functions Workflow Studio and defining resources with AWS Application Composer. This announcement allows you to create and manage all resources at any stage of your development journey. You can visualize the full application in AWS Application Composer, then zoom into the workflow details with AWS Step Functions Workflow Studio—all within a single interface.

Seamlessly build workflow and modern application
To help you design and build modern applications, we launched AWS Application Composer in March 2023. With AWS Application Composer, you can use a visual builder to compose and configure serverless applications from AWS services backed by deployment-ready IaC.

In various use cases of building modern applications, you may also need to orchestrate microservices, automate mission-critical business processes, create event-driven applications that respond to infrastructure changes, or build machine learning (ML) pipelines. To solve these challenges, you can use AWS Step Functions, a fully managed service that makes it easier to coordinate distributed application components using visual workflows. To simplify workflow development, in 2021 we introduced AWS Step Functions Workflow Studio, a low-code visual tool for rapid workflow prototyping and development across 12,000+ API actions from over 220 AWS services.

While AWS Step Functions Workflow Studio brings simplicity to building workflows, customers that want to deploy workflows using IaC had to manually define their state machine resource and migrate their workflow definitions to the IaC template.

Better together: AWS Step Functions Workflow Studio in AWS Application Composer
With this new integration, you can now design AWS Step Functions workflows in AWS Application Composer using a drag-and-drop interface. This accelerates the path from prototyping to production deployment and iterating on existing workflows.

You can start by composing your modern application with AWS Application Composer. Within the canvas, you can add a workflow by adding an AWS Step Functions state machine resource. This new capability provides you with the ability to visually design and build a workflow with an intuitive interface to connect workflow steps to resources.

How it works
Let me walk you through how you can use AWS Step Functions Workflow Studio in AWS Application Composer. For this demo, let’s say that I need to improve handling e-commerce transactions by building a workflow and integrating with my existing serverless APIs.

First, I navigate to AWS Application Composer. Because I already have an existing project that includes application code and IaC templates from AWS Application Composer, I don’t need to build anything from scratch.

I open the menu and select Project folder to open the files in my local development machine.

Then, I select the path of my local folder, and AWS Application Composer automatically detects the IaC template that I currently have.

Then, AWS Application Composer visualizes the diagram in the canvas. What I really like about using this approach is that AWS Application Composer activates Local sync mode, which automatically syncs and saves any changes in IaC templates into my local project.

Here, I have a simple serverless API running on Amazon API Gateway, which invokes an AWS Lambda function and integrates with Amazon DynamoDB.

Now, I’m ready to make some changes to my serverless API. I configure another route on Amazon API Gateway and add AWS Step Functions state machine to start building my workflow.

When I configure my Step Functions state machine, I can start editing my workflow by selecting Edit in Workflow Studio.

This opens Step Functions Workflow Studio within the AWS Application Composer canvas. I have the same experience as Workflow Studio in the AWS Step Functions console. I can use the canvas to add actions, flows , and patterns into my Step Functions state machine.

I start building my workflow, and here’s the result that I exported using Export PNG image in Workflow Studio.

But here’s where this new capability really helps me as a developer. In the workflow definition, I use various AWS resources, such as AWS Lambda functions and Amazon DynamoDB. If I need to reference the AWS resources I defined in AWS Application Composer, I can use an AWS CloudFormation substitution.

With AWS CloudFormation substitutions, I can add a substitution using an AWS CloudFormation convention, which is a dynamic reference to a value that is provided in the IaC template. I am using a placeholder substitution here so I can map it with an AWS resource in the AWS Application Composer canvas in a later step.

I can also define the AWS CloudFormation substitution for my Amazon DynamoDB table.

At this stage, I’m happy with my workflow. To review the Amazon States Language as my AWS Step Functions state machine definition, I can also open the Code tab. Now I don’t need to manually copy and paste this definition into IaC templates. I only need to save my work and choose Return to Application Composer.

Here, I can see that my AWS Step Functions state machine is updated both in the visual diagram and in the state machine definition section.

If I scroll down, I will find AWS Cloudformation Definition Substitutions for resources that I defined in Workflow Studio. I can manually replace the mapping here, or I can use the canvas.

To use the canvas, I simply drag and drop the respective resources in my Step Functions state machine and in the Application Composer canvas. Here, I connect the Inventory Process task state with a new AWS Lambda function. Also, my Step Functions state machine tasks can reference existing resources.

When I choose Template, the state machine definition is integrated with other AWS Application Composer resources. With this IaC template I can easily deploy using AWS Serverless Application Model Command Line Interface (AWS SAM CLI) or CloudFormation.

Things to know
Here is some additional information for you:

Pricing – The AWS Step Functions Workflow Studio in AWS Application Composer comes at no additional cost.

Availability – This feature is available in all AWS Regions where Application Composer is available.

AWS Step Functions Workflow Studio in AWS Application Composer provides you with an easy-to-use experience to integrate your workflow into modern applications. Get started and learn more about this feature on the AWS Application Composer page.

Happy building!
— Donnie

External endpoints and testing of task states now available in AWS Step Functions

2023-11-27 Marcia Villalba

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/external-endpoints-and-testing-of-task-states-now-available-in-aws-step-functions/

Now AWS Step Functions HTTPS endpoints let you integrate third-party APIs and external services to your workflows. HTTPS endpoints provide a simpler way of making calls to external APIs and integrating with existing SaaS providers, like Stripe for handling payments, GitHub for code collaboration and repository management, and Salesforce for sales and marketing insights. Before this launch, customers needed to use an AWS Lambda function to call the external endpoint, handling authentication and errors directly from the code.

Also, we are announcing a new capability to test your task states individually without the need to deploy or execute the state machine.

AWS Step Functions is a visual workflow service that makes it easy for developers to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines. Step Functions integrates with over 220 AWS services and provides features that help developers build, such as built-in error handling, real-time and auditable workflow execution history, and large-scale parallel processing.

HTTPS endpoints
HTTPS endpoints are a new resource for your task states that allow you to connect to third-party HTTP targets outside AWS. Step Functions invokes the HTTP endpoint, deliver a request body, headers, and parameters, and get a response from the third-party services. You can use any preferred HTTP method, such as GET or POST.

HTTPS endpoints use Amazon EventBridge connections to manage the authentication credentials for the target. This defines the authorization type used, which can be a basic authentication with a username and password, an API key, or OAuth. EventBridge connections use AWS Secrets Manager to store the secret. This keeps the secrets out of the state machine, reducing the risks of accidentally exposing your secrets in logs or in the state machine definition.

Getting started with HTTPS endpoints
To get started with HTTPS endpoints, first you need to create an EventBridge connection. Then you need to create a new AWS Identity and Access Management (IAM) role and give permissions so your state machine can access the connection resource, get the secret from Secrets Manager, and get permissions to invoke an HTTP endpoint.

Here are the policies that you need to include in your state machine execution role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret"
            ],
            "Resource": "arn:aws:secretsmanager:*:*:secret:events!connection/*"
        }
    ]
}
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RetrieveConnectionCredentials",
            "Effect": "Allow",
            "Action": [
                "events:RetrieveConnectionCredentials"
            ],
            "Resource": [
                "arn:aws:events:us-east-2:123456789012:connection/oauth_connection/aeabd89e-d39c-4181-9486-9fe03e6f286a"
            ]
        }
    ]
}
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "InvokeHTTPEndpoint",
            "Effect": "Allow",
            "Action": [
                "states:InvokeHTTPEndpoint"
            ],
            "Resource": [
                "arn:aws:states:us-east-2:123456789012:stateMachine:myStateMachine"
            ]
        }
    ]
}

After you have everything ready, you can create your state machine. In your state machine, add a new task state to call a third-party API. You can configure the API endpoint to point to the third-party URL you need, set the correct HTTP method, pick the connection Amazon Resource Name (ARN) for the connection you created previously as the authentication for that endpoint, and provide a request body if needed. In addition, all these parameters can be set dynamically at runtime from the state JSON input.

Now, making external requests with Step Functions is easy, and you can take advantage of all the configurations that Step Functions provides to handle errors, such as retries for transient errors or momentary service unavailability, and redrive for errors that require longer investigation or resolution time.

Test state
To accelerate feedback cycles, we are also announcing a new capability to test individual states. This new feature allows you to test states independently from the execution of your workflow. This is particularly useful for testing endpoints configuration. You can change the input and test the different scenarios without the need to deploy your workflow or execute the whole state machine. This new feature is available in all task, choice, and pass states.

You will see the testing capability in the Step Functions Workflow Studio when you select a task.

When you choose the Test state, you will be redirected to a different view where you can test the task state. You can test that the state machine role has the right permissions, the endpoint you want to call is correctly configured, and verify that the data manipulations work as expected.

Availability
Now, with all the features that Step Functions provides, it’s never been easier to build state machines that can solve a wide variety of problems, like payment flows, workflows with manual inputs, and integration to legacy systems. Using Step Functions HTTPS endpoints, you can directly integrate with popular payment platforms while ensuring that your users’ credit cards are only charged once and errors are handled automatically. In addition, you can test this new integration even before you deploy the state machine using the new test state feature.

These new features are available in all AWS Regions except Asia Pacific (Hyderabad), Asia Pacific (Melbourne), AWS Israel (Tel Aviv), China, and GovCloud Regions.

To get started you can try the “Generate Invoices using Stripe” sample project from Step Functions in the AWS Managment Console or check out the AWS Step Functions Developer Guide to learn more.

— Marcia

Build generative AI apps using AWS Step Functions and Amazon Bedrock

2023-11-26 Marcia Villalba

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/build-generative-ai-apps-using-aws-step-functions-and-amazon-bedrock/

Today we are announcing two new optimized integrations for AWS Step Functions with Amazon Bedrock. Step Functions is a visual workflow service that helps developers build distributed applications, automate processes, orchestrate microservices, and create data and machine learning (ML) pipelines.

In September, we made available Amazon Bedrock, the easiest way to build and scale generative artificial intelligence (AI) applications with foundation models (FMs). Bedrock offers a choice of foundation models from leading providers like AI21 Labs, Anthropic, Cohere, Stability AI, and Amazon, along with a broad set of capabilities that customers need to build generative AI applications, while maintaining privacy and security. You can use Amazon Bedrock from the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS SDKs.

The new Step Functions optimized integrations with Amazon Bedrock allow you to orchestrate tasks to build generative AI applications using Amazon Bedrock, as well as to integrate with over 220 AWS services. With Step Functions, you can visually develop, inspect, and audit your workflows. Previously, you needed to invoke an AWS Lambda function to use Amazon Bedrock from your workflows, adding more code to maintain them and increasing the costs of your applications.

Step Functions provides two new optimized API actions for Amazon Bedrock:

InvokeModel – This integration allows you to invoke a model and run the inferences with the input provided in the parameters. Use this API action to run inferences for text, image, and embedding models.
CreateModelCustomizationJob – This integration creates a fine-tuning job to customize a base model. In the parameters, you specify the foundation model and the location of the training data. When the job is completed, your custom model is ready to be used. This is an asynchronous API, and this integration allows Step Functions to run a job and wait for it to complete before proceeding to the next state. This means that the state machine execution will pause while the create model customization job is running and will resume automatically when the task is complete.

The InvokeModel API action accepts requests and responses that are up to 25 MB. However, Step Functions has a 256 kB limit on state payload input and output. In order to support larger payloads with this integration, you can define an Amazon Simple Storage Service (Amazon S3) bucket where the InvokeModel API reads data from and writes the result to. These configurations can be provided in the parameters section of the API action configuration parameters section.

How to get started with Amazon Bedrock and AWS Step Functions
Before getting started, ensure that you create the state machine in a Region where Amazon Bedrock is available. For this example, use US East (N. Virginia), us-east-1.

From the AWS Management Console, create a new state machine. Search for “bedrock,” and the two available API actions will appear. Drag the InvokeModel to the state machine.

You can now configure that state in the menu on the right. First, you can define which foundation model you want to use. Pick a model from the list, or get the model dynamically from the input.

Then you need to configure the model parameters. You can enter the inference parameters in the text box or load the parameters from Amazon S3.

If you keep scrolling in the API action configuration, you can specify additional configuration options for the API, such as the S3 destination bucket. When this field is specified, the API action stores the API response in the specified bucket instead of returning it to the state output. Here, you can also specify the content type for the requests and responses.

When you finish configuring your state machine, you can create and run it. When the state machine runs, you can visualize the execution details, select the Amazon Bedrock state, and check its inputs and outputs.

Using Step Functions, you can build state machines as extensively as you need, combining different services to solve many problems. For example, you can use Step Functions with Amazon Bedrock to create applications using prompt chaining. This is a technique for building complex generative AI applications by passing multiple smaller and simpler prompts to the FM instead of a very long and detailed prompt. To build a prompt chain, you can create a state machine that calls Amazon Bedrock multiple times to get an inference for each of the smaller prompts. You can use the parallel state to run all these tasks in parallel and then use an AWS Lambda function that unifies the responses of the parallel tasks into one response and generates a result.

Available now
AWS Step Functions optimized integrations for Amazon Bedrock are limited to the AWS Regions where Amazon Bedrock is available.

You can get started with Step Functions and Amazon Bedrock by trying out a sample project from the Step Functions console.

– Marcia

Learn how to streamline and secure your SaaS applications at AWS Applications Innovation Day

2023-06-20 Phil Goldstein

Post Syndicated from Phil Goldstein original https://aws.amazon.com/blogs/aws/learn-how-to-streamline-and-secure-your-saas-applications-at-aws-applications-innovation-day/

Companies continue to adopt software as a service (SaaS) applications at a rapid clip, with recent research showing that the average SaaS portfolio now has at least 200 applications. While organizations purchase these purpose-built tools to make their employees more productive, they now must contend with growing security complexities, context switching, and data silos.

If your company faces these issues, or you want to avoid them in the future, join us on Tuesday, June 27, for a free-to-attend online event AWS Applications Innovation Day. AWS will stream the event simultaneously across multiple platforms, including LinkedIn Live, Twitter, YouTube, and Twitch. You can also join us in person in Seattle to hear from Dilip Kumar, Vice President of AWS Applications and an executive panel with AWS Partners Splunk, Asana, and Okta.

Applications Innovation Day is designed to give you the tools you need to improve how your organization uses and secures SaaS applications. Sessions throughout the day will show you how you can secure data while providing your employees with the best tools for the job. You’ll also learn how to support the right mix of applications to improve workforce collaboration, and how to use generative artificial intelligence securely and effectively to improve insights and enhance employee productivity.

We’ll start the virtual broadcast with a keynote from Dilip Kumar, Vice President of AWS Applications, who will discuss the way we use and govern SaaS applications at AWS. He’ll also discuss how we’ll make it easier to deploy purpose-built SaaS applications like Asana, Okta, Splunk, Zoom, and others across your business, including the announcement of some exciting new innovations from AWS.

AWS product leaders will present technical breakout sessions during the day on the productivity and security aspects of managing a SaaS application tech stack. Sessions will cover a wide range of topics, including how the nature of productivity at work is changing, how AI is transforming SaaS applications and collaboration, how you can improve your security observability across your applications, and how you can create custom analytics on SaaS application activity.

Overall, the event is a great opportunity for security leaders, IT administrators and operations leaders, and anyone leading digital workplace and transformation initiatives to learn how to better leverage and govern SaaS applications.

To register for AWS Applications Innovation Day, simply go to the event page.

How CyberCRX cut ML processing time from 8 days to 56 minutes with AWS Step Functions Distributed Map

2023-04-27 Marcia Villalba

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/how-cybercrx-cut-ml-processing-time-from-8-days-to-56-minutes-with-aws-step-functions-distributed-map/

Last December, Sébastien Stormacq wrote about the availability of a distributed map state for AWS Step Functions, a new feature that allows you to orchestrate large-scale parallel workloads in the cloud. That’s when Charles Burton, a data systems engineer for a company called CyberGRX, found out about it and refactored his workflow, reducing the processing time for his machine learning (ML) processing job from 8 days to 56 minutes. Before, running the job required an engineer to constantly monitor it; now, it runs in less than an hour with no support needed. In addition, the new implementation with AWS Step Functions Distributed Map costs less than what it did originally.

What CyberGRX achieved with this solution is a perfect example of what serverless technologies embrace: letting the cloud do as much of the undifferentiated heavy lifting as possible so the engineers and data scientists have more time to focus on what’s important for the business. In this case, that means continuing to improve the model and the processes for one of the key offerings from CyberGRX, a cyber risk assessment of third parties using ML insights from its large and growing database.

What’s the business challenge?
CyberGRX shares third-party cyber risk (TPCRM) data with their customers. They predict, with high confidence, how a third-party company will respond to a risk assessment questionnaire. To do this, they have to run their predictive model on every company in their platform; they currently have predictive data on more than 225,000 companies. Whenever there’s a new company or the data changes for a company, they regenerate their predictive model by processing their entire dataset. Over time, CyberGRX data scientists improve the model or add new features to it, which also requires the model to be regenerated.

The challenge is running this job for 225,000 companies in a timely manner, with as few hands-on resources as possible. The job runs a set of operations for each company, and every company calculation is independent of other companies. This means that in the ideal case, every company can be processed at the same time. However, implementing such a massive parallelization is a challenging problem to solve.

First iteration
With that in mind, the company built their first iteration of the pipeline using Kubernetes and Argo Workflows, an open-source container-native workflow engine for orchestrating parallel jobs on Kubernetes. These were tools they were familiar with, as they were already using them in their infrastructure.

But as soon as they tried to run the job for all the companies on the platform, they ran up against the limits of what their system could handle efficiently. Because the solution depended on a centralized controller, Argo Workflows, it was not robust, and the controller was scaled to its maximum capacity during this time. At that time, they only had 150,000 companies. And running the job with all of the companies took around 8 days, during which the system would crash and need to be restarted. It was very labor intensive, and it always required an engineer on call to monitor and troubleshoot the job.

The tipping point came when Charles joined the Analytics team at the beginning of 2022. One of his first tasks was to do a full model run on approximately 170,000 companies at that time. The model run lasted the whole week and ended at 2:00 AM on a Sunday. That’s when he decided their system needed to evolve.

Second iteration
With the pain of the last time he ran the model fresh in his mind, Charles thought through how he could rewrite the workflow. His first thought was to use AWS Lambda and SQS, but he realized that he needed an orchestrator in that solution. That’s why he chose Step Functions, a serverless service that helps you automate processes, orchestrate microservices, and create data and ML pipelines; plus, it scales as needed.

Charles got the new version of the workflow with Step Functions working in about 2 weeks. The first step he took was adapting his existing Docker image to run in Lambda using Lambda’s container image packaging format. Because the container already worked for his data processing tasks, this update was simple. He scheduled Lambda provisioned concurrency to make sure that all functions he needed were ready when he started the job. He also configured reserved concurrency to make sure that Lambda would be able to handle this maximum number of concurrent executions at a time. In order to support so many functions executing at the same time, he raised the concurrent execution quota for Lambda per account.

And to make sure that the steps were run in parallel, he used Step Functions and the map state. The map state allowed Charles to run a set of workflow steps for each item in a dataset. The iterations run in parallel. Because Step Functions map state offers 40 concurrent executions and CyberGRX needed more parallelization, they created a solution that launched multiple state machines in parallel; in this way, they were able to iterate fast across all the companies. Creating this complex solution, required a preprocessor that handled the heuristics of the concurrency of the system and split the input data across multiple state machines.

This second iteration was already better than the first one, as now it was able to finish the execution with no problems, and it could iterate over 200,000 companies in 90 minutes. However, the preprocessor was a very complex part of the system, and it was hitting the limits of the Lambda and Step Functions APIs due to the amount of parallelization.

Third and final iteration
Then, during AWS re:Invent 2022, AWS announced a distributed map for Step Functions, a new type of map state that allows you to write Step Functions to coordinate large-scale parallel workloads. Using this new feature, you can easily iterate over millions of objects stored in Amazon Simple Storage Service (Amazon S3), and then the distributed map can launch up to 10,000 parallel sub-workflows to process the data.

When Charles read in the News Blog article about the 10,000 parallel workflow executions, he immediately thought about trying this new state. In a couple of weeks, Charles built the new iteration of the workflow.

Because the distributed map state split the input into different processors and handled the concurrency of the different executions, Charles was able to drop the complex preprocessor code.

The new process was the simplest that it’s ever been; now whenever they want to run the job, they just upload a file to Amazon S3 with the input data. This action triggers an Amazon EventBridge rule that targets the state machine with the distributed map. The state machine then executes with that file as an input and publishes the results to an Amazon Simple Notification Service (Amazon SNS) topic.

What was the impact?
A few weeks after completing the third iteration, they had to run the job on all 227,000 companies in their platform. When the job finished, Charles’ team was blown away; the whole process took only 56 minutes to complete. They estimated that during those 56 minutes, the job ran more than 57 billion calculations.

The following image shows an Amazon CloudWatch graph of the concurrent executions for one Lambda function during the time that the workflow was running. There are almost 10,000 functions running in parallel during this time.

Simplifying and shortening the time to run the job opens a lot of possibilities for CyberGRX and the data science team. The benefits started right away the moment one of the data scientists wanted to run the job to test some improvements they had made for the model. They were able to run it independently without requiring an engineer to help them.

And, because the predictive model itself is one of the key offerings from CyberGRX, the company now has a more competitive product since the predictive analysis can be refined on a daily basis.

Learn more about using AWS Step Functions:

You can also check the Serverless Workflows Collection that we have available in Serverless Land for you to test and learn more about this new capability.

— Marcia

Serverless ICYMI Q4 2022

2023-01-03 Marcia Villalba

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/compute/serverless-icymi-q4-2022/

Welcome to the 20th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!In case you missed our last ICYMI, check out what happened last quarter here.

AWS Lambda

For developers using Java, AWS Lambda has introduced Lambda SnapStart. SnapStart is a new capability that can improve the start-up performance of functions using Corretto (java11) runtime by up to 10 times, at no extra cost.

To use this capability, you must enable it in your function and then publish a new version. This triggers the optimization process. This process initializes the function, takes an immutable, encrypted snapshot of the memory and disk state, and caches it for reuse. When the function is invoked, the state is retrieved from the cache in chunks, on an as-needed basis, and it is used to populate the execution environment.

The ICYMI: Serverless pre:Invent 2022 post shares some of the launches for Lambda before November 21, like the support of Lambda functions using Node.js 18 as a runtime, the Lambda Telemetry API, and new .NET tooling to support .NET 7 applications.

Also, now Amazon Inspector supports Lambda functions. You can enable Amazon Inspector to scan your functions continually for known vulnerabilities. The log4j vulnerability shows how important it is to scan your code for vulnerabilities continuously, not only after deployment. Vulnerabilities can be discovered at any time, and with Amazon Inspector, your functions and layers are rescanned whenever a new vulnerability is published.

AWS Step Functions

There were many new launches for AWS Step Functions, like intrinsic functions, cross-account access capabilities, and the new executions experience for Express Workflows covered in the pre:Invent post.

During AWS re:Invent this year, we announced Step Functions Distributed Map. If you need to process many files, or items inside CSV or JSON files, this new flow can help you. The new distributed map flow orchestrates large-scale parallel workloads.

This feature is optimized for files stored in Amazon S3. You can either process in parallel multiple files stored in a bucket, or process one large JSON or CSV file, in which each line contains an independent item. For example, you can convert a video file into multiple .gif animations using a distributed map, or process over 37 GB of aggregated weather data to find the highest temperature of the day.

Amazon EventBridge

Amazon EventBridge launched two major features: Scheduler and Pipes. Amazon EventBridge Scheduler allows you to create, run, and manage scheduled tasks at scale. You can schedule one-time or recurring tasks across 270 services and over 6.000 APIs.

Amazon EventBridge Pipes allows you to create point-to-point integrations between event producers and consumers. With Pipes you can now connect different sources, like Amazon Kinesis Data Streams, Amazon DynamoDB Streams, Amazon SQS, Amazon Managed Streaming for Apache Kafka, and Amazon MQ to over 14 targets, such as Step Functions, Kinesis Data Streams, Lambda, and others. It not only allows you to connect these different event producers to consumers, but also provides filtering and enriching capabilities for events.

EventBridge now supports enhanced filtering capabilities including:

Matching against characters at the end of a value (suffix filtering)
Ignoring case sensitivity (equals-ignore-case)
OR matching: A single rule can match if any conditions across multiple separate fields are true.

It’s now also simpler to build rules, and you can generate AWS CloudFormation from the console pages and generate event patterns from a schema.

AWS Serverless Application Model (AWS SAM)

There were many announcements for AWS SAM during this quarter summarized in the ICMYI: Serverless pre:Invent 2022 post, like AWS SAM Connectors, SAM CLI Pipelines now support OpenID Connect Protocol, and AWS SAM CLI Terraform support.

AWS Application Composer

AWS Application Composer is a new visual designer that you can use to build serverless applications using multiple AWS services. This is ideal if you want to build a prototype, review with others architectures, generate diagrams for your projects, or onboard new team members to a project.

Within a simple user interface, you can drag and drop the different AWS resources and configure them visually. You can use AWS Application Composer together with AWS SAM Accelerate to build and test your applications in the AWS Cloud.

AWS Serverless digital learning badges

The new AWS Serverless digital learning badges let you show your AWS Serverless knowledge and skills. This is a verifiable digital badge that is aligned with the AWS Serverless Learning Plan.

This badge proves your knowledge and skills for Lambda, Amazon API Gateway, and designing serverless applications. To earn this badge, you must score at least 80 percent on the assessment associated with the Learning Plan. Visit this link if you are ready to get started learning or just jump directly to the assessment.

News from other services:

Amazon SNS

Amazon SQS

AWS AppSync and AWS Amplify

Observability

AWS re:Invent 2022

AWS re:Invent was held in Las Vegas from November 28 to December 2, 2022. Werner Vogels, Amazon’s CTO, highlighted event-driven applications during his keynote. He stated that the world is asynchronous and showed how strange a synchronous world would be. During the keynote, he showcased Serverlesspresso as an example of an event-driven application. The Serverless DA team presented many breakouts, workshops, and chalk talks. Rewatch all our breakout content:

In addition, we brought Serverlesspresso back to Vegas. Serverlesspresso is a contactless, serverless order management system for a physical coffee bar. The architecture comprises several serverless apps that support an ordering process from a customer’s smartphone to a real espresso bar. The customer can check the virtual line, place an order, and receive a notification when their drink is ready for pickup.

Serverless blog posts

October

November

Nov 7 – Server-side rendering micro-frontends – the architecture
Nov 9 – Enriching operational events with AWS Serverless
Nov 10 – Introducing the AWS Lambda Telemetry API
Nov 10 – Introducing Amazon EventBridge Scheduler
Nov 15 – Better together: AWS SAM CLI and HashiCorp Terraform
Nov 15 – Building serverless .NET applications on AWS Lambda using .NET 7
Nov 17 – Introducing attribute-based access controls (ABAC) for Amazon SQS
Nov 18 – Node.js 18.x runtime now available in AWS Lambda
Nov 18 – Introducing cross-account access capabilities for AWS Step Functions
Nov 18 – Using the AWS Parameter and Secrets Lambda extension to cache parameters and secrets
Nov 18 – Introducing container, database, and queue utilization metrics for the Amazon MWAA environment
Nov 21 – ICYMI: Serverless pre:Invent 2022
Nov 22 – Introducing payload-based message filtering for Amazon SNS
Nov 29 – Starting up faster with AWS Lambda SnapStart
Nov 29 – Reducing Java cold starts on AWS Lambda functions with SnapStart
Nov 29 – Introducing new AWS Serverless digital learning badges

December

Dec 1 – Visualize and create your serverless workloads with AWS Application Composer
Dec 6 – Introducing Serverlesspresso Extensions
Dec 7 – Securing Lambda Function URLs using Amazon Cognito, Amazon CloudFront and AWS WAF
Dec 13 – Visualizing the impact of AWS Lambda code updates
Dec 14 – Chaos experiments using AWS Step Functions and AWS Fault Injection Simulator
Dec 15 – Architecture patterns for consuming private APIs cross-account

Videos

Serverless Office Hours – Tuesday 10 AM PT

Weekly live virtual office hours: In each session, we talk about a specific topic or technology related to serverless and open it up to helping with your real serverless challenges and issues. Ask us anything about serverless technologies and applications.

YouTube: youtube.com/serverlessland

Twitch: twitch.tv/aws

FooBar Serverless YouTube Channel

Marcia Villalba frequently publishes new videos on her popular FooBar Serverless YouTube channel.

October

Oct 1 – Pub/Sub – Amazon SNS – Exploring Event-Driven Patterns
Oct 13 – Event Streaming – Amazon Kinesis – Exploring Event-Driven Patterns

November

Nov 3 – Why to build a multi-Region serverless application and why not?
Nov 10 – Read and Write Data patterns for Multi-Region architectures
Nov 17 – Understanding AWS Lambda scaling and throughput
Nov 24 – Amazon EventBridge Scheduler | Schedule tasks over +200 targets!

December

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. If you want to learn more about event-driven architectures, read our new guide that will help you get started.

You can also follow the Serverless Developer Advocacy team on Twitter and LinkedIn to see the latest news, follow conversations, and interact with the team.

James Beswick: Twitter and LinkedIn
David Boyne: Twitter and LinkedIn
Eric Johnson: Twitter and LinkedIn
Ben Smith: Twitter and LinkedIn
Marcia Villalba: Twitter and LinkedIn
Julian Wood: Twitter and LinkedIn

For more serverless learning resources, visit Serverless Land.

Now — AWS Step Functions Supports 200 AWS Services To Enable Easier Workflow Automation

2021-09-30 Marcia Villalba

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/now-aws-step-functions-supports-200-aws-services-to-enable-easier-workflow-automation/

Today AWS Step Functions expands the number of supported AWS services from 17 to over 200 and AWS API Actions from 46 to over 9,000 with its new capability AWS SDK Service Integrations.

When developers build distributed architectures, one of the patterns they use is the workflow-based orchestration pattern. This pattern is helpful for workflow automation inside a service to perform distributed transactions. An example of a distributed transaction is all the tasks required to handle an order and keep track of the transaction status at all times.

Step Functions is a low-code visual workflow service used for workflow automation, to orchestrate services, and help you to apply this pattern. Developers use Step Functions with managed services such as Artificial Intelligence services, Amazon Simple Storage Service (Amazon S3), and Amazon DynamoDB.

Introducing Step Functions AWS SDK Service Integrations
Until today, when developers were building workflows that integrate with AWS services, they had to choose from the 46 supported services integrations that Step Functions provided. If the service integration was not available, they had to code the integration in an AWS Lambda function. This is not ideal as it added more complexity and costs to the application.

Now with Step Functions AWS SDK Service Integrations, developers can integrate their state machines directly to AWS service that has AWS SDK support.

You can create state machines that use AWS SDK Service Integrations with Amazon States Language (ASL), AWS Cloud Development Kit (AWS CDK), or visually using AWS Step Function Workflow Studio. To get started, create a new Task state. Then call AWS SDK services directly from the ASL in the resource field of a task state. To do this, use the following syntax.

arn:aws:states:::aws-sdk:serviceName:apiAction.[serviceIntegrationPattern]

Let me show you how to get started with a demo.

Demo
In this demo, you are building an application that, when given a video file stored in S3, transcribes it and translates from English to Spanish.

Let’s build this demo with Step Functions. The state machine, with the service integrations, integrates directly to S3, Amazon Transcribe, and Amazon Translate. The API for transcribing is asynchronous. To verify that the transcribing job is completed, you need a polling loop, which waits for it to be ready.

Create the state machine
To follow this demo along, you need to complete these prerequisites:

An S3 bucket where you will put the original file that you want to process
A video or audio file in English stored in that bucket
An S3 bucket where you want the processing to happen

I will show you how to do this demo using the AWS Management Console. If you want to deploy this demo as infrastructure as code, deploy the AWS CloudFormation template for this project.

To get started with this demo, create a new standard state machine. Choose the option Write your workflow in code to build the state machine using ASL. Create a name for the state machine and create a new role.

Start a transcription job
To get started working on the state machine definition, you can Edit the state machine.

The following piece of ASL code is a state machine with two tasks that are using the new AWS SDK Service Integrations capability. The first task is copying the file from one S3 bucket to another, and the second task is starting the transcription job by directly calling Amazon Transcribe.

For using this new capability from Step Functions, the state type needs to be a Task. You need to specify the service name and API action using this syntax: “arn:aws:states:::aws-sdk:serviceName:apiAction.<serviceIntegrationPattern>”. Use camelCase for apiAction names in the Resource field, such as “copyObject”, and use PascalCase for parameter names in the Parameters field, such as “CopySource”.

For the parameters, find the name and required parameters in the AWS API documentation for this service and API action.

{
  "Comment": "A State Machine that process a video file",
  "StartAt": "GetSampleVideo",
  "States": {
    "GetSampleVideo": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:s3:copyObject",
      "Parameters": {
        "Bucket.$": "$.S3BucketName",
        "Key.$": "$.SampleDataInputKey",
        "CopySource.$": "States.Format('{}/{}',$.SampleDataBucketName,$.SampleDataInputKey)"
      },
      "ResultPath": null,
      "Next": "StartTranscriptionJob"
    },
    "StartTranscriptionJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::aws-sdk:transcribe:startTranscriptionJob",
      "Parameters": {
        "Media": {
          "MediaFileUri.$": "States.Format('s3://{}/{}',$.S3BucketName,$.SampleDataInputKey)"
        },
        "TranscriptionJobName.$": "$$.Execution.Name",
        "LanguageCode": "en-US",
        "OutputBucketName.$": "$.S3BucketName",
        "OutputKey": "transcribe.json"
      },
      "ResultPath": "$.transcription",
      "End": true
    }
  }
}

In the previous piece of code, you can see an interesting use case of the intrinsic functions that ASL provides. You can construct a string using different parameters. Using intrinsic functions in combination with AWS SDK Service Integrations allows you to manipulate data without the needing a Lambda function. For example, this line:

"MediaFileUri.$": "States.Format('s3://{}/{}',$.S3BucketName,$.SampleDataInputKey)"

Give permissions to the state machine
If you start the execution of the state machine now, it will fail. This state machine doesn’t have permissions to access the S3 buckets or use Amazon Transcribe. Step Functions can’t autogenerate IAM policies for most AWS SDK Service Integrations, so you need to add those to the role manually.

Add those permissions to the IAM role that was created for this state machine. You can find a quick link to the role in the state machine details. Attach the “AmazonTranscribeFullAccess” and the “AmazonS3FullAccess” policies to the role.

Running the state machine for the first time
Now that the permissions are in place, you can run this state machine. This state machine takes as an input the S3 bucket name where the original video is uploaded, the name for the file and the name of the S3 bucket where you want to store this file and do all the processing.

For this to work, this file needs to be a video or audio file and it needs to be in English. When the transcription job is done, it saves the result in the bucket you specify in the input with the name transcribe.json.

 {
  "SampleDataBucketName": "<name of the bucket where the original file is>",
  "SampleDataInputKey": "<name of the original file>",
  "S3BucketName": "<name of the bucket where the processing will happen>"
}

As StartTranscriptionJob is an asynchronous call, you won’t see the results right away. The state machine is only calling the API, and then it completes. You need to wait until the transcription job is ready and then see the results in the output bucket in the file transcribe.json.

Adding a polling loop
Because you want to translate the text using your transcriptions results, your state machine needs to wait for the transcription job to complete. For building an API poller in a state machine, you can use a Task, Wait, and Choice state.

Task state gets the job status. In your case, it is calling the service Amazon Transcribe and the API getTranscriptionJob.
Wait state waits for 20 seconds, as the transcription job’s length depends on the size of the input file.
Choice state moves to the right step based on the result of the job status. If the job is completed, it moves to the next step in the machine, and if not, it keeps on waiting.

Wait state
The first of the states you are going to add is the Wait state. This is a simple state that waits for 20 seconds.

"Wait20Seconds": {
        "Type": "Wait",
        "Seconds": 20,
        "Next": "CheckIfTranscriptionDone"
      },

Task state
The next state to add is the Task state, which calls the API getTranscriptionJob. For calling this API, you need to pass the transcription job name. This state returns the job status that is the input of the Choice state.

"CheckIfTranscriptionDone": {
        "Type": "Task",
        "Resource": "arn:aws:states:::aws-sdk:transcribe:getTranscriptionJob",
        "Parameters": {
          "TranscriptionJobName.$": "$.transcription.TranscriptionJob.TranscriptionJobName"
        },
        "ResultPath": "$.transcription",
        "Next": "IsTranscriptionDone?"
      },

Choice state
The Choice state has one rule that checks if the transcription job status is completed. If that rule is true, then it goes to the next state. If not, it goes to the Wait state.

 "IsTranscriptionDone?": {
        "Type": "Choice",
        "Choices": [
          {
            "Variable": "$.transcription.TranscriptionJob.TranscriptionJobStatus",
            "StringEquals": "COMPLETED",
            "Next": "GetTranscriptionText"
          }
        ],
        "Default": "Wait20Seconds"
      },

Getting the transcription text
In this step you are extracting only the transcription text from the output file returned by the transcription job. You need only the transcribed text, as the result file has a lot of metadata that makes the file too long and confusing to translate.

This is a step that you would generally do with a Lambda function. But you can do it directly from the state machine using ASL.

First you need to create a state using AWS SDK Service Integration that gets the result file from S3. Then use another ASL intrinsic function to convert the file text from a String to JSON.

In the next state you can process the file as a JSON object. This state is a Pass state, which cleans the output from the previous state to get only the transcribed text.

 "GetTranscriptionText": {
        "Type": "Task",
        "Resource": "arn:aws:states:::aws-sdk:s3:getObject",
        "Parameters": {
          "Bucket.$": "$.S3BucketName",
          "Key": "transcribe.json"
        },
        "ResultSelector": {
          "filecontent.$": "States.StringToJson($.Body)"
        },
        "ResultPath": "$.transcription",
        "Next": "PrepareTranscriptTest"
      },
  
      "PrepareTranscriptTest" : {
        "Type": "Pass",
        "Parameters": {
          "transcript.$": "$.transcription.filecontent.results.transcripts[0].transcript"
        },
        "Next": "TranslateText"
      },

Translating the text
After preparing the transcribed text, you can translate it. For that you will use Amazon Translate API translateText directly from the state machine. This will be the last state for the state machine and it will return the translated text in the output of this state.

"TranslateText": {
        "Type": "Task",
        "Resource": "arn:aws:states:::aws-sdk:translate:translateText",
        "Parameters": {
          "SourceLanguageCode": "en",
          "TargetLanguageCode": "es",
          "Text.$": "$.transcript"
         },
         "ResultPath": "$.translate",
        "End": true
      }

Add the permissions to the state machine to call the Translate API, by attaching the managed policy “TranslateReadOnly”.

Now with all these in place, you can run your state machine. When the state machine finishes running, you will see the translated text in the output of the last state.

Important things to know
Here are some things that will help you to use AWS SDK Service Integration:

Call AWS SDK services directly from the ASL in the resource field of a task state. To do this, use the following syntax: arn:aws:states:::aws-sdk:serviceName:apiAction.[serviceIntegrationPattern]
Use camelCase for apiAction names in the Resource field, such as “copyObject”, and use PascalCase for parameter names in the Parameters field, such as “CopySource”.
Step Functions can’t autogenerate IAM policies for most AWS SDK Service Integrations, so you need to add those to the IAM role of the state machine manually.
Take advantage of ASL intrinsic functions, as those allow you to manipulate the data and avoid using Lambda functions for simple transformations.

Get started today!
AWS SDK Service Integration is generally available in the following regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Canada (Central), Europe (Ireland), Europe (Milan), Africa (Cape Town) and Asia Pacific (Tokyo). It will be generally available in all other commercial regions where Step Functions is available in the coming days.

Learn more about this new capability by reading its documentation.

— Marcia

New – AWS Step Functions Workflow Studio – A Low-Code Visual Tool for Building State Machines

2021-06-17 Marcia Villalba

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/new-aws-step-functions-workflow-studio-a-low-code-visual-tool-for-building-state-machines/

AWS Step Functions allow you to build scalable, distributed applications using state machines. Until today, building workflows on Step Functions required you to learn and understand Amazon State Language (ASL). Today, we are launching Workflow Studio, a low-code visual tool that helps you learn Step Functions through a guided interactive interface and allows you to prototype and build workflows faster.

In December 2016, when Step Functions was launched, I was in the middle of a migration to serverless. My team moved all the business logic from applications that were built for a traditional environment to a serverless architecture. Although we tried to have functions that did one thing and one thing only, when we put all the state management from our applications into the functions, they became very complex. When I saw that Step Functions was launched, I realized they would reduce the complexity of the serverless application we were building. The downside was that I spent a lot of time learning and writing state machines using ASL, learning how to invoke different AWS services, and performing the flow operations the state machine required. It took weeks of work and lots of testing to get it right.

Step Functions is amazing for visualizing the processes inside your distributed applications, but developing those state machines is not a visual process. Workflow Studio makes it easy for developers to build serverless workflows. It empowers developers to focus on their high-value business logic while reducing the time spent writing configuration code for workflow definitions and building data transformations.

Workflow Studio is great for developers who are new to Step Functions, because it reduces the time to build their first workflow and provides an accelerated learning path where developers learn by doing. Workflow Studio is also useful for developers who are experienced in building workflows, because they can now develop them faster using a visual tool. For example, you can use Workflow Studio to do prototypes of the workflows and share them with your stakeholders quickly. Or you can use Workflow Studio to design the boilerplate of your state machine. When you use Workflow Studio, you don’t need to have all the resources deployed in your AWS account. You can build the state machines and start completing them with the different actions as they get ready.

Workflow Studio simplifies the building of enterprise applications such as ecommerce platforms, financial transaction processing systems, or e-health services. It abstracts away the complexities of building fault-tolerant, scalable applications by assembling AWS services into workflows. Because Workflow Studio exposes many of the capabilities of AWS services in a visual workflow, it’s easy to sequence and configure calls to AWS services and APIs and transform the data flowing through a workflow.

Build a workflow using Workflow Studio
Imagine that you need to build a system that validates data when an account is created. If the input data is correct, the system saves the record in persistent storage and an email is sent to the administrator to confirm the account was created successfully. If the account cannot be created due to a validation error, the data is not stored and an email is sent to notify the administrator that there was a problem with the creation of the account.

There are many ways to solve this problem, but if you want to make the application with the least amount of code, and take advantage of all the managed services that AWS provides, you should use Workflow Studio to design the state machine and build the integrations with all the managed services.

Let me show you how easy is to create a state machine using Workflow Studio. To get started, go to the Step Functions console and create a state machine. You will see an option to start designing the new state machine visually with Workflow Studio.

You can start creating state machines in Workflow Studio. In the left pane, the States Browser, you can view and search the available actions and flow states. Actions are operations you can perform using AWS services, like invoking an AWS Lambda function, making a request with Amazon API Gateway, and sending a message to an Amazon Simple Notification Service (SNS) topic. Flows are the state types you can use to make a workflow appropriate for your use case.

Here are some of the available flow states:

Choice: Adds if-then-else logic.
Parallel: Adds parallel branches.
Map: Adds a for-each loop.
Wait: Delays for a specific time.

In the center of the page, you can see the state machine you are currently working on.

To build the account validator workflow, you need:

One task that invokes a Lambda function that validates the data provided to create the account.
One task that puts an item into a DynamoDB table.
Two tasks that put a message to an SNS topic.
One choice flow state, to decide which action to take, depending on the results of a Lambda function.

When creating the workflow, you don’t need to have all the AWS resources in advance to start working on the state machine. You can build the state machine and then you can add the definitions to the resources later. Or, as we are going to do in this blog post, you can have all your AWS resources deployed in your AWS account before you start working on your state machine. You can deploy the required resources into your AWS account from this Serverless Application Model template. After you create and deploy those resources, you can continue with the other steps in this post.

Configure the Lambda function
The first step in your workflow is the Lambda function. To add it to your state machine, just drag an Invoke action from the Actions list into the center of Workflow Studio, as shown in step 1. You can edit the configuration of your function in the right pane. For example, you can change the name (as shown in step 2). You can also edit which Lambda function should be invoked from the list of functions deployed in this account, as shown in step 3. When you’re done, you can edit the output for this task, as shown in step 4.

Configuring the output of the task is very important, because these values will be passed to the next state as input. We will construct a result object with just the information we need (in this case, if the account is valid). First, clear Filter output with OutputPath, as shown in step 1. Then you can select Transform result with Result Selector, and add the JSON shown in step 2. Then, to combine the input of this current state with the output, and send it to the next state as input, select Combine input and result with ResultPath, as shown in step 3. We need the input of this state, because the input is the account information. If the validation is successful, we need to store that data in a DynamoDB table.

If need help understanding what each of the transformations does, choose the Info links in each of the transformations.

Configure the choice state
After you configure the Lambda function, you need to add a choice state. A choice will validate the input using choice rules. Based on the result of applying those rules, the state machine will direct the execution to a different path.

The following figure shows the workflow for adding a choice state. In step 1, you drag it from the flow menu. In step 2, you enter a name for it. In step 3, you can define the rules. For this use case, you will have one rule with a specific condition.

The condition for this rule compares the results of the output of the previous state against a boolean constant. If the previous state operation returns a value of true, the rule is executed. This is your happy path. In this example, you want to validate the result of the Lambda function. If the function validates the input data, it returns validated is equals to true, as shown here.

If the rule doesn’t apply, the choice state makes the default branch run. This is your error path.

Configure the error path
When there is an error, you want to send an email to let the administrator know that the account couldn’t be created. You should have created an SNS topic earlier in the post. Make sure that the email address you configured in the SNS topic accepts the email subscription for this topic.

To add the SNS task of publishing a message, first search for SNS:Publish task as shown in step 1, and then drag it to the state machine, as shown in step 2. Drag a Fail state flow to the state machine, as shown in step 3, so that when this branch of execution is complete, the state machine is in a fail state.

One nice feature of Workflow Studio is that you can drag the different states around in the state machine and place them in different parts of the worklow.

Now you can configure the SNS task for publishing a message. First, change the state name, as shown in step 4. Choose the topic from the ones deployed in your AWS account, as shown in step 5. Finally, change the message that will be sent in the email to something appropriate for your use case, as shown in step 6.

Configure the happy path
For the happy path, you want to store the account information in a DynamoDB table and then send an email using the SNS topic you deployed earlier. To do that, add the DynamoDB:PutItem task, as shown in step 1, and the SNS:Publish task, as shown in step 2, into the state machine. You configure the SNS:Publish task in a similar way to the error path. You just send a different message. For that, you can duplicate the state from the error path, drag it to the right place, and just modify it with the new message.

The DynamoDB:PutItem task puts an item into a DynamoDB table. This is a very handy task because we don’t need to execute this operation inside a Lambda function. To configure this task, you first change its name, as shown in step 3. Then, you need to configure the API parameters, as shown in step 4, to put the right data into the DynamoDB table.

These are the API parameters to use for this particular item (an account):

{
  "TableName": "<THE NAME OF YOUR TABLE>",
  "Item": {
    "id": {
      "S.$": "$.Name"
    },
    "mail": {
      "S.$": "$.Mail"
    },
    "work": {
      "S.$": "$.Work"
    }
  }
}

Save and execute the state machine
Workflow Studio created the ASL definition of the state machine for you, but you can always edit the ASL definition and return to the visual editor whenever you want to edit the state machine.

Now that your state machine is ready, you can run the first execution. Save it and start a new execution. When you start a new execution, a message will be displayed, asking for the input event to the state machine. Make sure that the attributes for this event are named Name, Mail and Work, because the execution of the state machine depends on those.

After you run your state machine, you see a visualization for the execution. It shows you all the steps that the execution ran. In each step, you see the step input and step output. This is very useful for debugging and fine-tuning the state machine.

Available Now

There are a lot of great features on our roadmap for Workflow Studio. Although the details may change, we are currently working to give you the power to visually create, run, and even debug workflow executions. Stay tuned for more information, and please feel free to send us feedback.

Workflow Studio is available now in all the AWS Regions where Step Functions is available.

Try it and learn more.

— Marcia

Introducing Amazon API Gateway service integration for AWS Step Functions

2020-11-24 Benjamin Smith

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/introducing-amazon-api-gateway-service-integration-for-aws-step-functions/

AWS Step Functions now integrates with Amazon API Gateway to enable backend orchestration with minimal code and built-in error handling.

API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. These APIs enable applications to access data, business logic, or functionality from your backend services.

Step Functions allows you to build resilient serverless orchestration workflows with AWS services such as AWS Lambda, Amazon SNS, Amazon DynamoDB, and more. AWS Step Functions integrates with a number of services natively. Using Amazon States Language (ASL), you can coordinate these services directly from a task state.

What’s new?

The new Step Functions integration with API Gateway provides an additional resource type, arn:aws:states:::apigateway:invoke and can be used with both Standard and Express workflows. It allows customers to call API Gateway REST APIs and API Gateway HTTP APIs directly from a workflow, using one of two integration patterns:

Request-Response: calling a service and let Step Functions progress to the next state immediately after it receives an HTTP response. This pattern is supported by Standard and Express Workflows.
Wait-for-Callback: calling a service with a task token and have Step Functions wait until that token is returned with a payload. This pattern is supported by Standard Workflows.

The new integration is configured with the following Amazon States Language parameter fields:

ApiEndpoint: The API root endpoint.
Path: The API resource path.
Method: The HTTP request method.
HTTP headers: Custom HTTP headers.
RequestBody: The body for the API request.
Stage: The API Gateway deployment stage.
AuthType: The authentication type.

Refer to the documentation for more information on API Gateway fields and concepts.

Getting started

The API Gateway integration with Step Functions is configured using AWS Serverless Application Model (AWS SAM), the AWS Command Line Interface (AWS CLI), AWS CloudFormation or from within the AWS Management Console.

To get started with Step Functions and API Gateway using the AWS Management Console:

Go to the Step Functions page of the AWS Management Console.
Choose Run a sample project and choose Make a call to API Gateway.The Definition section shows the ASL that makes up the example workflow. The following example shows the new API Gateway resource and its parameters:
Review example Definition, then choose Next.
Choose Deploy resources.

This deploys a Step Functions standard workflow and a REST API with a /pets resource containing a GET and a POST method. It also deploys an IAM role with the required permissions to invoke the API endpoint from Step Functions.

The RequestBody field lets you customize the API’s request input. This can be a static input or a dynamic input taken from the workflow payload.

Running the workflow

Choose the newly created state machine from the Step Functions page of the AWS Management Console
Choose Start execution.

Paste the following JSON into the input field:

{
  "NewPet": {
    "type": "turtle",
    "price": 74.99
  }
}

Choose Start execution
Choose the Retrieve Pet Store Data step, then choose the Step output tab.

This shows the successful responseBody output from the “Add to pet store” POST request and the response from the “Retrieve Pet Store Data” GET request.

Access control

The API Gateway integration supports AWS Identity and Access Management (IAM) authentication and authorization. This includes IAM roles, policies, and tags.

AWS IAM roles and policies offer flexible and robust access controls that can be applied to an entire API or individual methods. This controls who can create, manage, or invoke your REST API or HTTP API.

Tag-based access control allows you to set more fine-grained access control for all API Gateway resources. Specify tag key-value pairs to categorize API Gateway resources by purpose, owner, or other criteria. This can be used to manage access for both REST APIs and HTTP APIs.

API Gateway resource policies are JSON policy documents that control whether a specified principal (typically an IAM user or role) can invoke the API. Resource policies can be used to grant access to a REST API via AWS Step Functions. This could be for users in a different AWS account or only for specified source IP address ranges or CIDR blocks.

To configure access control for the API Gateway integration, set the AuthType parameter to one of the following:

{“AuthType””: “NO_AUTH”}
Call the API directly without any authorization. This is the default setting.
{“AuthType””: “IAM_ROLE”}
Step Functions assumes the state machine execution role and signs the request with credentials using Signature Version 4.
{“AuthType””: “RESOURCE_POLICY”}
Step Functions signs the request with the service principal and calls the API endpoint.

Orchestrating microservices

Customers are already using Step Functions’ built in failure handling, decision branching, and parallel processing to orchestrate application backends. Development teams are using API Gateway to manage access to their backend microservices. This helps to standardize request, response formats and decouple business logic from routing logic. It reduces complexity by allowing developers to offload responsibilities of authentication, throttling, load balancing and more. The new API Gateway integration enables developers to build robust workflows using API Gateway endpoints to orchestrate microservices. These microservices can be serverless or container-based.

The following example shows how to orchestrate a microservice with Step Functions using API Gateway to access AWS services. The example code for this application can be found in this GitHub repository.

To run the application:

Clone the GitHub repository:

$ git clone https://github.com/aws-samples/example-step-functions-integration-api-gateway.git
$ cd example-step-functions-integration-api-gateway

Deploy the application using AWS SAM CLI, accepting all the default parameter inputs:
```
$ sam build && sam deploy -g
```
This deploys 17 resources including a Step Functions standard workflow, an API Gateway REST API with three resource endpoints, 3 Lambda functions, and a DynamoDB table. Make a note of the StockTradingStateMachineArn value. You can find this in the command line output or in the Applications section of the AWS Lambda Console:

Manually trigger the workflow from a terminal window:

aws stepFunctions start-execution \
--state-machine-arn <StockTradingStateMachineArnValue>

The response looks like:

When the workflow is run, a Lambda function is invoked via a GET request from API Gateway to the /check resource. This returns a random stock value between 1 and 100. This value is evaluated in the Buy or Sell choice step, depending on if it is less or more than 50. The Sell and Buy states use the API Gateway integration to invoke a Lambda function, with a POST method. A stock_value is provided in the POST request body. A transaction_result is returned in the ResponseBody and provided to the next state. The final state writes a log of the transition to a DynamoDB table.

Defining the resource with an AWS SAM template

The Step Functions resource is defined in this AWS SAM template. The DefinitionSubstitutions field is used to pass template parameters to the workflow definition.

StockTradingStateMachine:
    Type: AWS::Serverless::StateMachine # More info about State Machine Resource: https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-resource-statemachine.html
    Properties:
      DefinitionUri: statemachine/stock_trader.asl.json
      DefinitionSubstitutions:
        StockCheckPath: !Ref CheckPath
        StockSellPath: !Ref SellPath
        StockBuyPath: !Ref BuyPath
        APIEndPoint: !Sub "${ServerlessRestApi}.execute-api.${AWS::Region}.amazonaws.com"
        DDBPutItem: !Sub arn:${AWS::Partition}:states:::dynamodb:putItem
        DDBTable: !Ref TransactionTable

The workflow is defined on a separate file (/statemachine/stock_trader.asl.json).

The following code block defines the Check Stock Value state. The new resource, arn:aws:states:::apigateway:invoke declares the API Gateway service integration type.

The parameters object holds the required fields to configure the service integration. The Path and ApiEndpoint values are provided by the DefinitionsSubstitutions field in the AWS SAM template. The RequestBody input is defined dynamically using Amazon States Language. The .$ at the end of the field name RequestBody specifies that the parameter use a path to reference a JSON node in the input.

"Check Stock Value": {
  "Type": "Task",
  "Resource": "arn:aws:states:::apigateway:invoke",
  "Parameters": {
      "ApiEndpoint":"${APIEndPoint}",
      "Method":"GET",
      "Stage":"Prod",
      "Path":"${StockCheckPath}",
      "RequestBody.$":"$",
      "AuthType":"NO_AUTH"
  },
  "Retry": [
      {
          "ErrorEquals": [
              "States.TaskFailed"
          ],
          "IntervalSeconds": 15,
          "MaxAttempts": 5,
          "BackoffRate": 1.5
      }
  ],
  "Next": "Buy or Sell?"
},

The deployment process validates the ApiEndpoint value. The service integration builds the API endpoint URL from the information provided in the parameters block in the format https://[APIendpoint]/[Stage]/[Path].

Conclusion

The Step Functions integration with API Gateway provides customers with the ability to call REST APIs and HTTP APIs directly from a Step Functions workflow.

Step Functions’ built in error handling helps developers reduce code and decouple business logic. Developers can combine this with API Gateway to offload responsibilities of authentication, throttling, load balancing and more. This enables developers to orchestrate microservices deployed on containers or Lambda functions via API Gateway without managing infrastructure.

This feature is available in all Regions where both AWS Step Functions and Amazon API Gateway are available. View the AWS Regions table to learn more. For pricing information, see Step Functions pricing. Normal service limits of API Gateway and service limits of Step Functions apply.

For more serverless learning resources, visit Serverless Land.

Building Serverless Land: Part 2 – An auto-building static site

2020-11-10 Benjamin Smith

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/building-serverless-land-part-2-an-auto-building-static-site/

In this two-part blog series, I show how serverlessland.com is built. This is a static website that brings together all the latest blogs, videos, and training for AWS serverless. It automatically aggregates content from a number of sources. The content exists in a static JSON file, which generates a new static site each time it is updated. The result is a low-maintenance, low-latency serverless website, with almost limitless scalability.

A companion blog post explains how to build an automated content aggregation workflow to create and update the site’s content. In this post, you learn how to build a static website with an automated deployment pipeline that re-builds on each GitHub commit. The site content is stored in JSON files in the same repository as the code base. The example code can be found in this GitHub repository.

The serverless nature of the site means that developers don’t need to manage infrastructure or scalability. The use of AWS Amplify Console to automatically deploy directly from GitHub enables a regular release cadence with a fast transition from prototype to production.

Static websites

A static site is served to the user’s web browser exactly as stored. This contrasts to dynamic webpages, which are generated by a web application. Static websites often provide improved performance for end users and have fewer or no dependant systems, such as databases or application servers. They may also be more cost-effective and secure than dynamic websites by using cloud storage, instead of a hosted environment.

A static site generator is a tool that generates a static website from a website’s configuration and content. Content can come from a headless content management system, through a REST API, or from data referenced within the website’s file system. The output of a static site generator is a set of static files that form the website.

Serverless Land uses a static site generator for Vue.js called Nuxt.js. Each time content is updated, Nuxt.js regenerates the static site, building the HTML for each page route and storing it in a file.

The architecture

Serverless Land static website architecture

When the content.json file is committed to GitHub, a new build process is triggered in AWS Amplify Console.

Deploying AWS Amplify

AWS Amplify helps developers to build secure and scalable full stack cloud applications. AWS Amplify Console is a tool within Amplify that provides a user interface with a git-based workflow for hosting static sites. Deploy applications by connecting to an existing repository (GitHub, BitBucket Cloud, GitLab, and AWS CodeCommit) to set up a fully managed, nearly continuous deployment pipeline.

This means that any changes committed to the repository trigger the pipeline to build, test, and deploy the changes to the target environment. It also provides instant content delivery network (CDN) cache invalidation, atomic deploys, password protection, and redirects without the need to manage any servers.

Building the static website

To get started, use the Nuxt.js scaffolding tool to deploy a boiler plate application. Make sure you have npx installed (npx is shipped by default with npm version 5.2.0 and above).
```
$ npx create-nuxt-app content-aggregator
```
The scaffolding tool asks some questions, answer as follows:Nuxt.js scaffolding tool inputs
Navigate to the project directory and launch it with:
```
$ cd content-aggregator
$ npm run dev
```
The application is now running on http://localhost:3000.The pages directory contains your application views and routes. Nuxt.js reads the .vue files inside this directory and automatically creates the router configuration.
Create a new file in the /pages directory named blogs.vue:$ touch pages/blogs.vue
Copy the contents of this file into pages/blogs.vue.
Create a new file in /components directory named Post.vue :$ touch components/Post.vue
Copy the contents of this file into components/Post.vue.
Create a new file in /assets named content.json and copy the contents of this file into it.$ touch /assets/content.json

The blogs Vue component

The blogs page is a Vue component with some special attributes and functions added to make development of your application easier. The following code imports the content.json file into the variable blogPosts. This file stores the static website’s array of aggregated blog post content.

import blogPosts from '../assets/content.json'

An array named blogPosts is initialized:

data(){
    return{
      blogPosts: []
    }
  },

The array is then loaded with the contents of content.json.

 mounted(){
    this.blogPosts = blogPosts
  },

In the component template, the v-for directive renders a list of post items based on the blogPosts array. It requires a special syntax in the form of blog in blogPosts, where blogPosts is the source data array and blog is an alias for the array element being iterated on. The Post component is rendered for each iteration. Since components have isolated scopes of their own, a :post prop is used to pass the iterated data into the Post component:

<ul>
  <li v-for="blog in blogPosts" :key="blog">
     <Post :post="blog" />
  </li>
</ul>

The post data is then displayed by the following template in components/Post.vue.

<template>
    <div class="hello">
      <h3>{{ post.title }} </h3>
      <div class="img-holder">
          <img :src="post.image" />
      </div>
      <p>{{ post.intro }} </p>
      <p>Published on {{post.date}}, by {{ post.author }} p>
      <a :href="post.link"> Read article</a>
    </div>
</template>

This forms the framework for the static website. The /blogs page displays content from /assets/content.json via the Post component. To view this, go to http://localhost:3000/blogs in your browser:

The /blogs page

Add a new item to the content.json file and rebuild the static website to display new posts on the blogs page. The previous content was generated using the aggregation workflow explained in this companion blog post.

Connect to Amplify Console

Clone the web application to a GitHub repository and connect it to Amplify Console to automate the rebuild and deployment process:

Upload the code to a new GitHub repository named ‘content-aggregator’.
In the AWS Management Console, go to the Amplify Console and choose Connect app.
Choose GitHub then Continue.
Authorize to your GitHub account, then in the Recently updated repositories drop-down select the ‘content-aggregator’ repository.
In the Branch field, leave the default as master and choose Next.
In the Build and test settings choose edit.
Replace - npm run build with – npm run generate.
Replace baseDirectory: / with baseDirectory: dist

This runs the nuxt generate command each time an application build process is triggered. The nuxt.config.js file has a target property with the value of static set. This generates the web application into static files. Nuxt.js creates a dist directory with everything inside ready to be deployed on a static hosting service.
Choose Save then Next.
Review the Repository details and App settings are correct. Choose Save and deploy.

Amplify Console deployment

Once the deployment process has completed and is verified, choose the URL generated by Amplify Console. Append /blogs to the URL, to see the static website blogs page.

Any edits pushed to the repository’s content.json file trigger a new deployment in Amplify Console that regenerates the static website. This companion blog post explains how to set up an automated content aggregator to add new items to the content.json file from an RSS feed.

Conclusion

This blog post shows how to create a static website with vue.js using the nuxt.js static site generator. The site’s content is generated from a single JSON file, stored in the site’s assets directory. It is automatically deployed and re-generated by Amplify Console each time a new commit is pushed to the GitHub repository. By automating updates to the content.json file you can create low-maintenance, low-latency static websites with almost limitless scalability.

This application framework is used together with this automated content aggregator to pull together articles for http://serverlessland.com. Serverless Land brings together all the latest blogs, videos, and training for AWS Serverless. Download the code from this GitHub repository to start building your own automated content aggregation platform.

Building Serverless Land: Part 1 – Automating content aggregation

2020-11-03 Benjamin Smith

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/building-serverless-land-part-1-automating-content-aggregation/

In this two part blog series, I show how serverlessland.com is built. This is a static website that brings together all the latest blogs, videos, and training for AWS Serverless. It automatically aggregates content from a number of sources. The content exists in static JSON files, which generate a new site build each time they are updated. The result is a low-maintenance, low-latency serverless website, with almost limitless scalability.

This blog post explains how to automate the aggregation of content from multiple RSS feeds into a JSON file stored in GitHub. This workflow uses AWS Lambda and AWS Step Functions, triggered by Amazon EventBridge. The application can be downloaded and deployed from this GitHub repository.

The growing adoption of serverless technologies generates increasing amounts of helpful and insightful content from the developer community. This content can be difficult to discover. Serverless Land helps channel this into a single searchable location. By automating the collection of this content with scheduled serverless workflows, the process robustly scales to near infinite numbers. The Step Functions MAP state allows for dynamic parallel processing of multiple content sources, without the need to alter code. On-boarding a new content source is as fast and simple as making a single CLI command.

The architecture

Automating content aggregation with AWS Step Functions

The application consists of six Lambda functions orchestrated by a Step Functions workflow:

The workflow is triggered every 2 hours by an EventBridge scheduler. The schedule event passes an RSS feed URL to the workflow.
The first task invokes a Lambda function that runs an HTTP GET request to the RSS feed. It returns an array of recent blog URLs. The array of blog URLs is provided as the input to a MAP state. The MAP state type makes it possible to run a set of steps for each element of an input array in parallel. The number of items in the array can be different for each execution. This is referred to as dynamic parallelism.
The next task invokes a Lambda function that uses the GitHub REST API to retrieve the static website’s JSON content file.
The first Lambda function in the MAP state runs an HTTP GET request to the blog post URL provided in the payload. The URL is scraped for content and an object containing detailed metadata about the blog post is returned in the response.
The blog post metadata is compared against the website’s JSON content file in GitHub.
A CHOICE state determines if the blog post metadata has already been committed to the repository.
If the blog post is new, it is added to an array of “content to commit”.
As the workflow exits the MAP state, the results are passed to the final Lambda function. This uses a single git commit to add each blog post object to the website’s JSON content file in GitHub. This triggers an event that rebuilds the static site.

Using Secrets in AWS Lambda

Two of the Lambda functions require a GitHub personal access token to commit files to a repository. Sensitive credentials or secrets such as this should be stored separate to the function code. Use AWS Systems Manager Parameter Store to store the personal access token as an encrypted string. The AWS Serverless Application Model (AWS SAM) template grants each Lambda function permission to access and decrypt the string in order to use it.

Follow these steps to create a personal access token that grants permission to update files to repositories in your GitHub account.
Use the AWS Command Line Interface (AWS CLI) to create a new parameter named GitHubAPIKey:

aws ssm put-parameter \
--name /GitHubAPIKey \
--value ReplaceThisWithYourGitHubAPIKey \
--type SecureString

{
    "Version": 1,
    "Tier": "Standard"
}

Deploying the application

Fork this GitHub repository to your GitHub Account.
Clone the forked repository to your local machine and deploy the application using AWS SAM.

In a terminal, enter:

git clone https://github.com/aws-samples/content-aggregator-example
sam deploy -g

Enter the required parameters when prompted.

This deploys the application defined in the AWS SAM template file (template.yaml).

The business logic

Each Lambda function is written in Node.js and is stored inside a directory that contains the package dependencies in a `node_modules` folder. These are defined for each function by its relative package.json file. The function dependencies are bundled and deployed using the sam build && deploy -g command.

The GetRepoContents and WriteToGitHub Lambda functions use the octokit/rest.js library to communicate with GitHub. The library authenticates to GitHub by using the GitHub API key held in Parameter Store. The AWS SDK for Node.js is used to obtain the API key from Parameter Store. With a single synchronous call, it retrieves and decrypts the parameter value. This is then used to authenticate to GitHub.

const AWS = require('aws-sdk');
const SSM = new AWS.SSM();


//get Github API Key and Authenticate
    const singleParam = { Name: '/GitHubAPIKey ',WithDecryption: true };
    const GITHUB_ACCESS_TOKEN = await SSM.getParameter(singleParam).promise();
    const octokit = await  new Octokit({
      auth: GITHUB_ACCESS_TOKEN.Parameter.Value,
    })

Lambda environment variables are used to store non-sensitive key value data such as the repository name and JSON file location. These can be entered when deploying with AWS SAM guided deploy command.

Environment:
        Variables:
          GitHubRepo: !Ref GitHubRepo
          JSONFile: !Ref JSONFile

The GetRepoContents function makes a synchronous HTTP request to the GitHub repository to retrieve the contents of the website’s JSON file. The response SHA and file contents are returned from the Lambda function and acts as the input to the next task in the Step Functions workflow. This SHA is used in final step of the workflow to save all new blog posts in a single commit.

Map state iterations

The MAP state runs concurrently for each element in the input array (each blog post URL).

Each iteration must compare a blog post URL to the existing JSON content file and decide whether to ignore the post. To do this, the MAP state requires both the input array of blog post URLs and the existing JSON file contents. The ItemsPath, ResultPath, and Parameters are used to achieve this:

The ItemsPath sets input array path to $.RSSBlogs.body.
The ResultPath states that the output of the branches is placed in $.mapResults.
The Parameters block replaces the input to the iterations with a JSON node. This contains both the current item data from the context object ($$.Map.Item.Value) and the contents of the GitHub JSON file ($.RepoBlogs).

"Type":"Map",
    "InputPath": "$",
    "ItemsPath": "$.RSSBlogs.body",
    "ResultPath": "$.mapResults",
    "Parameters": {
        "BlogUrl.$": "$$.Map.Item.Value",
        "RepoBlogs.$": "$.RepoBlogs"
     },
    "MaxConcurrency": 0,
    "Iterator": {
       "StartAt": "getMeta",

The Step Functions resource

The AWS SAM template uses the following Step Functions resource definition to create a Step Functions state machine:

  MyStateMachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      DefinitionUri: statemachine/my_state_machine.asl.JSON
      DefinitionSubstitutions:
        GetBlogPostArn: !GetAtt GetBlogPost.Arn
        GetUrlsArn: !GetAtt GetUrls.Arn
        WriteToGitHubArn: !GetAtt WriteToGitHub.Arn
        CompareAgainstRepoArn: !GetAtt CompareAgainstRepo.Arn
        GetRepoContentsArn: !GetAtt GetRepoContents.Arn
        AddToListArn: !GetAtt AddToList.Arn
      Role: !GetAtt StateMachineRole.Arn

The actual workflow definition is defined in a separate file (statemachine/my_state_machine.asl.JSON). The DefinitionSubstitutions property specifies mappings for placeholder variables. This enables the template to inject Lambda function ARNs obtained by the GetAtt intrinsic function during template translation:

Step Functions mappings with placeholder variables

A state machine execution role is defined within the AWS SAM template. It grants the `Lambda invoke function` action. This is tightly scoped to the six Lambda functions that are used in the workflow. It is the minimum set of permissions required for the Step Functions to carry out its task. Additional permissions can be granted as necessary, which follows the zero-trust security model.

Action: lambda:InvokeFunction
Resource:
- !GetAtt GetBlogPost.Arn
- !GetAtt GetUrls.Arn
- !GetAtt CompareAgainstRepo.Arn
- !GetAtt WriteToGitHub.Arn
- !GetAtt AddToList.Arn
- !GetAtt GetRepoContents.Arn

The Step Functions workflow definition is authored using the AWS Toolkit for Visual Studio Code. The Step Functions support allows developers to quickly generate workflow definitions from selectable examples. The render tool and automatic linting can help you debug and understand the workflow during development. Read more about the toolkit in this launch post.

Scheduling events and adding new feeds

The AWS SAM template creates a new EventBridge rule on the default event bus. This rule is scheduled to invoke the Step Functions workflow every 2 hours. A valid JSON string containing an RSS feed URL is sent as the input payload. The feed URL is obtained from a template parameter and can be set on deployment. The AWS Compute Blog is set as the default feed URL. To aggregate additional blog feeds, create a new rule to invoke the Step Functions workflow. Provide the RSS feed URL as valid JSON input string in the following format:

{“feedUrl”:”replace-this-with-your-rss-url”}

ScheduledEventRule:
    Type: "AWS::Events::Rule"
    Properties:
      Description: "Scheduled event to trigger Step Functions state machine"
      ScheduleExpression: rate(2 hours)
      State: "ENABLED"
      Targets:
        -
          Arn: !Ref MyStateMachine
          Id: !GetAtt MyStateMachine.Name
          RoleArn: !GetAtt ScheduledEventIAMRole.Arn
          Input: !Sub
            - >
              {
                "feedUrl" : "${RssFeedUrl}"
              }
            - RssFeedUrl: !Ref RSSFeed

A completed workflow with step output

Conclusion

This blog post shows how to automate the aggregation of content from multiple RSS feeds into a single JSON file using serverless workflows.

The Step Functions MAP state allows for dynamic parallel processing of each item. The recent increase in state payload size limit means that the contents of the static JSON file can be held within the workflow context. The application decision logic is separated from the business logic and events.

Lambda functions are scoped to finite business logic with Step Functions states managing decision logic and iterations. EventBridge is used to manage the inbound business events. The zero-trust security model is followed with minimum permissions granted to each service and Parameter Store used to hold encrypted secrets.

This application is used to pull together articles for http://serverlessland.com. Serverless land brings together all the latest blogs, videos, and training for AWS Serverless. Download the code from this GitHub repository to start building your own automated content aggregation platform.

Introducing AWS X-Ray new integration with AWS Step Functions

2020-09-14 Benjamin Smith

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/introducing-aws-x-ray-new-integration-with-aws-step-functions/

AWS Step Functions now integrates with AWS X-Ray to provide a comprehensive tracing experience for serverless orchestration workflows.

Step Functions allows you to build resilient serverless orchestration workflows with AWS services such as AWS Lambda, Amazon SNS, Amazon DynamoDB, and more. Step Functions provides a history of executions for a given state machine in the AWS Management Console or with Amazon CloudWatch Logs.

AWS X-Ray is a distributed tracing system that helps developers analyze and debug their applications. It traces requests as they travel through the individual services and resources that make up an application. This provides an end-to-end view of how an application is performing.

What is new?

The new Step Functions integration with X-Ray provides an additional workflow monitoring experience. Developers can now view maps and timelines of the underlying components that make up a Step Functions workflow. This helps to discover performance issues, detect permission problems, and track requests made to and from other AWS services.

The Step Functions integration with X-Ray can be analyzed in three constructs:

Service map: The service map view shows information about a Step Functions workflow and all of its downstream services. This enables developers to identify services where errors are occurring, connections with high latency, or traces for requests that are unsuccessful among the large set of services within their account. The service map aggregates data from specific time intervals from one minute through six hours and has a 30-day retention.

Trace map view: The trace map view shows in-depth information from a single trace as it moves through each service. Resources are listed in the order in which they are invoked.

Trace timeline: The trace timeline view shows the propagation of a trace through the workflow and is paired with a time scale called a latency distribution histogram. This shows how long it takes for a service to complete its requests. The trace is composed of segments and sub-segments. A segment represents the Step Functions execution. Subsegments each represent a state transition.

Getting Started

X-Ray tracing is enabled using AWS Serverless Application Model (AWS SAM), AWS CloudFormation or from within the AWS Management Console. To get started with Step Functions and X-Ray using the AWS Management Console:

Go to the Step Functions page of the AWS Management Console.
Choose Get Started, review the Hello World example, then choose Next.
Check Enable X-Ray tracing from the Tracing section.

Workflow visibility

The following Step Functions workflow example is invoked via Amazon EventBridge when a new file is uploaded to an Amazon S3 bucket. The workflow uses Amazon Textract to detect text from an image file. It translates the text into multiple languages using Amazon Translate and saves the results into an Amazon DynamoDB table. X-Ray has been enabled for this workflow.

To view the X-Ray service map for this workflow, I choose the X-Ray trace map link at the top of the Step Functions Execution details page:

The service map is generated from trace data sent through the workflow. Toggling the Service Icons displays each individual service in this workload. The size of each node is weighted by traffic or health, depending on the selection.

This shows the error percentage and average response times for each downstream service. T/min is the number of traces sent per minute in the selected time range. The following map shows a 67% error rate for the Step Functions workflow.

Accelerated troubleshooting

By drilling down through the service map, to the individual trace map, I quickly pinpoint the error in this workflow. I choose the Step Functions service from the trace map. This opens the service details panel. I then choose View traces. The trace data shows that from a group of nine responses, 3 completed successfully and 6 completed with error. This correlates with the response times listed for each individual trace. Three traces complete in over 5 seconds, while 6 took less than 3 seconds.

Choosing one of the faster traces opens the trace timeline map. This illustrates the aggregate response time for the workflow and each of its states. It shows a state named Read text from image invoked by a Lambda Function. This takes 2.3 seconds of the workflow’s total 2.9 seconds to complete.

A warning icon indicates that an error has occurred in this Lambda function. Hovering the curser over the icon, reveals that the property “Blocks” is undefined. This shows that an error occurred within the Lambda function (no text was found within the image). The Lambda function did not have sufficient error handling to manage this error gracefully, so the workflow exited.

Here’s how that same state execution failure looks in the Step Functions Graph inspector.

Performance profiling

The visualizations provided in the service map are useful for estimating the average latency in a workflow, but issues are often indicated by statistical outliers. To help investigate these, the Response distribution graph shows a distribution of latencies for each state within a workflow, and its downstream services.

Latency is the amount of time between when a request starts and when it completes. It shows duration on the x-axis, and the percentage of requests that match each duration on the y-axis. Additional filters are applied to find traces by duration or status code. This helps to discover patterns and to identify specific cases and clients with issues at a given percentile.

Sampling

X-Ray applies a sampling algorithm to determine which requests to trace. A sampling rate of 100% is used for state machines with an execution rate of less than one per second. State machines running at a rate greater than one execution per second default to a 5% sampling rate. Configure the sampling rate to determine what percentage of traces to sample. Enable trace sampling with the AWS Command Line Interface (AWS CLI) using the CreateStateMachine and UpdateStateMachine APIs with the enable-Trace-Sampling attribute:

--enable-trace-sampling true

It can also be configured in the AWS Management Console.

Trace data retention and limits

X-Ray retains tracing data for up to 30 days with a single trace holding up to 7 days of execution data. The current minimum guaranteed trace size is 100Kb, which equates to approximately 80 state transitions. The actual number of state transitions supported will depend on the upstream and downstream calls and duration of the workflow. When the trace size limit is reached, the trace cannot be updated with new segments or updates to existing segments. The traces that have reached the limit are indicated with a banner in the X-Ray console.

For a full service comparison of X-Ray trace data and Step Functions execution history, please refer to the documentation.

Conclusion

The Step Functions integration with X-Ray provides a single monitoring dashboard for workflows running at scale. It provides a high-level system overview of all workflow resources and the ability to drill down to view detailed timelines of workflow executions. You can now use the orchestration capabilities of Step Functions with the tracing, visualization, and debug capabilities of AWS X-Ray.

This enables developers to reduce problem resolution times by visually identifying errors in resources and viewing error rates across workflow executions. You can profile and improve application performance by identifying outliers while analyzing and debugging high latency and jitter in workflow executions.

This feature is available in all Regions where both AWS Step Functions and AWS X-Ray are available. View the AWS Regions table to learn more. For pricing information, see AWS X-Ray pricing.

To learn more about Step Functions, read the Developer Guide. For more serverless learning resources, visit https://serverlessland.com.

Introducing larger state payloads for AWS Step Functions

2020-09-04 Rob Sutter

Post Syndicated from Rob Sutter original https://aws.amazon.com/blogs/compute/introducing-larger-state-payloads-for-aws-step-functions/

AWS Step Functions allows you to create serverless workflows that orchestrate your business processes. Step Functions stores data from workflow invocations as application state. Today we are increasing the size limit of application state from 32,768 characters to 256 kilobytes of data per workflow invocation. The new limit matches payload limits for other commonly used serverless services such as Amazon SNS, Amazon SQS, and Amazon EventBridge. This means you no longer need to manage Step Functions payload limitations as a special case in your serverless applications.

Faster, cheaper, simpler state management

Previously, customers worked around limits on payload size by storing references to data, such as a primary key, in their application state. An AWS Lambda function then loaded the data via an SDK call at runtime when the data was needed. With larger payloads, you now can store complete objects directly in your workflow state. This removes the need to persist and load data from data stores such as Amazon DynamoDB and Amazon S3. You do not pay for payload size, so storing data directly in your workflow may reduce both cost and execution time of your workflows and Lambda functions. Storing data in your workflow state also reduces the amount of code you need to write and maintain.

AWS Management Console and workflow history improvements

Larger state payloads mean more data to visualize and search. To help you understand that data, we are also introducing changes to the AWS Management Console for Step Functions. We have improved load time for the Execution History page to help you get the information you need more quickly. We have also made backwards-compatible changes to the GetExecutionHistory API call. Now if you set includeExecutionData to false, GetExecutionHistory excludes payload data and returns only metadata. This allows you to debug your workflows more quickly.

Doing more with dynamic parallelism

A larger payload also allows your workflows to process more information. Step Functions workflows can process an arbitrary number of tasks concurrently using dynamic parallelism via the Map State. Dynamic parallelism enables you to iterate over a collection of related items applying the same process to each item. This is an implementation of the map procedure in the MapReduce programming model.

When to choose dynamic parallelism

Choose dynamic parallelism when performing operations on a small collection of items generated in a preliminary step. You define an Iterator, which operates on these items individually. Optionally, you can reduce the results to an aggregate item. Unlike with parallel invocations, each item in the collection is related to the other items. This means that an error in processing one item typically impacts the outcome of the entire workflow.

Example use case

Ecommerce and line of business applications offer many examples where dynamic parallelism is the right approach. Consider an order fulfillment system that receives an order and attempts to authorize payment. Once payment is authorized, it attempts to lock each item in the order for shipment. The available items are processed and their total is taken from the payment authorization. The unavailable items are marked as pending for later processing.

The following Amazon States Language (ASL) defines a Map State with a simplified Iterator that implements the order fulfillment steps described previously.


    "Map": {
      "Type": "Map",
      "ItemsPath": "$.orderItems",
      "ResultPath": "$.packedItems",
      "MaxConcurrency": 40,
      "Next": "Print Label",
      "Iterator": {
        "StartAt": "Lock Item",
        "States": {
          "Lock Item": {
            "Type": "Pass",
            "Result": "Item locked!",
            "Next": "Pull Item"
          },
          "Pull Item": {
            "Type": "Pass",
            "Result": "Item pulled!",
            "Next": "Pack Item"
          },
          "Pack Item": {
            "Type": "Pass",
            "Result": "Item packed!",
            "End": true
          }
        }
      }
    }

The following image provides a visualization of this workflow. A preliminary state retrieves the collection of items from a data store and loads it into the state under the orderItems key. The triple dashed lines represent the Map State which attempts to lock, pull, and pack each item individually. The result of processing each individual item impacts the next state, Print Label. As more items are pulled and packed, the total weight increases. If an item is out of stock, the total weight will decrease.

A visualization of a portion of an AWS Step Functions workflow that implements dynamic parallelism

Dynamic parallelism or the “Map State”

Larger state payload improvements

Without larger state payloads, each item in the $.orderItems object in the workflow state would be a primary key to a specific item in a DynamoDB table. Each step in the “Lock, Pull, Pack” workflow would need to read data from DynamoDB for every item in the order to access detailed item properties.

With larger state payloads, each item in the $.orderItems object can be a complete object containing the required fields for the relevant items. Not only is this faster, resulting in a better user experience, but it also makes debugging workflows easier.

Pricing and availability

Larger state payloads are available now in all commercial and AWS GovCloud (US) Regions where Step Functions is available. No changes to your workflows are required to use larger payloads, and your existing workflows will continue to run as before. The larger state is available however you invoke your Step Functions workflows, including the AWS CLI, the AWS SDKs, the AWS Step Functions Data Science SDK, and Step Functions Local.

Larger state payloads are included in existing Step Functions pricing for Standard Workflows. Because Express Workflows are priced by runtime and memory, you may see more cost on individual workflows with larger payloads. However, this increase may also be offset by the reduced cost of Lambda, DynamoDB, S3, or other AWS services.

Conclusion

Larger Step Functions payloads simplify and increase the efficiency of your workflows by eliminating function calls to persist and retrieve data. Larger payloads also allow your workflows to process more data concurrently using dynamic parallelism.

With larger payloads, you can minimize the amount of custom code you write and focus on the business logic of your workflows. Get started building serverless workflows today!