Tag Archives: Architecture

Enhance Your Contact Center Solution with Automated Voice Authentication and Visual IVR

2022-02-03 Soonam Jose

Post Syndicated from Soonam Jose original https://aws.amazon.com/blogs/architecture/enhance-your-contact-center-solution-with-automated-voice-authentication-and-visual-ivr/

Recently, the Accenture AWS Business Group (AABG) assisted a customer in developing a secure and personalized Interactive Voice Response (IVR) contact center experience that receives and processes payments and responds to customer inquiries.

Our solution uses Amazon Connect at its core to help customers efficiently engage with customer service agents. To ensure transactions are completed securely and to prevent fraud, the architecture provides voice authentication using Amazon Connect Voice ID and a visual portal to submit payments. The visual IVR feature allows customers to easily provide the required information online while the IVR is on standby. The solution also provides agents the information they need to effectively and efficiently understand and resolve callers’ inquiries, which helps improve the quality of their service.

Overview of solution

Our IVR is designed using Contact Flows on Amazon Connect and uses the following services:

Amazon Lex provides the voice-based intent analysis. Intent analysis is the process of determining the underlying intention behind customer interactions.
Amazon Connect integrates with other AWS services using AWS Lambda.
Amazon DynamoDB stores customer data.
Amazon Pinpoint notifies customers via text and email.
AWS Amplify provides the customized agent dashboard and generates the visual IVR portal.

Figure 1 shows how this architecture routes customer calls:

Callers dial the main line to interact with the IVR in Amazon Connect.
Amazon Connect Voice ID sets up a voiceprint for first time callers or performs voice authentication for repeat callers for added security.
Upon successful voice authentication, callers can proceed to IVR self-service functions, such as checking their account balance or making a payment. Amazon Lex handles the voice intent analysis.
When callers make a payment request, they are given the option to be handed off securely to a visual IVR portal to process their payment.
If a caller requests to be connected to an agent, the agent will be presented with the customer’s information and IVR interaction details on their agent dashboard.

Figure 1. Architecture diagram

Customer IVR experience

Figure 2 describes how callers navigate through the IVR:

The IVR asks the caller the purpose of the call.
The caller’s answer is sent for voice intent analysis. The IVR also attempts to authenticate the caller’s voice using Amazon Connect Voice ID. If authenticated, the caller is automatically routed to the correct flow based on the analyzed intent.
- For the “Account Balance” flow, the caller is provided the account balance information.
- For the “Make a Payment” flow, the caller can use the IVR or a visual IVR portal to process the payment. Upon payment completion, the caller is immediately notified their transaction has completed via SMS or email. Both flows allow the caller to be transferred to an agent. The caller also has the option to be called back when an agent becomes available or choose a specific date and time for the callback.

Figure 2. Customer IVR experience diagram

The intelligent self-service IVR solution includes the following features:

The IVR can redirect callers to a payment portal for scenarios like making a payment while the IVR remains on standby.
IVR transaction tracking helps agents understand the current status of the caller’s transaction and quickly determines the caller’s situation.
Callers have the option to receive a call as soon as the next agent becomes available or they can schedule a time that works for them to receive a callback.
IVR activity logging gives agents a detailed summary of the caller’s actions within the IVR.
Transaction confirmation which notifies callers of successful transactions via SMS or email.

Solution walkthrough

Amazon Connect Voice ID authenticates a caller’s voice as an added level of security. It requires 30 seconds to create the initial enrollment voiceprint and 10 seconds of a caller’s voice to authenticate. If there is not enough net speech to perform the voice authentication, the IVR asks the caller more questions, such as their first name and last name, until it has collected enough net speech.

The IVR falls back to dual tone multi-frequency (DTMF) input for the caller’s credentials in case the system cannot successfully authenticate. This can include information like the last four digits of their national identification number or postal code.

In contact flows, you will enable voice authentication by adding the “Set security behavior” contact block and specifying the authentication threshold, as shown in Figure 3.

Figure 3. Set security behavior contact block

Figure 4 shows the “Check security status” contact block, which determines if the user has been successfully authenticated or not. It also shows results that it may return if the caller is not successfully authenticated, including, “Not authenticated,” “Inconclusive,” “Not enrolled,” “Opted out,” and “Error.”

Figure 4. Check security status contact block

Providing a personalized experience for callers

To provide a personalized experience for callers, sample customer data is stored in a DynamoDB table. A Lambda function queries this table when callers call the contact center. The query returns information about the caller, such as their name, so the IVR can offer a customized greeting.

Transaction tracking

The table can also query if a customer previously called and attempted to make a payment but didn’t complete it successfully. This feature is called “transaction tracking.” Here’s how it works:

When the caller progresses through the “make a payment” flow, a field in the table is updated to reflect their transaction’s status.
If the payment is abandoned, the status in the table remains open, and the IVR prompts the caller to pick up where they left off the next time they call.
Once they have successfully completed their payment, we update the status in the table to “complete.”
When the IVR confirms that the caller’s payment has gone through, they will receive a confirmation via SMS and email. A Lambda function in the contact flow receives the caller’s phone number and email address. Then it distributes the confirmation messages via Amazon Pinpoint.

If a call is escalated to an agent, the “Check contact attributes” contact block in Figure 5 helps to check the caller’s intent and provide the agent with a customized whisper.

Figure 5. Agent whisper sample contact flow

Making payments via the payment portal

To make a payment, an Amazon Lex bot presents the caller with the option to provide payment details over the phone or through a visual IVR portal.

If they choose to use the visual IVR portal (Figure 6), they can enter their payment details while maintaining an open phone connection with the contact center, in case they need additional assistance. Here’s how it works:

When callers select to use the payment portal, it prompts a Lambda function that generates a universally unique identifier (UUID) and provides the caller a unique PIN.
The UUID and PIN are stored in the DynamoDB table along with the caller’s information.
Another Lambda function generates a secure link using the UUID. It then uses Amazon Pinpoint to send the link to the caller over text message to their phone number on record. When they open the link, they are prompted to enter their unique PIN.
Then, the webpage makes an API call that validates the payment request by comparing the entered PIN to the PIN stored in the DynamoDB table.
Once validated, the caller can enter their payment information.

Figure 6. Visual IVR portal

Figure 7 illustrates visual IVR portal contact flow:

Every 10 seconds, a Lambda function checks the caller’s payment status. It provides the caller the option to escalate to an agent if they have questions.
If the caller does not fill out all the information when they hit “Submit Payment,” an IVR prompt will ask them to provide all payment details before proceeding.
The IVR phone call stays active until the user’s payment status is updated to “complete” in the DynamoDB table. This generates an IVR prompt stating that their payment was successful.

Figure 7. Visual IVR portal sample contact flow

Generating a chat transcript for agents

When the customer’s call is escalated to an agent, the agent receives a chat transcript. Here’s how it works:

After the caller’s intent is captured at the start of the call, the IVR logs activity using a “Set contact attribute” contact block, which prompts the $.Lex.SessionAttributes.transcript.
This transcript is used in a Lambda function to build a chat interface.
This transcript is shown on the agent’s dashboard, along with the Contact Control Panel (CCP) and a few key pieces of caller information.

Figure 8. IVR transcript

The agent’s customized dashboard and the visual IVR portal are deployed and hosted on Amplify. This allows us to seamlessly connect to our code repository and automate deployments after changes are committed. It removed the need to configure Amazon Simple Storage Service (Amazon S3) buckets, an Amazon CloudFront distribution, and Amazon Route 53 DNS to host our front-end components.

This solution also offers callers the ability to opt-in for a callback or to schedule a callback. A “Check queue status” contact block checks the current time in queue, and if it reaches a certain threshold, the IVR will offer a callback. The caller has the option to receive a call as soon as the next agent becomes available or to schedule a time to receive a callback. A Lex bot gathers the date and time slots, which are then passed to a Lambda function that will validate the proposed callback option.

Once confirmed, the scheduled callback is placed into a DynamoDB table along with the caller’s phone number. Another Lambda function scans the table every 5 minutes to see if there are any callbacks scheduled within that 5-minute time period. You’ll add an Amazon EventBridge prompt to the Lambda function that specifies a schedule expression like cron(0/5 8-17 ? * MON-FRI *), which means the Lambda function will execute every 5 minutes, Monday through Friday from 8:00 AM to 4:55 PM.

Conclusion

This solution helps you increase customer satisfaction by making it easier for callers to complete transactions over the phone. The visual IVR provides added web-based support experience to submit payments. It also improves the quality of service of your customer service agents by making relevant information available to agents during the call.

This solution also allows you to scale out the resources to handle increasing demand. Custom features can easily be added using serverless technology, such as Lambda functions or other cloud-native services on AWS.

Ready to get started? The AABG helps customers accelerate their pace of digital innovation and realize incremental business value from cloud adoption and transformation. Connect with our team at [email protected] to learn how to use machine learning in your products and services.

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Optimizing Your IoT Devices for Environmental Sustainability

2022-02-02 Jonas Bürkel

Post Syndicated from Jonas Bürkel original https://aws.amazon.com/blogs/architecture/optimizing-your-iot-devices-for-environmental-sustainability/

To become more environmentally sustainable, customers commonly introduce Internet of Things (IoT) devices. These connected devices collect and analyze data from commercial buildings, factories, homes, cars, and other locations to measure, understand, and improve operational efficiency. (There will be an estimated 24.1 billion active IoT devices by 2030 according to Transforma Insights.)

IoT devices offer several efficiencies. However, you must consider their environmental impact when using them. Devices must be manufactured, shipped, and installed; they consume energy during operations; and they must eventually be disposed of. They are also a challenge to maintain—an expert may need physical access to the device to diagnose issues and update it. This is especially true for smaller and cheaper devices, because extended device support and ongoing enhancements are often not economically feasible, which results in more frequent device replacements.

When architecting a solution to tackle operational efficiency challenges with IoT, consider the devices’ impact on environmental sustainability. Think critically about the impact of the devices you deploy and work to minimize their overall carbon footprint. This post considers device properties that influence an IoT device’s footprint throughout its lifecycle and shows you how Amazon Web Services (AWS) IoT services can help.

Architect for lean, efficient, and durable devices

So which device properties contribute towards minimizing environmental impact?

Lean devices use just the right amount of resources to do their job. They are designed, equipped, and built to use fewer resources, which reduces the impact of manufacturing and disposing them as well as their energy consumption. For example, electronic devices like smartphones use rare-earth metals in many of their components. These materials impact the environment when mined and disposed of. By reducing the amount of these materials used in your design, you can move towards being more sustainable.
Efficient devices lower their operational impact by using up-to-date and secure software and enhancements to code and data handling.
Durable devices remain in the field for a long time and still provide their intended function and value. They can adapt to changing business requirements and are able to recover from operational failure. The longer the device functions, the lower its carbon footprint will be. This is because device manufacturing, shipping, installing, and disposing will require relatively less effort.

In summary, deploy devices that efficiently use resources to bring business value for as long as possible. Finding the right tradeoff for your requirements allows you to improve operational efficiency while also maximizing your benefit on environmental sustainability.

High-level sustainable IoT architecture

Figure 1 shows building blocks that support sustainable device properties. Their main capabilities are:

Enabling remote device management
Allowing over-the-air (OTA) updates
Integrating with cloud services to access further processing capabilities while ensuring security of devices and data, at rest and in transit

Figure 1. Generic architecture for sustainable IoT devices

Introducing AWS IoT Core and AWS IoT Greengrass to your architecture

Assuming you have an at least partially connected environment, the capabilities outlined in Figure 1 can be achieved by using mainly two AWS IoT services:

AWS IoT Core is a managed cloud platform that lets connected devices easily and securely interact with cloud applications and other devices.
AWS IoT Greengrass is an IoT open-source edge runtime and cloud service that helps you build, deploy, and manage device software.

Figure 2 shows how the building blocks introduced in Figure 1 translate to AWS IoT services.

Figure 2. AWS architecture for sustainable IoT devices

Optimize your IoT devices for leanness and efficiency with AWS IoT Core

AWS IoT Core securely integrates IoT devices with other devices and the cloud. It allows devices to publish and subscribe to data in the cloud using device communication protocols. You can use this functionality to create event-driven data processing flows that can be integrated with additional services. For example, you can run machine learning inference, perform analytics, or interact with applications running on AWS.

According to a 451 Research report published in 2019, AWS can perform the same compute task with an 88% lower carbon footprint compared to the median of surveyed US enterprise data centers. More than two-thirds of this carbon reduction is attributable to more efficient servers and a higher server utilization. In 2021, 451 Research published similar reports for data centers in Asia Pacific and Europe.

AWS IoT Core offers this higher utilization and efficiency to edge devices in the following ways:

Non-latency critical, resource-intensive tasks can be run in the cloud where they can use managed services and be decommissioned when not in use.
Having less code on IoT devices also reduces maintenance efforts and attack surface while making it simpler to architect its software components for efficiency.
From a security perspective, AWS IoT Core protects and governs data exchange with the cloud in a central place. Each connected device must be credentialed to interact with AWS IoT. All traffic to and from AWS IoT is sent securely using Transport Layer Security (TLS) mutual authentication protocols. Services like AWS IoT Device Defender are available to analyze, audit, and monitor connected fleets of devices and cloud resources in AWS IoT at scale to detect abnormal behavior and mitigate security risks.

Customer Application:
Tibber, a Nordic energy startup, uses AWS IoT Core to securely exchange billions of messages per month about their clients’ real-time energy usage and aggregate data and perform analytics centrally. This allows them to keep their smart appliance lean and efficient while gaining access to scalable and more sustainable data processing capabilities.

Ensure device durability and longevity with AWS IoT Greengrass

Tasks like interacting with sensors or latency-critical computation must remain local. AWS IoT Greengrass, an edge runtime and cloud service, securely manages devices and device software, thereby enabling remote maintenance and secure OTA updates. It builds upon and extends the capabilities of AWS IoT Core and AWS IoT Device Management, which securely registers, organizes, monitors, and manages IoT devices.

AWS IoT Greengrass brings offline capabilities and simplifies the definition and distribution of business logic across Greengrass core devices. This allows for OTA updates of this business logic as well as the AWS IoT Greengrass Core software itself.

This is a distinctly different approach to what device manufacturers did in the past. Devices no longer need to be designed to run all code for one immutable purpose. Instead, they can be built to be flexible for potential future use cases, which ensures that business logic can be dynamically tweaked, maintained, and troubleshooted remotely when needed.

AWS IoT Greengrass does this using components. Components can represent applications, runtime installers, libraries, or any code that you would run on a device that are then distributed and managed through AWS IoT. Multiple AWS-provided components as well as the recently launched Greengrass Software Catalog extend the edge runtime’s default capabilities. The secure tunneling component, for example, establishes secure bidirectional communication with a Greengrass core device that is behind restricted firewalls, which can then be used for remote assistance and troubleshooting over SSH.

Conclusion

Historically, IoT devices were designed to stably and reliably serve one predefined purpose and were equipped for peak resource usage. However, as discussed in this post, to be sustainable, devices must now be lean, efficient, and durable. They must be manufactured, shipped, and installed once. From there, they should be able to be used flexibly for a long time. This way, they will consume less energy. Their smaller resource footprint and more efficient software allows organizations to improve operational efficiency but also fully realize their positive impact on emissions by minimizing devices’ carbon footprint throughout their lifecycle.

Ready to get started? Familiarize yourself with the topics of environmental sustainability and AWS IoT. Our AWS re:Invent 2021 Sustainability Attendee Guide covers this. When designing your IoT based solution, keep these device properties in mind. Follow the sustainability best practices described in the Sustainability Pillar of the AWS Well-Architected Framework.

Related information

Video: Architecting for sustainability
Video: Using Rust to minimize environmental impact
Video: Scaling sustainability solutions with IoT
Blog series: Optimizing your AWS Infrastructure for Sustainability Series
Magazine: AWS Architecture Monthly Magazine, August 2021, Sustainability

Financial Crime Discovery using Amazon EKS and Graph Databases

2022-02-01 Severin Gassauer-Fleissner

Post Syndicated from Severin Gassauer-Fleissner original https://aws.amazon.com/blogs/architecture/financial-crime-discovery-using-amazon-eks-and-graph-databases/

Discovering and solving financial crimes has become a challenge due to an increasing amount of financial data. While storing transactional payment data in a structured table format is useful for searching, filtering, and calculations, it is not always an ideal way to represent transactional data. For example, determining if there is a suspicious financial relationship between entity A and entity B is difficult to visualize in a table. Using tables, we would have to do SQL joins for every possible transaction from entity A to every possible subsequent transaction. We would then have to iterate this process until we found a relationship to entity B. Moreover, certain queries are challenging to run on a relational database management system (RDBMS). For example, it can be quite time consuming to discover which account received a minimum amount of $10,000,000 from other accounts.

Graph databases such as Amazon Neptune can be helpful with performing queries, because they can traverse the data and perform calculations simultaneously. Graphs enable us to represent transactions and parties over a multi-connected network, and discover patterns and chains of connections. It is common to use them in anti-money laundering (AML) applications, as they can help find patterns of suspicious transactions.

We needed a solution that could scale and process millions of transactions, by effectively using high memory and CPU configurations to perform complex queries quickly. As part of our customer demonstration to show how graph databases can help discover financial crimes, we also sourced a large dataset on which to test the solution. We used a graph database, Amazon Elastic Kubernetes Service (EKS), and Amazon Neptune, to search for suspicious financial chains across large amounts of transactional data in minutes.

Overview of our conceptual financial crime discovery solution

Figure 1. Workflow for financial crime discovery

We first needed to find a rules engine that could perform transaction inferencing and reasoning. It had to be able to process various rules on our data, be efficient, and able to scale. Next, we needed a straightforward way to ingest data into the solution. Once we had the data available, we needed to initiate a task to begin our inference job. Finally, we needed a location to store the result for further analysis and persistence, see Figure 1.

Using the AWS Cloud to scale up a graph database

We used multiple AWS services to create a fully automated end-to-end batch-based transaction process, shown in Figure 2. We used RDFox, which is an AWS Marketplace product, created by Oxford Semantic Technologies. RDFox is a high-performance in-memory graph database and semantic reasoner. To orchestrate RDFox, we used Amazon EKS Autoscaler to spin up a cluster to instantiate the RDFox container. Amazon EKS can spin up multiple containers for difference inference jobs and recycle the resources when the job finishes. We also used Amazon Neptune, an Amazon managed distributed graph database that can store up to 64 TiB of the results for diagnosis and long-term retention.

The data is stored in Amazon S3 buckets, which provide a streamlined way to feed a large dataset for processing.

Figure 2. Architecture diagram for financial crime discovery

Financial crimes rules

The power of graphs can help discover financial crimes that are reflected in relationships and monetary transactions. To demonstrate this, we will write two rules to detect two scenarios:

Given two suspicious parties X and Y, find out if there is a transactional relationship between them, and if so, provide the chain that connects them. This is a common scenario that financial institutions must detect.
Given a chain identifying suspicious behavior, find out if the minimum transaction amount that reached the beneficiary exceeds a threshold of Z dollars.

Generating data

To generate data for testing, we used a synthetic data generator written in Python (see GitHub repo in References). The generator built two sets of graph artifacts – parties and transactions. Every transaction is being paired with two random parties, and this iterative process creates a network of connected transactions and parties.

We created a dataset with a small percentage (0.01%) of parties tagged as “Suspicious Party,” to simulate the preceding business scenario. Note that those parties will have transactions going both in and out. In some cases, this will collide with other suspicious parties and establish a chain. This method enables us to get simulated data without engineering the suspicious chains.
The test dataset used with this solution comprises 1M transactions and thousands of parties.
For more information on generating data, see GitHub: Transaction Chains Data Generator.

Walkthrough of the financial investigator workflow

Once deployed, this solution can assist investigators as follows:

An investigator places the transactions and party data (nt triples) in a subdirectory within the input bucket. Typically, subdirectories can be named as a date or range of dates. In addition, the investigator uploads the particular rules (dlog files) and queries (rq files) that must be processed on the data.
Once the data is ready, the investigator uploads a job spec file (simple JSON, see References section). This contains the description of what resources the job requires (CPUs and memory), along with other configurations for the job.
Once the job spec has been uploaded to the bucket’s subdirectory, the job is automatically initiated. The Kubernetes scheduler will allocate enough resources to initiate the RDFox pod. The containers in the pod will then load, process, query results, and upload them to the output bucket.
Once the data reaches the output bucket, an AWS Lambda function is initiated. This invokes the Amazon Neptune Bulk Loader, which asynchronously loads the results to the Neptune cluster in a parallelized manner.
Once the load completes, the investigator gets an email notification that the job has been completed, and the results are ready for view.

Additionally:

The investigator can upload multiple rules and queries, they will all be processed automatically.
The investigator can launch multiple jobs with different/same data, and with different rules and queries at the same time.
All jobs outputs are saved in a unique job ID subdirectory in the output bucket.

To create the solution in your account, follow the instruction here: GitHub repo

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
RDFox license, request a free trial
Terraform
Familiarity with Kubernetes, Amazon Neptune, and SPARQL queries

Rules

We create two materialization rule sets to fetch the two scenarios described.

1. detect-suspicious-parties-pair.dlog

The purpose of this first set of rules is to detect chains that might exist between two suspicious parties. The idea of these chains is to represent all the possible relationships that contain monetary transfers between a suspicious originator and the beneficiary. This will include non-suspicious parties in the chain. The rule tags these chains with the “SuspiciousChain” flag.

2. detect-chains-exceed-100-dollars.dlog

This set of rules is designed to tag the chains identified by the first set of rules. It also contains a minimum amount of $100 passed to the beneficiary. We can change the amount to check for different compliance requirements. We tag those chains as “HighValueChain.”

Run the job and check your results

Now we can run our job, with the given data, rules, and two additional queries (to extract “SuspiciousChain” and “HighValueChain” respectively). The result of the queries will be loaded to Amazon Neptune automatically for persistent storage, and is made durably available for further analysis.

Let’s look at the results. The following query can be initiated against RDFox console or Amazon Neptune.

Visualization

PREFIX : <http://oxfordsemantic.tech/transactions/entities#>

PREFIX prop: <http://oxfordsemantic.tech/transactions/properties#>

PREFIX type: <http://oxfordsemantic.tech/transactions/classes#>

PREFIX tt: <http://oxfordsemantic.tech/transactions/tupletables#>

SELECT ?S ?P ?O WHERE {

?S a type:SuspiciousChain .

?S ?P ?O .

}

Figure 3. Visualizing suspicious chains

Whoa! Figure 3 might look complicated at first, but this is because we are visualizing every pair of suspicious parties that have a relationship with another suspicious party. Let’s filter the query to look only at a single particular chain, which exceeds a minimum of $100 to the beneficiary. The following query can be executed against RDFox console or Amazon Neptune.

PREFIX : <http://oxfordsemantic.tech/transactions/entities#>
PREFIX prop: <http://oxfordsemantic.tech/transactions/properties#>
PREFIX type: <http://oxfordsemantic.tech/transactions/classes#>
PREFIX tt: <http://oxfordsemantic.tech/transactions/tupletables#>

SELECT * WHERE {
?S ?P ?O
{
SELECT ?S WHERE {
?S a type:HighValueChain .

} Limit 1
}
}

Figure 4. Visualizing a particular chain

In Figure 4, we can see that Allison, the suspicious originator of the chain, has sent a transaction to Troy. Troy, who is not suspicious, sent the transaction to Karina. Karina is the suspicious beneficiary. We can also see additional information, such as the transaction amount that Karina received, and the chain length of 2 in this case.

In our testing, we were able to scale up to 500M transactions with 50M parties and process this in less than two hours! And we performed this at a significant lower cost when compared to running constant, fixed similar hardware.

Cleaning up

Follow Terraform cleanup instructions.

Conclusion

Graph databases are a powerful tool to apply reasoning on complex financial relationships. The combination of Amazon Web Services and the RDFox engine results in an automated, scalable, and cost-effective, thanks to the dynamic Kubernetes Cluster Autoscaler. Customers can use this solution and provide their investigators with a tool they can experiment and reason on financial transactions. This solution simplifies the process, and makes it easier to try different rules and queries on complex large data collections.

This blog post is written with Oxford Semantic.

References

Solution:

Solution GitHub repository
Data generator repository
Low-level design documents and architecture
Examples for submitting a job

Scaling DLT to 1M TPS on AWS: Optimizing a Regulated Liabilities Network

2022-01-31 Erica Salinas

Post Syndicated from Erica Salinas original https://aws.amazon.com/blogs/architecture/scaling-dlt-to-1m-tps-on-aws-optimizing-a-regulated-liabilities-network/

SETL is an open source, distributed ledger technology (DLT) company that enables tokenisation, digital custody, and DLT for securities markets and payments. In mid-2021, they developed a blueprint for a Regulated Liabilities Network (RLN) that enables holding and managing a variety of tokenized value irrespective of its form. In a December 2021 collaboration with Amazon Web Services (AWS), SETL demonstrated that their RLN platform could support one million transactions per second. These findings were published in their whitepaper, The Regulated Liability Network: Whitepaper on scalability and performance.

This AWS-hosted architecture scaled each processing component and the complete transaction flow. It was built using Amazon Elastic Kubernetes Service (EKS), Amazon Elastic Cloud Compute (EC2), and Amazon Managed Streaming for Apache Kafka (MSK).

This post discusses the technical implementation of the simulated network. We show how scaling characteristics were achieved, while maintaining the business requirements of atomicity and finality. We also discuss how each RLN component was optimized for high performance.

Background: What is an RLN?

In The Regulated Internet of Value, Tony McLaughlin of Citi proposed a single network to record ownership of multiple token types, each which represents a liability of a regulated entity. The RLN would give commercial banks, e-money providers, and central banks, the ability to issue liabilities and immutably record all balances and owners. Because it is designed to maintain a globally sequenced set of state changes, an unambiguous ledger of token balances can be computed and updated. Transactions, which result in two or more ledger updates, are proposed to the network. They are authorized by all impacted network participants, ordered for distribution to participant ledgers, and then persisted by multiple participant ledgers. This process is completed with five processing components.

Figure 1. Core components and queues of RLN

Given the existing performance limitations faced by many DLT platforms, a main concern with the network design was its ability to meet throughput requirements. SETL collaborated with AWS to define an architecture that integrated decoupled processing components, as shown in Figure 2. The target throughput was 1,000,000 TPS for each component through the simulated network.

Figure 2. Simulated RLN architecture in the AWS Cloud

A queue – process – queue architectural model

The basic architectural model applied was “queue – process – queue.” Each component consumed transactions from one or more Kafka topics, performed its requisite activities, and produced output to a second Kafka topic. Components of the message flow used Amazon EKS, Amazon EC2, and Amazon MSK.

Specific techniques were applied to achieve the scaling characteristics of components in the main message flow.

A distinct EKS-managed node group was used for each RLN component. Each node within the group had its own EC2 instance that housed at most two Kubernetes Pods. This technique decreased potential network throttling and enabled the monitoring of pod resource utilization using Amazon CloudWatch at the EC2 level. This aided in identifying bottlenecks, which can be challenging if large EC2 nodes house many Kubernetes Pods.
Each RLN queue component had its own dedicated Kafka cluster. By isolating the Kafka topics to their own cluster, we were able to scale and monitor the cluster individually. This decreased potential chokepoints.
A combination of eksctl, kubectl, and EC2 Auto Scaling groups was used to deploy and horizontally scale each RLN component. The tooling enabled rapid results by controlling pod configuration, automating deployments, and supporting multiple performance testing iterations.

Fine tuning RLN system components

The key metric tested was message throughput in transactions per second (TPS) consumed, processed, and produced for each component. An end-to-end test was performed, which measured throughput of the complete simulated network. Each RLN component was tuned to meet this metric as follows.

The Transaction Generator acted as the customer entering transactions into the system (see Figure 3.) The Kafka producer property buffer.memory was changed to 536870912 to simulate an increase in producer TPS for each component.

Figure 3. Simulated RLN Transaction Generator

The Scheduler’s scaling technique was a dedicated Compute (one Pod/EC2 node) that listened to one Kafka partition in the Transaction Queue as seen in Figure 4. Additionally, turning on gzip compression (compression.type) on the producer resolved a network bottleneck for the downstream Kafka cluster.

Figure 4. Simulated RLN Scheduler

The Approver’s scaling technique was like the Scheduler, however, the Proposal Queue had 10 topics as opposed to one. The approver is stateless, so it randomly selects a Kafka topic and partition to listen to. When scaling to 100 Kubernetes Pods, there would be roughly one Kafka partition per pod. As shown in Figure 5, each pod has its own EC2 instance. The Assembler’s scaling techniques were the same as for the Scheduler and Approver components.

Figure 5. Simulated RLN Approver

The Sequencer, as designed, cannot scale horizontally due to the nature of the sequencing algorithm. However, without any special tuning of the Kafka consumer configuration or any significant increase in EC2 sizing, it was able to achieve one million TPS. Its architecture is shown in Figure 6.

Figure 6. Simulated RLN Sequencer

The State Updater scaled with each Kubernetes Pod listening to an assigned Approved Proposal topic (one-to-one mapping) and receiving all sequenced hashes from the Hash Queue. Achieving 1 million TPS throughput required 10 nodes configured with c5.4xlarge, as seen in Figure 7.

Figure 7. Simulated RLN State Updater

Successful throughput testing

Test results demonstrate that each component can sustain over 1 million TPS. While horizontal scaling was applied, even a single instance achieved over 10,000 TPS. Individual component performance test results can be seen in Table 1.

Component	Single Instance (TPS)	# of Instances	Multiple Instances (TPS)
Generator	~ 629,219	2	1,258,438
Scheduler	~11,019	100	1,101,928
Approver	~10,348	100	1,034,868
Assembler	~10,691	100	1,069,100
Sequencer	~1,052,845	1	N/A
State Updater	~100,256	10	1,002,569

Table 1. Test results per individual component

Network tests were performed with all components integrated to create a simulated RLN network. The tests were executed with transaction generation of ~1,000,000 TPS. Throughput was measured at the output of the final component’s (State Updater) instances (see Figure 8). It was also measured in aggregate at over 1.2 million TPS (see Figure 9).

Figure 8. Individual TPS for State Update component nodes

Figure 9. Aggregate TPS for State Update component nodes

While this study was designed to demonstrate capacity and throughput, the tests did replicate Kafka partitions across three Availability Zones in the selected AWS Region. Thus, testing demonstrated that the proposed architecture supports geographic resilience while meeting the throughput benchmark. Resiliency and security are areas of focus for testing in the next phases of architecting a production-ready version of RLN.

Looking forward

During the month of joint work between the SETL and AWS technical teams, we stood up and tuned basic RLN functionality. We successfully demonstrated the scalability of the environment to at least 1M TPS. By using Kafka queues and the horizontal and vertical scalability of AWS services, we addressed the primary concern of reaching production-level throughput.

The industry group collaborating on the RLN concept now has a solid technical foundation on which to build a production-ready RLN architecture. Future experimentation will include integration with financial institutions and technology partners that will provide the capabilities needed to build a successful RLN ecosystem.

Further explorations include enabling digital signing, verification at scale, and expanding the network to include financial institution environments integrated with their own RLN partitions.

Codacy Measures Developer Productivity using AWS Serverless

2022-01-27 Catarina Gralha

Post Syndicated from Catarina Gralha original https://aws.amazon.com/blogs/architecture/codacy-measures-developer-productivity-using-aws-serverless/

Codacy is a DevOps insights company based in Lisbon, Portugal. Since its launch in 2012, Codacy has helped software development and engineering teams reduce defects, keep technical debt in check, and ship better code, faster.

Codacy’s latest product, Pulse, is a service that helps understand and improve the performance of software engineering teams. This includes measuring metrics such as deployment frequency, lead time for changes, or mean time to recover. Codacy’s main platform is built on top of AWS products like Amazon Elastic Kubernetes Service (EKS), but they have taken Pulse one step further with AWS serverless.

In this post, we will explore the Pulse’s requirements, architecture, and the services it is built on, including AWS Lambda, Amazon API Gateway, and Amazon DynamoDB.

Pulse prototype requirements

Codacy had three clear requirements for their initial Pulse prototype.

The solution must enable the development team to iterate quickly and have minimal time-to-market (TTM) to validate the idea.
The solution must be easily scalable and match the demands of both startups and large enterprises alike. This was of special importance, as Codacy wanted to onboard Pulse with some of their existing customers. At the time, these customers already had massive amounts of information.
The solution must be cost-effective, particularly during the early stages of the product development.

Enter AWS serverless

Codacy could have built Pulse on top of Amazon EC2 instances. However, this brings the undifferentiated heavy lifting of having to provision, secure, and maintain the instances themselves.

AWS serverless technologies are fully managed services that abstract the complexity of infrastructure maintenance away from developers and operators, so they can focus on building products.

Serverless applications also scale elastically and automatically behind the scenes, so customers don’t need to worry about capacity provisioning. Furthermore, these services are highly available by design and span multiple Availability Zones (AZs) within the Region in which they are deployed. This gives customers higher confidence that their systems will continue running even if one Availability Zone is impaired.

AWS serverless technologies are cost-effective too, as they are billed per unit of value, as opposed to billing per provisioned capacity. For example, billing is calculated by the amount of time a function takes to complete or the number of messages published to a queue, rather than how long an EC2 instance runs. Customers only pay when they are getting value out of the services, for example when serving an actual customer request.

Overview of Pulse’s solution architecture

An event is generated when a developer performs a specific action as part of their day-to-day tasks, such as committing code or merging a pull request. These events are the foundational data that Pulse uses to generate insights and are thus processed by multiple Pulse components called modules.

Let’s take a detailed look at a few of them.

Ingestion module

Figure 1. Pulse ingestion module architecture

Figure 1 shows the ingestion module, which is the entry point of events into the Pulse platform and is built on AWS serverless applications as follows:

The ingestion API is exposed to customers using Amazon API Gateway. This defines REST, HTTP, and WebSocket APIs with sophisticated functionality such as request validation, rate limiting, and more.
The actual business logic of the API is implemented as AWS Lambda functions. Lambda can run custom code in a fully managed way. You only pay for the time that the function takes to run, in 1-millisecond increments. Lambda natively supports multiple languages, but customers can also bring their own runtimes or container images as needed.
API requests are authorized with keys, which are stored in Amazon DynamoDB, a key-value NoSQL database that delivers single-digit millisecond latency at any scale. API Gateway invokes a Lambda function that validates the key against those stored in DynamoDB (this is called a Lambda authorizer.)
While API Gateway provides a default domain name for each API, Codacy customizes it with Amazon Route 53, a service that registers domain names and configures DNS records. Route 53 offers a service level agreement (SLA) of 100% availability.
Events are stored in raw format in Pulse’s data lake, which is built on top of AWS’ object storage service, Amazon Simple Storage Service (S3). With Amazon S3, you can store massive amounts of information at low cost using simple HTTP requests. The data is highly available and durable.
Whenever a new event is ingested by the API, a message is published in Pulse’s message bus. (More information later in this post.)

Events module

Figure 2. Pulse events module architecture

The events module handles the aggregation and storage of events for actual consumption by customers, see Figure 2:

Events are consumed from the message bus and processed with a Lambda function, which stores them in Amazon Redshift.
Amazon Redshift is AWS’ managed data warehouse, and enables Pulse’s users to get insights and metrics by running analytical (OLAP) queries with the highest performance.
These metrics are exposed to customers via another API (the public API), which is also built on API Gateway.
The business logic for this API is implemented using Lambda functions, like the Ingestion module.

Message bus

Figure 3. Message bus architecture

We mentioned earlier that Pulse’s modules communicate messages with each other via the “message bus.” When something occurs at a specific component, a message (event) is published to the bus. At the same time, developers create subscriptions for each module that should receive these messages. This is known as the publisher/subscriber pattern (pub/sub for short), and is a fundamental piece of event-driven architectures.

With the message bus, you can decouple all modules from each other. In this way, a publisher does not need to worry about how many or who their subscribers are, or what to do if a new one arrives. This is all handled by the message bus.

Pulse’s message bus is built like this, shown in Figure 3:

Events are published via Amazon Simple Notification Service (SNS), using a construct called a topic. Topics are the basic unit of message publication and consumption. Components are subscribed to this topic, and you can filter out unwanted messages.
Developers configure Amazon SNS subscriptions to have the events sent to a queue, which provides a buffering layer from which workers can process messages. At the same time, queues also ensure that messages are not lost if there is an error. In Pulse’s case, these queues are implemented with Amazon Simple Queue Service (SQS).

Other modules

There are other parts of Pulse architecture that also use AWS serverless. For example, user authentication and sign-up are handled by Amazon Cognito, and Pulse’s frontend application is hosted on Amazon S3. This app is served to customers worldwide with low latency using Amazon CloudFront, a content delivery network.

Summary and next steps

By using AWS serverless, Codacy has been able to reduce the time required to bring Pulse to market by staying focused on developing business logic, rather than managing servers. Furthermore, Codacy is confident they can handle Pulse’s growth, as this serverless architecture will scale automatically according to demand.

Learn more about Serverless on AWS.
Visit Codacy to find out more about Pulse.

Let’s Architect! Architecture and Sustainability

2022-01-26 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-1-architecture-and-sustainability/

We often read news about sustainability and how governments and large corporations are working to build a better world for the future. But, have you ever asked yourself what you can do? As a software architect, how can you make a difference by addressing sustainability challenges?

In this first post of Let’s Architect!, a series of posts that gathers content to help software architects and tech leaders explore new ideas, case studies, and technical approaches, we provide materials to help you design sustainable architectures and create awareness on sustainability.

Optimizing your AWS Infrastructure for Sustainability

How do you optimize the compute layer of your environment from a sustainability perspective? An idle server still consumes power, and regulating its power consumption is one way to improve environmental impact. But, the cloud offers many other metrics and features to monitor and optimize your system.

This blog post shows you how to analyze the utilization of your compute resources, explains the main features to automatically scale based on demand, and highlights how serverless can optimize your resource utilization. Knowing how to use your resources efficiently will help reduce the amount of energy spent by your workload.

The shared responsibility model for sustainability shows how it is a shared responsibility between AWS and customers

Building Sustainably on AWS

This talk provides several best practices you can follow as to design more sustainable architectures. It gives different tips to integrate sustainable practices throughout business operations and provides some guardrails that could help you achieve your organization’s sustainability goals more quickly.

Luke Hargreaves explaining how to build sustainably on AWS

Moving to event-driven architectures

An efficient architecture is typically a more sustainable architecture. This video explains how Amazon.com approaches event-driven architectures.

Event-driven architectures use events to communicate across different microservices. This architectural pattern works to reduce bandwidth consumption and CPU utilization and potentially lower cost. By choosing a serverless event-driven architecture, you’ll optimize your overall resource utilization because the code is run in response to events.

Tim Bray presenting how to move to an event-driven architecture at re:Invent 2019

Supporting climate model simulations to accelerate climate science

This blog post discusses how collaborating research teams use the data generated through climate model simulations to study impacts on Earth and human systems—including agriculture, drought, flooding, and human health—in various parts of the world.

These studies will advance understanding of near-term climate and climate-intervention responses, and accelerate progress on a time-sensitive problem for humanity.

This architecture built on AWS ParallelCluster supports weather and climate modeling workloads

See you next time!

Thanks for reading! If you want to deep dive into the topic of sustainability even more, don’t miss the Architecture Monthly edition on Sustainability.

See you in a couple of weeks when we discuss novel ways to use machine learning and artificial intelligence!

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Minimizing Dependencies in a Disaster Recovery Plan

2022-01-26 Randy DeFauw

Post Syndicated from Randy DeFauw original https://aws.amazon.com/blogs/architecture/minimizing-dependencies-in-a-disaster-recovery-plan/

The Availability and Beyond whitepaper discusses the concept of static stability for improving resilience. What does static stability mean with regard to a multi-Region disaster recovery (DR) plan? What if the very tools that we rely on for failover are themselves impacted by a DR event?

In this post, you’ll learn how to reduce dependencies in your DR plan and manually control failover even if critical AWS services are disrupted. As a bonus, you’ll see how to use service control policies (SCPs) to help simulate a Regional outage, so that you can test failover scenarios more realistically.

Failover plan dependencies and considerations

Let’s dig into the DR scenario in more detail. Using Amazon Route 53 for Regional failover routing is a common pattern for DR events. In the simplest case, we’ve deployed an application in a primary Region and a backup Region. We have a Route 53 DNS record set with records for both Regions, and all traffic goes to the primary Region. In an event that triggers our DR plan, we manually or automatically switch the DNS records to direct all traffic to the backup Region.

Relying on an automated health check to control Regional failover can be tricky. A health check might not be perfectly reliable if a Region is experiencing some type of degradation. Often, we prefer to initiate our DR plan manually, which then initiates with automation.

What are the dependencies that we’ve baked into this failover plan? First, Route 53, our DNS service, has to be available. It must continue to serve DNS queries, and we have to be able to change DNS records manually. Second, if we do not have a full set of resources already deployed in the backup Region, we must be able to deploy resources into it.

Both dependencies might violate static stability, because we are relying on resources in our DR plan that might be affected by the outage we’re seeing. Ideally, we don’t want to depend on other services running so we can failover and continue to serve our own traffic. How do we reduce additional dependencies?

Static stability

Let’s look at our first dependency on Route 53 – control planes and data planes. Briefly, a control plane is used to configure resources, and the data plane delivers services (see Understanding Availability Needs for a more complete definition.)

The Route 53 data plane, which responds to DNS queries, is highly resilient across Regions. We can safely rely on it during the failure of any single Region. But let’s assume that for some reason we are not able to call on the Route 53 control plane.

Amazon Route 53 Application Recovery Controller (Route 53 ARC) was built to handle this scenario. It provisions a Route 53 health check that we can manually control with a Route 53 ARC routing control, and is a data plane operation. The Route 53 ARC data plane is highly resilient, using a cluster of five Regional endpoints. You can revise the health check if three of the five Regions are available.

Figure 1. Simple Regional failover scenario using Route 53 Application Recovery Controller

The second dependency, being able to deploy resources into the second Region, is not a concern if we run a fully scaled-out set of resources. We must make sure that our deployment mechanism doesn’t rely only on the primary Region. Most AWS services have Regional control planes, so this isn’t an issue.

The AWS Identity and Access Management (IAM) data plane is highly available in each Region, so you can authorize the creation of new resources as long as you’ve already defined the roles. Note: If you use federated authentication through an identity provider, you should test that the IdP does not itself have a dependency on another Region.

Testing your disaster recovery plan

Once we’ve identified our dependencies, we need to decide how to simulate a disaster scenario. Two mechanisms you can use for this are network access control lists (NACLs) and SCPs. The first one enables us to restrict network traffic to our service endpoints. However, the second allows defining policies that specify the maximum permissions for the target accounts. It also allows us to simulate a Route 53 or IAM control plane outage by restricting access to the service.

For the end-to-end DR simulation, we’ve published an AWS samples repository on GitHub that you can use to deploy. This evaluates Route 53 ARC capabilities if both Route 53 and IAM control planes aren’t accessible.

By deploying test applications across us-east-1 and us-west-1 AWS Regions, we can simulate a real-world scenario that determines the business continuity impact, failover timing, and procedures required for successful failover with unavailable control planes.

Figure 2. Simulating Regional failover using service control policies

Before you conduct the test outlined in our scenario, we strongly recommend that you create a dedicated AWS testing environment with an AWS Organizations setup. Make sure that you don’t attach SCPs to your organization’s root but instead create a dedicated organization unit (OU). You can use this pattern to test SCPs and ensure that you don’t inadvertently lock out users from key services.

Chaos engineering

Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent production conditions. Chaos engineering and its principles are important tools when you plan for disaster recovery. Even a simple distributed system may be too complex to operate reliably. It can be hard or impossible to plan for every failure scenario in non-trivial distributed systems, because of the number of failure permutations. Chaos experiments test these unknowns by injecting failures (for example, shutting down EC2 instances) or transient anomalies (for example, unusually high network latency.)

In the context of multi-Region DR, these techniques can help challenge assumptions and expose vulnerabilities. For example, what happens if a health check passes but the system itself is unhealthy, or vice versa? What will you do if your entire monitoring system is offline in your primary Region, or too slow to be useful? Are there control plane operations that you rely on that themselves depend on a single AWS Region’s health, such as Amazon Route 53? How does your workload respond when 25% of network packets are lost? Does your application set reasonable timeouts or does it hang indefinitely when it experiences large network latencies?

Questions like these can feel overwhelming, so start with a few, then test and iterate. You might learn that your system can run acceptably in a degraded mode. Alternatively, you might find out that you need to be able to failover quickly. Regardless of the results, the exercise of performing chaos experiments and challenging assumptions is critical when developing a robust multi-Region DR plan.

Conclusion

In this blog, you learned about reducing dependencies in your DR plan. We showed how you can use Amazon Route 53 Application Recovery Controller to reduce a dependency on the Route 53 control plane, and how to simulate a Regional failure using SCPs. As you evaluate your own DR plan, be sure to take advantage of chaos engineering practices. Formulate questions and test your static stability assumptions. And of course, you can incorporate these questions into a custom lens when you run a Well-Architected review using the AWS Well-Architected Tool.

Building Resilient and High Performing Cloud-based Applications in Hawaii

2022-01-21 Marie Yap

Post Syndicated from Marie Yap original https://aws.amazon.com/blogs/architecture/building-resilient-and-high-performing-cloud-based-applications-in-hawaii/

Hawaii is building a digital economy for a sustainable future. Many local businesses are already embarking on their journey to the cloud to meet their customers’ growing demand for digital services. To access Amazon Web Services (AWS) on the US mainland, customers’ data must traverse through submarine fiber-optic cable networks approximately 2,800 miles across the Pacific Ocean. As a result, organizations have two primary concerns:

Resiliency concerns about multiple outage events that could arise from breaks in the submarine cables.
Latency concerns for mission-critical applications driven by physical distance.

These problems can be solved by architecting the workloads for reliability, secure connectivity, and high performance.

Designing network connectivity that is reliable, secure, and highly performant

A typical workload in AWS can be broken down into three layers – Network, Infrastructure, and Application. For each layer, we can design for resiliency and latency concerns. Starting at the network layer, there are two recommended options for connecting the on-premises network within the island to AWS.

Use of AWS Direct Connect over a physical connection. AWS Direct Connect is a dedicated network connection that connects your on-premises environment to AWS Regions. In this case, the connection is traversing the fiber-optic cable across the Pacific Ocean to the mainland’s meet-me-point facilities. It can be provisioned from 50 Mbps up to 100 Gbps. This provides you with a presence in an AWS Direct Connect location, a third-party colocation facility, or an Internet Service Provider (ISP) that provides last-mile connectivity to AWS. In addition, the Direct Connect location establishes dedicated connectivity to Amazon Virtual Private Clouds (VPC). This improves application performance and addresses latency concerns by connecting directly to AWS and bypassing the public internet.
Use of AWS VPN over an internet connection. As a secondary option to Direct Connect, AWS Site-to-Site VPN provide connectivity into AWS over the public internet using VPN encryption technologies. The Site-to-Site VPN connects on-premises sites to AWS resources in an Amazon VPC. As a result, you can securely connect your on-premises network to AWS using an internet connection.

We recommend choosing the us-west-2 AWS Region in Oregon to build high performant connectivity closest to Hawaii. The us-west-2 Region generally provides more AWS services at a lower cost versus us-west-1. In addition, there are various options for AWS Direct Connect Locations in the US West Region. Many of these locations support up to 100 Gbps and support MACsec, which is an IEEE standard for security encryption in wired Ethernet LANs. Typically, customers will use multiple 10-Gbps connections for higher throughput and redundancy.

Subsea Cable	Hawaii Cable Landing Station	Mainland Cable Landing Station	Nearest Direct Connect Location
Southern Cross Cable Network (SCCN)	Kahe Point (Oahu)	Morro Bay, CA	CoreSite, Equinix
Southern Cross Cable Network (SCCN)	Kahe Point (Oahu)	Hillsboro, OR	Equnix, EdgeConnex, Pittock Block, CoreSite, T5, TierPoint
Hawaiki	Kapolei (Oahu)	Hillsboro, OR	Equnix, EdgeConnex, Pittock Block, CoreSite, T5, TierPoint
Asia-America Gateway (AAG)	Keawaula (Oahu)	San Luis Obispo, CA	CoreSite, Equinix
Japan-US Cable Network (JUS)	Makaha (Oahu)	Morro Bay, CA	CoreSite, Equinix
SEA-US	Makaha (Oahu)	Hermosa Beach, CA	CoreSite, Equinix, T5

Table 1. Subsea fiber-optic cables connecting Hawaii to the US mainland

(Source: Submarine Cable Map from TeleGeography)

To build resilient connectivity, six cables connect Hawaii to the mainland US: Hawaiki, SEA-US, Asia-America Gateway (AAG), Japan-US (JUS), and two Southern Cross (SCCN) cables. In addition, these cables connect to various locations on the US West Coast. If you require high resiliency, we recommend a minimum of two physically redundant Direct Connect connections into AWS. In addition, we recommend designing four Direct Connect connections that span two Direct Connect locations for maximum resiliency. If you build your architecture following these recommendations, AWS offers this published service level agreement (SLA).

Figure 1. Redundant direct connection from Hawaii to the US mainland

Most customers select an ISP to get them connectivity across the Pacific Ocean to an AWS Direct location. The Direct Connect locations are third-party colocation providers who act as meet-me points for AWS customers and the AWS Regions. For example, our local AWS Partner DRFortress connects multiple ISPs in a data center in Hawaii to the AWS US West Region. We recommend having at least two ISPs for resilient applications, each providing connectivity across a separate subsea cable from Hawaii to the mainland. If one cable should fail for any reason, connectivity to AWS would still be available. The red links in figure 2 are the ISP-provided connectivity that spans the Pacific Ocean. This is a minimum starting point for business-critical applications and should be designed with additional Direct Connect links for greater resiliency.

Architecting for high performance and resiliency

Moving from the network to the infrastructure and application layer, organizations have the option in building their application all in the cloud or in combination with an on-premises environment. An example of an application built all in the cloud is the LumiSight platform in AWS built by local AWS Partner, DataHouse. LumiSight has helped dozens of organizations quickly and securely reopen safely during the pandemic.

Other customers need a hybrid cloud architecture solution. These organizations require that their data processing and locally hosted applications analysis is close to other components within the island’s data center. With this proximity, they can deliver near real-time responses to their end users. AWS Outposts Family extends the capabilities of an AWS Region to the island. This enables local businesses to build and run low latency applications on-premises on an AWS fully managed infrastructure. You can now deploy Compute, Storage, Containers, Data Analytics clusters, Relational, and Cache databases in high performance, redundant and secure infrastructure maintained by AWS. Outposts can be shipped to Hawaii, connecting to the us-west-1 or us-west-2 Regions.

Another option for improving application performance is providing an efficient virtual desktop to access their applications anywhere. Amazon WorkSpaces provides a secure, managed cloud-based virtual desktop experience. Many workers who bring their own device (BYOD) or work remotely use Workspaces to access their corporate applications securely. Workspaces use streaming protocols that provide a secure and responsive desktop experience to end users located in remote Regions, like Hawaii. Workspaces can quickly provide a virtual desktop without managing the infrastructure, OS versions, and patches. You can test your connection to Workspaces from Hawaii, or anywhere else in the world, at the Connection Health Check page.

Architecting for resiliency in the infrastructure and application stack is vital for Business Continuity and Disaster Recovery (BCDR) plans. Organizations in Hawaii who are already using VMware can take advantage of creating a recovery site using VMware Cloud on AWS as their solution for disaster recovery. The VMware Cloud on AWS is a fully managed VMware software-defined Data Center (SDDC) running on AWS, which provides access to native AWS services. Organizations can pair their on-premises vCenter and virtual machines to the fully managed vCenter and virtual machines residing in the cloud. The active Site Recovery Manager provides the automation of failing over and failing back applications between on-premises to the cloud DR site and vice versa. Additionally, organizations can define their SDDC in the us-west-2 Region using AWS Direct Connect to minimize the latency of replicating the data from and to the data centers in the islands.

Conclusion

Organizations in Hawaii can build resilient and high performant cloud-based workloads with the help of AWS services in each layer of their workloads. Starting with the network layer, you can establish reliable and lower latency connectivity through redundant AWS Direct Connect connections. Next, for low latency, hybrid applications, we extend infrastructure capabilities locally through AWS Outposts. We also improve the user experience in accessing cloud-based applications by providing Amazon WorkSpaces as the virtual desktop. Finally, we build resilient infrastructure and applications using a familiar solution called VMware Cloud on AWS.

To start learning the fundamentals and building on AWS, visit the Getting Started Resource Center.

How Ribbon Communications Built a Scalable, Resilient Robocall Mitigation Platform

2022-01-14 Siva Rajamani

Post Syndicated from Siva Rajamani original https://aws.amazon.com/blogs/architecture/how-ribbon-communications-built-a-scalable-resilient-robocall-mitigation-platform/

Ribbon Communications provides communications software, and IP and optical networking end-to-end solutions that deliver innovation, unparalleled scale, performance, and agility to service providers and enterprise.

Ribbon Communications is helping customers modernize their networks. In today’s data-hungry, 24/7 world, this equates to improved competitive positioning and business outcomes. Companies are migrating from on-premises equipment for telephony services and looking for equivalent as a service (aaS) offerings. But these solutions must still meet the stringent resiliency, availability, performance, and regulatory requirements of a telephony service.

The telephony world is inundated with robocalls. In the United States alone, there were an estimated 50.5 billion robocalls in 2021! In this blog post, we describe the Ribbon Identity Hub – a holistic solution for robocall mitigation. The Ribbon Identity Hub enables services that sign and verify caller identity, which is compliant to the ATIS standards under the STIR/SHAKEN framework. It also evaluates and scores calls for the probability of nuisance and fraud.

Ribbon Identity Hub is implemented in Amazon Web Services (AWS). It is a fully managed service for telephony service providers and enterprises. The solution is secure, multi-tenant, automatic scaling, and multi-Region, and enables Ribbon to offer managed services to a wide range of telephony customers. Ribbon ensures resiliency and performance with efficient use of resources in the telephony environment, where load ratios between busy and idle time can exceed 10:1.

Ribbon Identity Hub

The Ribbon Identity Hub services are separated into a data (call-transaction) plane, and a control plane.

Data plane (call-transaction)

The call-transaction processing is typically invoked on a per-call-setup basis where availability, resilience, and performance predictability are paramount. Additionally, due to high variability in load, automatic scaling is a prerequisite.

Figure 1. Data plane architecture

Several AWS services come together in a solution that meets all these important objectives:

Amazon Elastic Container Service (ECS): The ECS services are set up for automatic scaling and span two Availability Zones. This provides the horizontal scaling capability, the self-healing capacity, and the resiliency across Availability Zones.
Elastic Load Balancing – Application Load Balancer (ALB): This provides the ability to distribute incoming traffic to ECS services as the target. In addition, it also offers:
- Seamless integration with the ECS Auto Scaling group. As the group grows, traffic is directed to the new instances only when they are ready. As traffic drops, traffic is drained from the target instances for graceful scale down.
- Full support for canary and linear upgrades with zero downtime. Maintains full-service availability without any changes or even perception for the client devices.
Amazon Simple Storage Service (S3): Transaction detail records associated with call-related requests must be securely and reliably maintained for over a year due to billing and other contractual obligations. Amazon S3 simplifies this task with high durability, lifecycle rules, and varied controls for retention.
Amazon DynamoDB: Building resilient services is significantly easier when the compute processing can be stateless. Amazon DynamoDB facilitates such stateless architectures without compromise. Coupled with the availability of the Amazon DynamoDB Accelerator (DAX) caching layer, the solution can meet the extreme low latency operation requirements.
AWS Key Management Service (KMS): Certain tenant configuration is highly confidential and requires elevated protection. Furthermore, the data is part of the state that must be recovered across Regions in disaster recovery scenarios. To meet the security requirements, the KMS is used for envelope encryption using per-tenant keys. Multi-Region KMS keys facilitates the secure availability of this state across Regions without the need for application-level intervention when replicating encrypted data.
Amazon Route 53: For telephony services, any non-transient service failure is unacceptable. In addition to providing high degree of resiliency through Multi-AZ architecture, Identity Hub also provides Regional level high availability through its multi-Region active-active architecture. Route 53 with health checks provides for dynamic rerouting of requests within minutes to alternate Regions.

Control plane

The Identity Hub control plane is used for customer configuration, status, and monitoring. The API is REST-based. Since this is not used on a call-by-call basis, the requirements around latency and performance are less stringent, though the requirements around high resiliency and dynamic scaling still apply. In this area, ease of implementation and maintainability are key.

Figure 2. Control plane architecture

The following AWS services implement our control plane:

Amazon API Gateway: Coupled with a custom authenticator, the API Gateway handles all the REST API credential verification and routing. Implementation of an API is transformed into implementing handlers for each resource, which is the application core of the API.
AWS Lambda: All the REST API handlers are written as Lambda functions. By using the Lambda’s serverless and concurrency features, the application automatically gains self-healing and auto-scaling capabilities. There is also a significant cost advantage as billing is per millisecond of actual compute time used. This is significant for a control plane where usage is typically sparse and unpredictable.
Amazon DynamoDB: A stateless architecture with Lambda and API Gateway, all persistent state must be stored in an external database. The database must match the resilience and auto-scaling characteristics of the rest of the control plane. DynamoDB easily fits the requirements here.

The customer portal, in addition to providing the user interface for control plane REST APIs, also delivers a rich set of user-customizable dashboards and reporting capability. Here again, the availability of various AWS services simplifies the implementation, and remains non-intrusive to the central call-transaction processing.

Services used here include:

AWS Glue: Enables extraction and transformation of raw transaction data into a format useful for reporting and dashboarding. AWS Glue is particularly useful here as the data available is regularly expanding, and the use cases for the reporting and dashboarding increase.
Amazon QuickSight: Provides all the business intelligence (BI) functionality, including the ability for Ribbon to offer separate author and reader access to their users, and implements tenant-based access separation.

Conclusion

Ribbon has successfully deployed Identity Hub to enable cloud hosted telephony services to mitigate robocalls. Telephony requirements around resiliency, performance, and capacity were not compromised. Identity Hub offers the benefits of a 24/7 fully managed service requiring no additional customer on-premises equipment.

Choosing AWS services for Identity Hub gives Ribbon the ability to scale and meet future growth. The ability to dynamically scale the service in and out also brings significant cost advantages in telephony applications where busy hour traffic is significantly higher than idle time traffic. In addition, the availability of global AWS services facilitates the deployment of services in customer-local geographic locations to meet performance requirements or local regulatory compliance.

How Experian uses Amazon SageMaker to Deliver Affordability Verification

2022-01-13 Haresh Nandwani

Post Syndicated from Haresh Nandwani original https://aws.amazon.com/blogs/architecture/how-experian-uses-amazon-sagemaker-to-deliver-affordability-verification/

Financial Service (FS) providers must identify patterns and signals in a customer’s financial behavior to provide deeper, up-to-the-minute, insight into their affordability and credit risk. FS providers use these insights to improve decision making and customer management capabilities. Machine learning (ML) models and algorithms play a significant role in automating, categorising, and deriving insights from bank transaction data.

Experian publishes Categorisation-as-a-Service (CaaS) ML models that automate analysis of bank and credit card transactions, to be deployed in Amazon SageMaker. Driven by a suite of Experian proprietary algorithms, these models categorise a customer’s bank or credit card transactions into one of over 180 different income and expenditure categories. The service turns these categorised transactions into a set of summarised insights that can help a business better understand their customer and make more informed decisions. These insights provide a detailed picture of a customer’s financial circumstances and resilience by looking at verified income, expenditure, and credit behavior.

This blog demonstrates how financial service providers can introduce affordability verification and categorisation into their digital journeys by deploying Experian CaaS ML models on SageMaker. You don’t need significant ML knowledge to start using Amazon SageMaker and Experian CaaS.

Affordability verification and data categorisation in digital journeys

Product onboarding journeys are increasingly digital. Most financial service providers expect most of these journeys to initiate and complete online. An example journey would be consumers looking to apply for credit with their existing FS provider. These journeys typically involve FS providers performing affordability verification to ensure consumers are offered products they can afford. FS providers can now use Experian CaaS ML models available via AWS Marketplace to generate real-time financial insights and affordability verification for their customers.

Figure 1 depicts a typical digital journey for consumers applying for credit.

Figure 1. Customer journey for consumers applying for credit

Data categorisation for transactional data. Existing transactional data for current consumers is typically sourced from on-premises data sources into a data lake in the cloud. It is then prepared and transformed for processing and analytics. This analysis is done based on the FS provider’s existing consent in compliance with relevant data protection laws. Additional transaction information for other accounts not held by the lender can be sourced from Open Banking and categorised separately.
Store categorised transactions. Background processes run a SageMaker batch transform job using the Experian CaaS Data Categorisation model to categorise this transactional data.
Consumer applies for credit. Consumers use the FS providers’ existing front-end web, mobile, or any other digital channel to apply for credit.
FS provider retrieves up-to-date insights. Insights are generated in real time using the Experian CaaS insights model deployed as endpoints in SageMaker and returned to the consumer-facing digital channel.
FS provider makes credit decision. The channel app consolidates these insights to decide on product eligibility and drive customer journeys.

Deploying and publishing Experian CaaS ML models to Amazon SageMaker

Figure 2 demonstrates the technical solution for the customer journey described in the preceding section.

Figure 2. Credit application – technical solution using Amazon SageMaker and Experian CaaS ML models

Financial Service providers can use AWS Data Migration Service (AWS DMS) to replicate transactional data from their on-premises systems such as their core banking systems to Amazon S3. Customers can source this transactional data into a highly available and scalable data lake solution on AWS. Refer to AWS DMS documentation for technical details on supported database sources.
FS providers can use AWS Glue, a serverless data integration service, to cleanse, prepare, and transform the transactional data into formats supported by the Experian CaaS ML models.
FS providers can subscribe and download CaaS ML models built for SageMaker from the AWS Marketplace.
These models can be deployed to SageMaker hosting services as a SageMaker endpoint for real-time inference. Endpoints are fully managed by AWS, and can be set up to scale on demand and deployed in a Multi-AZ model for resilience. FS providers can use Amazon API Gateway and AWS Lambda to make these endpoints available to their consumer-facing applications.
SageMaker also supports a batch transform mode for ML models, which in this scenario will be used to precategorise transactional data. This mode is also useful for use cases that require nearly continuous and regular analysis such as a regular anti-fraud assessment.
Consumer requests for a financial product such as a credit card on an FS provider’s digital channels.
These requests invoke SageMaker endpoints, which use Experian CaaS models to derive real-time insights.
These insights are used to further drive the customer’s product journey. CaaS models are pre-trained and can return insights within the latency requirements of most real-time digital journeys.

Security and compliance using CaaS

AWS Marketplace models are scanned by AWS for common vulnerabilities and exposures (CVE). CVE is a list of publicly known information about security vulnerability and exposure. For details on infrastructure security applied by SageMaker, see Infrastructure Security in Amazon SageMaker.

Data security is a key concern for FS providers and sharing of data externally is challenging from a security and compliance perspective. The CaaS deployment model described here helps address these challenges as data owned by the FS provider remains within their control domain and AWS account. There is no requirement for this data to be shared with Experian. This means the customer’s personal financial information is retained by the FS provider. FS providers cannot access the model code as it is running in a locked SageMaker environment.

AWS Marketplace models such as the Experian CaaS ML models are deployed in a network isolation mode. This ensures that the models cannot make any outbound network calls, even to other AWS services such as Amazon S3. SageMaker still performs download and upload operations against Amazon S3 in isolation from the model.

Implementing upgrades to CaaS ML models

ML model upgrades can be performed in place in Amazon SageMaker as vendors release newer versions of their models in AWS Marketplace. Endpoints can be set up in a blue/green deployment pattern to ensure that upgrades do not impact consumers and be safely rolled back with no business interruptions.

Conclusion

Automated categorisation of bank transaction data is now being used by FS providers as they start to realise the benefits it can bring to their business. This is being driven in part by the advent of Open Banking. Many FS providers have increased confidence in the accuracy and performance of automated categorisation engines. Suppliers such as Experian are providing transparency around their methodologies used to categorise data, which is also encouraging adoption.

In this blog, we covered how FS providers can introduce automated categorisation of data and affordability identification capabilities into their digital journeys. This can be done quickly and without significant in-house ML skills, using Amazon SageMaker and Experian CaaS ML models. SageMaker endpoints and batch transform capabilities enable the deployment of a highly scalable, secure, and extensible ML infrastructure with minimal development and operational effort.

Experian’s CaaS is available for use via the AWS Marketplace.

Creating a Multi-Region Application with AWS Services – Part 2, Data and Replication

2022-01-12 Joe Chapman

Post Syndicated from Joe Chapman original https://aws.amazon.com/blogs/architecture/creating-a-multi-region-application-with-aws-services-part-2-data-and-replication/

In Part 1 of this blog series, we looked at how to use AWS compute, networking, and security services to create a foundation for a multi-Region application.

Data is at the center of many applications. In this post, Part 2, we will look at AWS data services that offer native features to help get your data where it needs to be.

In Part 3, we’ll look at AWS application management and monitoring services to help you build, monitor, and maintain a multi-Region application.

Considerations with replicating data

Data replication across the AWS network can happen quickly, but we are still limited by the speed of light. For this reason, data consistency must be considered when building a multi-Region application. Generally speaking, the longer a physical distance is, the longer it will take the data to get there.

When building a distributed system, consider the consistency, availability, partition tolerance (CAP) theorem. This theorem states that an application can only pick 2 out of the 3, and tradeoffs should be considered.

Consistency – all clients always have the same view of data
Availability – all clients can always read and write data
Partition Tolerance – the system will continue to work despite physical partitions

Achieving consistency and availability is common for single-Region applications. For example, when an application connects to a single in-Region database. However, this becomes more difficult with multi-Region applications due to the latency added by transferring data over long distances. For this reason, highly distributed systems will typically follow an eventual consistency approach, favoring availability and partition tolerance.

Replicating objects and files

To ensure objects are in multiple Regions, Amazon Simple Storage Service (Amazon S3) can be set up to replicate objects across AWS Regions automatically with one-way or two-way replication. A subset of objects in an S3 bucket can be replicated with S3 replication rules. If low replication lag is critical, S3 Replication Time Control can help meet requirements by replicating 99.99% of objects within 15 minutes, and most within seconds. To monitor the replication status of objects, Amazon S3 events and metrics will track replication and can send an alert if there’s an issue.

Traditionally, each S3 bucket has its own single, Regional endpoint. To simplify connecting to and managing multiple endpoints, S3 Multi-Region Access Points create a single global endpoint spanning multiple S3 buckets in different Regions. When applications connect to this endpoint, it will route over the AWS network using AWS Global Accelerator to the bucket with the lowest latency. Failover routing is also automatically handled if the connectivity or availability to a bucket changes.

For files stored outside of Amazon S3, AWS DataSync simplifies, automates, and accelerates moving file data across Regions and accounts. It supports homogeneous and heterogeneous file migrations across Elastic File System (Amazon EFS), Amazon FSx, AWS Snowcone, and Amazon S3. It can even be used to sync on-premises files stored on NFS, SMB, HDFS, and self-managed object storage to AWS for hybrid architectures.

File and object replication should be expected to be eventually consistent. The rate at which a given dataset can transfer is a function of the amount of data, I/O bandwidth, network bandwidth, and network conditions.

Copying backups

Scheduled backups can be set up with AWS Backup, which automates backups of your data to meet business requirements. Backup plans can automate copying backups to one or more AWS Regions or accounts. A growing number of services are supported, and this can be especially useful for services that don’t offer real-time replication to another Region such as Amazon Elastic Block Store (Amazon EBS) and Amazon Neptune.

Figure 1 shows how these data transfer services can be combined for each resource.

Figure 1. Storage replication services

Spanning non-relational databases across Regions

Amazon DynamoDB global tables provide multi-Region and multi-writer features to help you build global applications at scale. A DynamoDB global table is the only AWS managed offering that allows for multiple active writers in a multi-Region topology (active-active and multi-Region). This allows for applications to read and write in the Region closest to them, with changes automatically replicated to other Regions.

Global reads and fast recovery for Amazon DocumentDB (with MongoDB compatibility) can be achieved with global clusters. These clusters have a primary Region that handles write operations. Dedicated storage-based replication infrastructure enables low-latency global reads with a lag of typically less than one second.

Keeping in-memory caches warm with the same data across Regions can be critical to maintain application performance. Amazon ElastiCache for Redis offers global datastore to create a fully managed, fast, reliable, and secure cross-Region replica for Redis caches and databases. With global datastore, writes occurring in one Region can be read from up to two other cross-Region replica clusters – eliminating the need to write to multiple caches to keep them warm.

Spanning relational databases across Regions

For applications that require a relational data model, Amazon Aurora global database provides for scaling of database reads across Regions in Aurora PostgreSQL-compatible and MySQL-compatible editions. Dedicated replication infrastructure utilizes physical replication to achieve consistently low replication lag that outperforms the built-in logical replication database engines offer, as shown in Figure 2.

Figure 2. SysBench OLTP (write-only) stepped every 600 seconds on R4.16xlarge

With Aurora global database, one primary Region is designated as the writer, and secondary Regions are dedicated to reads. Aurora MySQL supports write forwarding, which forwards write requests from a secondary Region to the primary Region to simplify logic in application code. Failover testing can happen by utilizing managed planned failover, which will change the active write cluster to another Region while keeping the replication topology intact. All databases discussed in this post employ eventual consistency when used across Regions, but Aurora PostgreSQL has an option to set the maximum a replica lag allowed with managed recovery point objective (managed RPO).

Logical replication, which utilizes a database engine’s built-in replication technology, can be set up for Amazon Relational Database Service (Amazon RDS) for MariaDB, MySQL, Oracle, PostgreSQL, and Aurora databases. A cross-Region read replica will receive these changes from the writer in the primary Region. For applications built on RDS for Microsoft SQL Server, cross-Region replication can be achieved by utilizing the AWS Database Migration Service. Cross-Region replicas allow for quicker local reads and can reduce data loss and recovery times in the case of a disaster by being promoted to a standalone instance.

For situations where a longer RPO and recovery time objective (RTO) are acceptable, backups can be copied across Regions. This is true for all of the relational and non-relational databases mentioned in this post, except for ElastiCache for Redis. Amazon Redshift can also automatically do this for your data warehouse. Backup copy times will vary depending on size and change rates.

A purpose-built database strategy offers many benefits, Figure 3 forms a purpose-built global database architecture.

Figure 3. Purpose-built global database architecture

Summary

Data is at the center of almost every application. In this post, we reviewed AWS services that offer cross-Region data replication to get your data where it needs to be quickly. Whether you need faster local reads, an active-active database, or simply need your data durably stored in a second Region, we have a solution for you. In the 3rd and final post of this series, we’ll cover application management and monitoring features.

Ready to get started? We’ve chosen some AWS Solutions, AWS Blogs, and Well-Architected labs to help you!

Using Amazon Aurora Global Database for Low Latency without Application Changes

2022-01-11 Roneel Kumar

Post Syndicated from Roneel Kumar original https://aws.amazon.com/blogs/architecture/using-amazon-aurora-global-database-for-low-latency-without-application-changes/

Deploying global applications has many challenges, especially when accessing a database to build custom pages for end users. One example is an application using AWS Lambda@Edge. Two main challenges include performance and availability.

This blog explains how you can optimally deploy a global application with fast response times and without application changes.

The Amazon Aurora Global Database enables a single database cluster to span multiple AWS Regions by asynchronously replicating your data within subsecond timing. This provides fast, low-latency local reads in each Region. It also enables disaster recovery from Region-wide outages using multi-Region writer failover. These capabilities minimize the recovery time objective (RTO) of cluster failure, thus reducing data loss during failure. You will then be able to achieve your recovery point objective (RPO).

However, there are some implementation challenges. Most applications are designed to connect to a single hostname with atomic, consistent, isolated, and durable (ACID) consistency. But Global Aurora clusters provide reader hostname endpoints in each Region. In the primary Region, there are two endpoints, one for writes, and one for reads. To achieve strong data consistency, a global application requires the ability to:

Choose the optimal reader endpoints
Change writer endpoints on a database failover
Intelligently select the reader with the most up-to-date, freshest data

These capabilities typically require additional development.

The Heimdall Proxy coupled with Amazon Route 53 allows edge-based applications to access the Aurora Global Database seamlessly, without application changes. Features include automated Read/Write split with ACID compliance and edge results caching.

Figure 1. Heimdall Proxy architecture

The architecture in Figure 1 shows Aurora Global Databases primary Region in AP-SOUTHEAST-2, and secondary Regions in AP-SOUTH-1 and US-WEST-2. The Heimdall Proxy uses latency-based routing to determine the closest Reader Instance for read traffic, and redirects all write traffic to the Writer Instance. The Heimdall Configuration stores the Amazon Resource Name (ARN) of the global cluster. It automatically detects failover and cross-Region on the cluster, and directs traffic accordingly.

With an Aurora Global Database, there are two approaches to failover:

Managed planned failover. To relocate your primary database cluster to one of the secondary Regions in your Aurora global database, see Managed planned failovers with Amazon Aurora Global Database. With this feature, RPO is 0 (no data loss) and it synchronizes secondary DB clusters with the primary before making any other changes. RTO for this automated process is typically less than that of the manual failover.
Manual unplanned failover. To recover from an unplanned outage, you can manually perform a cross-Region failover to one of the secondaries in your Aurora Global Database. The RTO for this manual process depends on how quickly you can manually recover an Aurora global database from an unplanned outage. The RPO is typically measured in seconds, but this is dependent on the Aurora storage replication lag across the network at the time of the failure.

The Heimdall Proxy automatically detects Amazon Relational Database Service (RDS) / Amazon Aurora configuration changes based on the ARN of the Aurora Global cluster. Therefore, both managed planned and manual unplanned failovers are supported.

Solution benefits for global applications

Implementing the Heimdall Proxy has many benefits for global applications:

An Aurora Global Database has a primary DB cluster in one Region and up to five secondary DB clusters in different Regions. But the Heimdall Proxy deployment does not have this limitation. This allows for a larger number of endpoints to be globally deployed. Combined with Amazon Route 53 latency-based routing, new connections have a shorter establishment time. They can use connection pooling to connect to the database, which reduces overall connection latency.
SQL results are cached to the application for faster response times.
The proxy intelligently routes non-cached queries. When safe to do so, the closest (lowest latency) reader will be used. When not safe to access the reader, the query will be routed to the global writer. Proxy nodes globally synchronize their state to ensure that volatile tables are locked to provide ACID compliance.

For more information on configuring the Heimdall Proxy and Amazon Route 53 for a global database, read the Heimdall Proxy for Aurora Global Database Solution Guide.

Download a free trial from the AWS Marketplace.

Resources:

AWS Blog: How to Split Reads and Writes for Amazon RDS
AWS Blog: Automated Query Caching
AWS Blog: Advanced Connection Pooling
Contact: [email protected]

Heimdall Data, based in the San Francisco Bay Area, is an AWS Advanced ISV partner. They have AWS Service Ready designations for Amazon RDS and Amazon Redshift. Heimdall Data offers a database proxy that offloads SQL improving database scale. Deployment does not require code changes.

Detect Real-Time Anomalies and Failures in Industrial Processes Using Apache Flink

2022-01-10 Hubert Asamer

Post Syndicated from Hubert Asamer original https://aws.amazon.com/blogs/architecture/detect-real-time-anomalies-and-failures-in-industrial-processes-using-apache-flink/

For a long time, industrial control systems were the heart of the manufacturing process which allows collecting, processing, and acting on data from the shop floor. Process manufacturers used a distributed control system (DCS) to do the automated control and operation of an industrial process or plant.

With the convergence of operational technology and information technology (IT), customers such as Yara are integrating their DCS with additional intelligence from the IT side. This provides customers with a holistic view of the different data sources to make more complex decisions with advanced analytics.

In this blog post, we show how to start with advanced analytics on streaming data coming from the shop floor. The sensor data, such as pressure and temperature, is typically published by a DCS. It is then ingested with a local edge gateway and streamed to the cloud with streaming and industrial internet of things (IoT) technology. Analytics on the streaming data is typically done before all data points are stored in the data layer. Figure 1 shows how the data flow can be modeled and visualized with AWS services.

Figure 1: High-level ingestion and analytics architecture

In this example, we are concentrating on the streaming analytics part in the Cloud. We will generate data from a simulated DCS to Amazon Kinesis Data Streams where you have a gateway such as AWS IoT Greengrass and maybe other IoT services in-between.

For the simulated process that the DCS is controlling, we use a well-documented industrial process for creating a chemical compound (acetic anhydride) called the Tennesee Eastman process (TEP). There are several simulations available as open source. We demonstrate how to use this data as a constant stream with more than 30 real-time measurement parameters, ingest to Kinesis Data Streams, and run in-stream analytics using Apache Flink. Within Apache Flink, data is grouped and mapped to the respective stages and parts of the industrial process, and constantly analyzed by calculating anomalies of all process stages. All raw data, plus the derived anomalies and failure patterns, are then ingested from Apache Flink to Amazon Timestream for further use in near real-time dashboards.

Overview of solution

Note: Refer to steps 1 to 6 in Figure 2.

As a starting point for a realistic and data intensive measurement source, we use an already existing (TEP) simulation framework written in C++ originally created from National Institute of Standards and Technology, and published as open source. The GitHub Blog repository contains a small patch which adds AWS connectivity with the software development kits (SDKs) and modifications to the command line arguments. The programs provided by this framework are (step 1) a simulation process starter with configurable starting conditions and timestep configurations and a real-time client (step 2) which connects to the simulation and sends the simulation output data to the AWS Cloud.

Tennesee Eastman process (TEP) background

A paper by Downs & Vogel, A plant-wide industrial process control problem, from 1991 states:

“This chemical standard process consists of a reactor/separator/recycle arrangement involving two simultaneous gas-liquid exothermic reactions.”

“The process produces two liquid products from four reactants. Also present are an inert and a byproduct making a total of eight components. Two additional byproduct reactions also occur. The process has 12 valves available for manipulation and 41 measurements available for monitoring or control.“

The simulation framework used can control all of the 12 valve settings and produces 41 measurement variables with varying sampling frequency.

Data ingestion

The 41 measurement variables, named xmeas_1 to xmeas_41, are emitted by the real-time client (step 2) as key-value JSON messages. The client code is configured to produce 100 messages per second. A built-in C++ Kinesis SDK allows the real-time client to directly stream JSON messages to a Kinesis data stream (step 3).

Figure 2 – Detailed system architecture

Stream processing with Apache Flink

Messages sent to Amazon Kinesis Data Stream are processed in configurable batch sizes by an Apache Flink application, deployed in Amazon Kinesis Data Analytics. Apache Flink is an open-source stream processing framework, written and usable in Java or Scala. As described in Figure 3, it allows the definition of various data sources (for example, a Kinesis data stream) and data sinks for storing processing results. In-between data can be processed by a range of operators—typically mapping and reducing functions (step 4).

In our case, we use a mapping operator where each batch of incoming messages is processed. In Code snippet 1, we apply a custom mapping function to the raw data stream. For rapid and iterative development purposes it’s possible to have the complete stream processing pipeline running in a local Java or Scala IDE such as Maven, Eclipse, or IntelliJ.

Figure 3: Flink execution plan (green: streaming data sources; yellow: data sinks)

public class StreamingJob extends AnomalyDetector {
---
  public static DataStream<String> createKinesisSource
    (StreamExecutionEnvironment env, 
     ParameterTool parameter)
    {
    // create Stream
    return kinesisStream;
  }
---
  public static void main(String[] args) {
    // set up the execution environment
    final StreamExecutionEnvironment env = 
      StreamExecutionEnvironment.getExecutionEnvironment();
---
    DataStream<List<TimestreamPoint>> mainStream =
      createKinesisSource(env, parameter)
      .map(new AnomalyJsonToTimestreamPayloadFn(parameter))
      .name("MaptoTimestreamPayload");
---
    env.execute("Amazon Timestream Flink Anomaly Detection Sink");
  }
}

Code snippet 1: Flink application main class

In-stream anomaly detection

Within the Flink mapping operator, a statistical outlier detection (anomaly detection) is implemented. Flink allows the inclusion of custom libraries within its operators. The library used here is published by AWS—a Random Cut Forest implementation available from GitHub. Random Cut Forest is a well understood statistical method which can operate on batches of measurements. It then calculates an anomaly score for each new measurement by comparing a new value with a cached pool (=forest) of older values.

The algorithm allows the creation of grouped anomaly scores, where a set of variables is combined to calculate a single anomaly score. In the simulated chemical process (TEP), we can group the measurement variables into three process stages:

reactor feed analysis
purge gas analysis
product analysis.

Each group consists of 5–10 measurement variables. We’re getting anomaly scores for a, b, and c. In Code snippet 2 we can learn how an anomaly detector is created. The class AnomalyDetector is instantiated and extended then three times (for our three distinct process stages) within the mapping function as described in Code snippet 3.

Flink distributes this calculation across its worker nodes and handles data deduplication processes within its system.

---
public class AnomalyDetector {
    protected final ParameterTool parameter;
    protected final Function<RandomCutForest, LineTransformer> algorithmInitializer;
    protected LineTransformer algorithm;
    protected ShingleBuilder shingleBuilder;
    protected double[] pointBuffer;
    protected double[] shingleBuffer;
    public AnomalyDetector(
      ParameterTool parameter,
      Function<RandomCutForest,LineTransformer> algorithmInitializer)
    {
      this.parameter = parameter;
      this.algorithmInitializer = algorithmInitializer;
    }
    public List<String> run(Double[] values) {
            if (pointBuffer == null) {
                prepareAlgorithm(values.length);
            }
      return processLine(values);
    }
    protected void prepareAlgorithm(int dimensions) {
---
      RandomCutForest forest = RandomCutForest.builder()
        .numberOfTrees(Integer.parseInt(
          parameter.get("RcfNumberOfTrees", "50")))
        .sampleSize(Integer.parseInt(
          parameter.get("RcfSampleSize", "8192")))
        .dimensions(shingleBuilder.getShingledPointSize())
        .lambda(Double.parseDouble(
          parameter.get("RcfLambda", "0.00001220703125")))
        .randomSeed(Integer.parseInt(
          parameter.get("RcfRandomSeed", "42")))
      .build();
---
    algorithm = algorithmInitializer.apply(forest);
  }

Code snippet 2: AnomalyDetector base class, which gets extended by the streaming applications main class

public class AnomalyJsonToTimestreamPayloadFn extends 
    RichMapFunction<String, List<TimestreamPoint>> {
  protected final ParameterTool parameter;
  private final Logger logger = 

  public AnomalyJsonToTimestreamPayloadFn(ParameterTool parameter) {
    this.parameter = parameter;
  }

  // create new instance of StreamingJob for running our Forest
  StreamingJob overallAnomalyRunner1;
  StreamingJob overallAnomalyRunner2;
  StreamingJob overallAnomalyRunner3;
---

  // use `open`method as RCF initialization
  @Override
  public void open(Configuration parameters) throws Exception {
    overallAnomalyRunner1 = new StreamingJob(parameter);
    overallAnomalyRunner2 = new StreamingJob(parameter);
    overallAnomalyRunner3 = new StreamingJob(parameter);
  super.open(parameters);
}
---

Code snippet 3: Mapping Function uses the Flink RichMapFunction open routine to initialize three distinct Random Cut Forests

Data persistence – Flink data sinks

After all anomalies are calculated, we can decide where to send this data. Flink provides various ready-to-use data sinks. In these examples, we fan out all (raw and processed) data to Amazon Kinesis Data Firehose for storing in Amazon Simple Storage Service (Amazon S3) (long term) (step 5) and to Amazon Timestream (short term) (step 5). Kinesis Data Firehose is configured with a small AWS Lambda function to reformat data from JSON to CSV, and data is stored with automated partitioning to Amazon S3. A Timestream data sink does not come pre-bundled with Flink. A custom Timestream ingestion code is used in these examples. Flink provides extensible operator interfaces for the creation of custom map and sink functions.

Timeseries handling

Timestream, in combination with Grafana, is used for near real-time monitoring. Grafana comes bundled with a Timestream data source plugin and can constantly query and visualize Timestream data (step 6).

Walkthrough

Our architecture is available as a deployable AWS CloudFormation template. The simulation framework comes packed as a docker image, with an option to install it locally on a linux host.

Prerequisites

To implement this architecture, you will need:

An AWS account
Docker (CE) Engine v18++
Java JDK v11++
maven v3.6++

We recommend running a local and recent Linux environment. It is assumed that you are using AWS Cloud9, deployed with CloudFormation, within your AWS account.

Steps

Follow these steps to deploy the solution and play with the simulation framework. At the end, detected anomalies derived from Flink are stored next to all raw data in Timestream and presented in Grafana. We’re using AWS Cloud9 and its Linux terminal capabilities here to fire up a Grafana instance, then manually run the simulation to ingest data to Kinesis and optionally manually start the Flink app from the console using Maven.

Deploy stack

After you’re logged in to the AWS Management console you can deploy the CloudFormation stack. This stack creates a fully configured AWS Cloud9 environment with the related GitHub Repo already in place, a Kinesis data stream, Kinesis Data Firehose delivery stream, Kinesis Data Analytics with Flink app deployed, Timestream database, and an S3 bucket.

After successful deployment, record two important facts from the CloudFormation console: the chosen stack name and the attribute 03Cloud9EnvUrl displayed in the Output Section of the stack. The attribute’s URL will take you directly to our deployed AWS Cloud9 environment.

Run post install step within AWS Cloud9

The deployed stack created an AWS Cloud9 environment and an AWS Identity and Access Management (IAM) instance profile. We apply this instance profile to AWS Cloud9 to interact with Kinesis, Timestream, and Amazon S3 throughout the next steps. The used script also configures and installs other required tools.

1. Open a terminal window.

$ cd flinkAnomalySteps/deployment
$ source c9-postInstall.sh
---SETTING UP IAM INSTANCE PROFILE
Please enter cloudformation stack name (default: flink-rcf-app):
# enter your stack name

Start a Grafana development server

In this section we are starting a Grafana server using docker. Cloud 9 allows us to expose web applications (for demo & development purposes) on container port 8080.

1. Open a terminal window.

$ cd ../src/grafana-dashboard
$ docker volume create grafana-storage
# this creates a docker volume for persisting your Grafana settings
$./start-grafana.sh
# this starts a recent Grafana using docker with Timestream plugin and a pre-configured dashboard in place

2. Open the preview panel by selecting Preview, and then select Preview Running Application.

3. Next, in the preview pane, select Pop out into new Window.

4. A new browser tab with Grafana opens.

5. Choose any username and password combination.

6. In Grafana use the “Search dashboards” icon on the left and choose “TEP-SIM-DEV”. This pre-configured dashboard displays data from Amazon Timestream (see step “Open Grafana”).

TEP simulation procedure

Within your local or AWS Cloud9 Linux environment, fetch the simulation docker image from the public AWS container registry, or build the simulation binaries locally, for building manually check the GitHub repo.

Start simulation (in separate terminal)

# starts container and switch into the container-shell
$ docker run -it --rm \
  --network host \
  --name tesim-runner \
  tesim-runner:01 \
 /bin/bash
# then inside container
$ ./tesim --simtime 100  --external-ctrl
# simulation started…

Manipulate simulation (in separate terminal)

Follow the steps here for a basic process disturbance task. Review the aspects of influencing the simulation in the GitHub-Repo. The rtclient program has a range of commands to use for introducing disturbances.

# first switch into running simulation container
$ docker exec -it tesim-runner /bin/bash
# now we can access the shared storage of the simulation process…
$ ./rtclient –setidv 6
# this enables one of the built in process disturbances (1-20)
$ ./rtclient –setidv 7
$ ./rtclient –setidv 8
$ …

Stream Simulation data to Amazon Kinesis DataStream (in separate terminal)

The client has a built-in record frequency of 50 messages per second. One message contains more than 50 measurements, so we have approximately 2,500 measurements per second.

$ ./rtclient -k

AWS libcrypto resolve: found static libcrypto 1.1.1 HMAC symbols
AWS libcrypto resolve: found static libcrypto 1.1.1 EVP_MD symbols
{"xmeas_1": 3649.739476,"xmeas_2": 4451.32071,"xmeas_3": 9.223142558,"xmeas_4": 32.39290913,"xmeas_5": 47.55975621,"xmeas_6": 2798.975688,"xmeas_7": 64.99582601,"xmeas_8": 122.8987929,"xmeas_9": 0.1978264656,…}
# Messages in JSON sent to Kinesis DataStream visible via stdout

Compile and start Flink Application (optional step)

If you want deeper insights into the Flink Application, we can start this as well from the AWS Cloud9 instance. Note: this is only appropriate in development.

$ cd flinkAnomalySteps/src
$ cd flink-rcf-app
$ mvn clean compile
# the Flink app gets compiled
$ mvn exec:java -Dexec.mainClass= \
    "com.amazonaws.services.kinesisanalytics.StreamingJob"
# Flink App is started with default settings in place…
…

Open Grafana dashboard (from the step Start a Grafana development server)

Process anomalies are visible instantly after you start the simulation. Use Grafana to drill down into the data as needed.

/**example - simplest possible Timestream query used for Viz:**/

SELECT CREATE_TIME_SERIES(time, measure_value::double) as anomaly_stream6 FROM "kdaflink"."kinesisdata1"

    WHERE measure_name='anomaly_score_stream6' AND

    time between ago(15m) and now()

Code snippet 4: Timestream SQL example; Timestream database is `kdaflink` – table is `kinesisdata1`

Figure 4 – Grafana dashboard showing near real-time simulation data, three anomalies, mapped to the TEP process, are constantly calculated by Flink

S3 raw metrics bucket

For the sake of completeness and potential usefulness, the Flink Application emits all raw data in an intermediate step to Kinesis Data Firehose. The service converts all JSON data to CSV format by using a small AWS Lambda function.

$ aws s3 ls flink-rcf-app-rawmetricsbucket-<CFN-UUID>/tep-raw-csv/2021/11/26/19/

Cleaning up

Delete the deployed CloudFormation stack. All resources (excluding S3 buckets) are permanently deleted.

Conclusion

In this blog post, we learned that in-stream anomaly detection and constant measurement data insights can work together. The Apache Flink framework offers a ready-to-use platform that is mission critical for future adoption across manufacturing and other industries. Other applications of the presented Flink pattern can run on capable edge compute devices. Integration with AWS IoT Greengrass and AWS Greengrass Stream Manager are part of the GitHub Blog repository.

Another extension includes measurement data pattern detection routines, which can coexist with in-stream anomaly detection and can detect specific failure patterns over time using time-windowing features of the Flink framework. You can refer to the GitHub repo which accompanies this blog post. Give it a try and let us know your feedback in the comments!

Developing a Platform for Software-defined Vehicles with Continental Automotive Edge (CAEdge)

2022-01-06 Martin Stamm

Post Syndicated from Martin Stamm original https://aws.amazon.com/blogs/architecture/developing-a-platform-for-software-defined-vehicles-with-continental-automotive-edge-caedge/

This post was co-written by Martin Stamm, Principal Expert SW Architecture at Continental Automotive, Andreas Falkenberg, Senior Consultant at AWS Professional Services, Daniel Krumpholz, Engagement Manager at AWS Professional Services, David Crescence, Sr. Engagement Manager at AWS, and Junjie Tang, Principal Consultant at AWS Professional Services.

Automakers are embarking on a digital transformation journey to become more agile, efficient, and innovative. As part of this transformation, Continental created Continental Automotive Edge (CAEdge) – a modular multi-tenant hardware and software framework that connects the vehicle to the cloud. Continental collaborated with Amazon Web Services (AWS) to develop and scale this framework.

At this AWS re:Invent session, Continental and AWS demonstrated the new and transformative vehicle architectures and software built with CAEdge. These will provide future vehicle manufacturers, Original equipment manufacturers (OEMs) and partners with a multi-tenant development environment for software-intensive vehicle architectures. These can be used to implement software, sensor and big data solutions in a fraction of the development time needed before. As a result, vehicle software can be developed and tested more efficiently, then securely and rolled out directly to vehicles. The framework is already being tested in an automotive manufacturer’s series development.

Addressing core automotive industry pain points

Continental, OEMs and other major Tier 1 companies are required to quickly adapt to new technology requirements without knowing capacity or scaling needs, while at the same time staying ahead of the market. Developers are facing several challenges, in particular the processing of huge amounts of data. For example, a single test vehicle for AV/ADAS generates 20 – 100 TB of data per day. The handling of these data sets and the time to availability in distributed sites can cause major delays in development cycles. Delays are also experienced by developers due to the high numbers of test cases in simulation and validation. In an on-premises environment, this poses significant costs and scaling challenges to provide the required capacity.

The pace of the required transformation to becoming a software-centric organization is creating new challenges and opportunities like:

Current electronic architectures are decentralized, expensive, and complex to develop therefore difficult to maintain and extend.
Vehicle and cloud converge require new software (SW)-defined architectures, integration and operations competencies.
Digital Lifecycle Management enables new business models, go- to-market strategies and partnerships.

In addition to the distribution of huge datasets and distributed work setups is a need for cutting edge security technologies. Encryption at transfer/rest, data residency laws, and secure developer access are common security challenges and are addressed using CAEdge technology.

In this blog post, we describe how to build a secure multi-tenant AWS environment that forms the foundation for CAEdge. We discuss how AWS is helping Continental build the base infrastructure that allows for fast onboarding of OEMs, partners and suppliers through a highly automated process. Development activities can start within hours, instead of days or weeks; with a bootstrapped development environment for software-intensive vehicle architectures. This is all while meeting the strictest security and compliance requirements.

Overview of the CAEdge Framework

The following diagram gives an overview of the CAEdge Framework:

Architecture Diagram showing the CAEdge Platform

Figure 1 – Architecture Diagram showing the CAEdge Framework

The framework is based on the following modular building blocks:

Scalable Compute Platform – High Performance, embedded computer with automotive software stack and connection to the AWS cloud.
Cloud – Cloud services for developers and end-users.
DevOps Workbench – Toolchain for software development and maintenance covering the entire software lifecycle.

The building blocks of the framework are defined by clear API operations and can be integrated easily for various use cases, such as different middleware or CI / CD pipelines.

Overview of the CAEdge Multi-Tenant Framework

Continentals’ core architecture and terminology for a vehicle software development framework include:

CAEdge Framework as an Isolated AWS Organization – Continental’s CAEdge framework runs in a dedicated AWS Organization. Only CAEdge-related workloads are hosted in this AWS Organization. This ensures separation from any other workloads outside of the CAEdge context. The CAEdge framework provides multiple central security, access management, and orchestration services to its users.
Isolated Tenants – The framework is fully tenant-aware. A tenant is an isolated entity that represents an OEM, OEM sub-division, partner, or supplier. A key feature of this system is to ensure complete isolation from one tenant to another. We use a defense-in-depth security approach to ensure tenant separation.
Tenant-Owned Resources and Services – Each tenant has a dedicated set of resources and services that can be consumed and used by all tenant users and services. Tenant-owned resources and services include, but are not limited to:
- Dedicated, tenant-specific data lake,
- Tenant specific logging, monitoring, and operations,
- Tenant-specific UI.
Projects – Each tenant can host an arbitrary number of projects with 1-N users assigned to them. A project is a high-level construct with the goal to create a unique product or service, such as a new “Door Lock” system software. The term project is used in a broad scope. Anything can be classified as a project.
Workbenches – A project consists of 1-N well-defined workbenches. A workbench represents a fully configured development environment of a specific “Workbench Type”. For example, a workbench of type “Simulation” allows for configuration and execution of Simulation Jobs based on drive data. Each workbench is implemented via a well-defined number of AWS Accounts, which is called an AWS Account Set.
- An AWS Account Set always includes at least a Toolchain Account, Dev Account, QA Account and Prod Account. All AWS Accounts are baselined with IAM resources, security services and potentially workbench specific blueprints so development can start quickly for the end-user.

The following diagram illustrates the high-level architecture:

Figure 2 – High-level architecture diagram

The CAEdge framework uses a data mesh architecture using AWS Lake Formation and Glue to create the tenant-level data lake. The Reference Architecture for Autonomous Driving Data Lake is used to design the Simulation workbench.

Implementation Details

With the core architecture and terminology defined, let’s look at the implementation details of the architecture that was described in the preceding image.

Isolated Tenants – Achieving a High Degree of Separation

To achieve a multi-tenant environment, we followed a multi-layered security hardening approach:

Tenant Separation on AWS Account Level: Starting at the AWS Account level, we used individual AWS Accounts where possible. An AWS account can never be assigned to more than one tenant. The functional scope of an AWS Account is kept as small as possible. This increases the number of total AWS Accounts, but significantly reduces the blast radius in case of any breach or incident. Just to give an example:
- Each Dev, QA, and Prod Stage of a Workbench is its own AWS Account. No AWS Account ever combines multiple stages at once.
- Each CAEdge tenant-owned data lake consists of multiple AWS Accounts. A data lake also requires updates as time passes. To allow for side-effect free and well tested updates of the data lake-infrastructure, each tenant comes with a Dev, QA, and Prod data lake.
Tenant Separation via a well-defined Organizational Unit (OU) structure and Service Control Policies (SCP): Each Tenant gets assigned a dedicated OU structure with multiple sub-OUs. This allows for tenant-specific security hardening on SCP-level and potential custom security hardening, in case dedicated tenants have specific security requirements. The SCPs are designed in such a way to allow for a maximum degree of freedom for the individual AWS Account user; while at the same time protecting the integrity of CAEdge and while staying compliant and secure according to specific requirements.
Tenant Separation through an AWS Account Metadata-Layer and automated IAM assignments: The CAedge framework uses a central Amazon DynamoDB database that maps AWS Accounts to Tenants and stores any other Metadata in the Context of an AWS Account. This includes including the Account Owner, Account Type, and Cost-related information. With this database, we can query AWS Accounts based on specific Tenants, Projects, and Workbenches. Furthermore, this forms the foundation of a fully automated permission and AWS Account access-management capability that enforces any Tenant, Project and Workbench boundary.
Tenant Separation Security Controls via AWS Security Services: On top of the standard AWS security services, such as AWS GuardDuty, AWS Config, AWS Inspector and AWS SecurityHub, we use IAM Access Analyzer in combination with our DynamoDB Account Metadata Store to detect the creation of any cross-account permissions that span outside of the AWS Organization, or that may have Cross-Tenant implications.

Automated creation and management of Tenant-Owned Resources and Services, Projects and Workbenches

CAEdge follows the “Everything-as-an-API Approach” and is designed as a fully open platform on the internet. All key features are exposed via a secured, public API. This includes the creation of Projects, Workbenches, and AWS Accounts including the management of access rights in a self-service manner for authorized users, as well as any updates affecting subsequent long-term management. This can only be achieved through a very high degree of automation.

We architect the following services to achieve this high degree of automation:

AWS Control Tower – An AWS managed service for account creation and OU assignment.
AWS Deployment Framework (AWS ADF) – an extensive and flexible framework to manage and deploy resources across multiple AWS Accounts and Regions within an AWS Organization. We use ADF to baseline all accounts with the resources required. This includes all security services, default IAM Roles, network related resources, such as VPCs and DNS and any other resources specific to the AWS Account type.
AWS Single Sign-On (AWS SSO) – A central IAM solution to control access to AWS Accounts. AWS SSO assignments are fully automated based on our defined access patterns using our custom Dispatch application and an extended version of the AWS SSO Enterprise solution.
AWS DynamoDB – A fully managed NoSQL database service for storing tenant, project and AWS Account data. Including information related to ownership, cost management, access management.
Dispatch CAEdge Web Application – A fully serverless web application that exposes functionality to end-users via API calls. It handles authentication, authorization, and provides business logic in the form of AWS Lambda functions to orchestrate all of the aforementioned services.

The following diagram provides a high-level overview of the automation mechanism at the platform level:

Figure 3 – High-level overview of the automation mechanism

With this solution in place, Continental enables OEMs, suppliers, and other partners to spin up developer workbenches in a tenant context within minutes; thereby reducing the setup time from weeks to minutes using a self-service approach.

Conclusion

In this post, we showed how Continental built a secure multi-tenant platform that serves as the scalable foundation for software-intensive, vehicle-related workloads. For other organizations experiencing challenges when transforming into a software-centric organization, this solution eases the onboarding process so developers can start building within hours instead of months.

Building a Cloud-Native File Transfer Platform Using AWS Transfer Family Workflows

2022-01-05 Shoeb Bustani

Post Syndicated from Shoeb Bustani original https://aws.amazon.com/blogs/architecture/building-a-cloud-native-file-transfer-platform-using-aws-transfer-family-workflows/

File-based transfers are one of the most prevalent mechanisms for organizations to exchange data over various interfaces with their partners and consumers. There are specialized third-party managed file transfer (MFT) products available in the market that provide rich workflows for managing these transfers.

A typical MFT platform provides features to perform a series of linked pre- and post-file upload processing steps. The new managed workflows feature within AWS Transfer Family allows you to define a lightweight workflow that is invoked in response to file uploads. This feature, combined with the core SFTP, FTPS, and FTP functionality, enables you to build a cloud-native MFT platform for your organization. The workflows are also integrated with Amazon CloudWatch to provide complete traceability.

Before this feature was released, the MFT architecture based on Transfer Family involved responding to Amazon Simple Storage Service (Amazon S3) events within AWS Lambda functions. There was no overarching orchestration layer. With the new managed workflows feature, the sequencing of steps and error handling is greatly simplified.

In this blog, I show you how to architect common MFT scenarios using the new Transfer Family managed workflows feature. This will help you build a robust and well-integrated cloud-native MFT platform.

Scenario 1: Inbound Flow – file push by external providers

In this scenario, a file is supplied by an external data provider. It must be decrypted, checked for errors, and transferred to an internal application area (Amazon S3 bucket) for further processing by an application.

The internal application that processes the file could be an in-house Java application, an Enterprise Resourcing Planning system that processes payments, telecommunication billing system that consumes call data, or even financial regulatory organization that scans daily share trading data for anomalies.

The architecture for this scenario is presented in Figure 1. Here’s how it works:

The external data provider connects to the organization’s public Transfer Family endpoint and provides the authentication credentials.
The service authenticates the user via the pre-configured authentication mechanism. This could be a custom identity provider, AWS Directory Service, or service managed.
Once authenticated, the data provider uploads the file to a logical folder. This results in the file being stored in the underlying Upload S3 bucket.
Transfer Family initiates the configured workflow once the file has been uploaded to the S3 bucket. The workflow performs the required pre-processing steps, including:
- Invoking a Lambda function to decrypt the file.
- Invoking a Lambda function to ensure the file data is valid.
- Copying the file to the Application S3 bucket.
- Deleting or archiving the file by copying it to another S3 bucket or storing it with a different S3 prefix.
- If an error occurs, the workflow exception handler moves (copy and delete) the file to the Quarantine S3 bucket or stores it with a different S3 prefix.

Figure 1. MFT inbound flow – push by data provider

Scenario 2: Outbound flow – file push to external consumers

In this scenario, an internal application generates files that are to be provided to external parties. Examples include submissions to credit check agencies, direct debits or payment files to banking institutions. These files must be re-formatted, encrypted, and transferred to an external SFTP site or an API endpoint.

The architecture for implementing this scenario is presented in Figure 2. Here’s how it works:

An internal application connects to the organization’s Transfer Family’s private SFTP endpoint hosted within Amazon Virtual Private Cloud (Amazon VPC) and provides the authentication information.
The service authenticates the application using the pre-configured authentication mechanism. This could be a custom identity provider, Directory Service, or service managed.
Once authenticated, the application uploads the file to a logical folder. This results in the file being stored in the underlying Upload S3 bucket.
Transfer Family initiates the configured workflow once the file has been uploaded to the S3 bucket. The workflow performs the required processing steps, including:
- Invoking a Lambda function to reformat and encrypt the file.
- Invoking a Lambda function to transfer the file to the external SFTP site or API endpoint.
- Copying the transferred file to the Processed S3 bucket or storing the file with a different Amazon S3 prefix.
- Emptying the internal upload folder by deleting the file.
- In case of errors, the workflow exception handler moves (copy and delete) the file to Error S3 bucket or stores with a different S3 prefix.

Figure 2. MFT outbound flow – push to data consumer

Scenario 3: Outbound Flow – file pull by external consumers

In this scenario, an internal application generates files that are to be provided to external parties. However, in this case, the files are downloaded or “pulled” from the external facing SFTP download folder by the consumers.

Examples include scenarios wherein external parties have a pre-defined schedule to download files or the consumers need to download the files manually in absence of an SFTP endpoint on their side.

The architecture for this scenario is presented in Figure 3. In this case, two instances of Transfer Family are created:

Internal facing private instance from Scenario 2.
External facing public instance to be used by the consumer for file downloads.

Here’s how it works:

Steps A through D. Flow remains the same as Scenario 2, except the internal workflow task uploads files to the S3 bucket underneath the external facing instance of Transfer Family.
E. The external consumer connects to the organization’s public Transfer Family endpoint and provides the authentication credentials.
F. The external facing Transfer Family service instance authenticates the consumer using the pre-configured authentication mechanism. This could be a custom identity provider, AWS Directory Service, or service managed.
G. Once authenticated, the data consumer downloads the file from the external Transfer Family SFTP server instance.

Figure 3. MFT outbound flow – pull by data consumer

Conclusion

The new managed workflow feature within Transfer Family provides a simple mechanism to create file transfer flows. In this blog post, I showed you some of the common use cases you can implement using this new feature. You can combine this architecture approach with additional AWS services to build a robust and well-integrated cloud native managed file transfer platform.

Related information

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

Top 5: Featured Architecture Content from December 2021

2022-01-04 Ellen Crowley

Post Syndicated from Ellen Crowley original https://aws.amazon.com/blogs/architecture/top-5-featured-architecture-content-from-december-2021/

The AWS Architecture Center provides new and notable reference architecture diagrams, vetted architecture solutions, AWS Well-Architected best practices, whitepapers, and more. This blog post features some of our best picks from December’s new and updated content.

1. Sustainability Pillar – AWS Well-Architected Framework

This new pillar in the Well-Architected framework helps organizations learn, measure, and improve their workloads using environmental best practices for cloud computing. Did you know that the shared responsibility model also applies to sustainability? You can use the pillar to track your progress against best practices to support sustainability. Your development teams can also use this new pillar and Well-Architected best practices to support many sustainability use cases. These can include reducing the energy or resources required to run workloads and anticipate and adopt new and more efficient technology offerings.

2. Retail Customer Service Contact Center

Increasingly, customers expect to be able to ask questions and get assistance quickly and through various channels. The companies that embrace this the fastest see increases in customer engagement and satisfaction. This reference architecture shows physical and ecommerce retailers how to build a next generation customer contact center. It aims to simplify and transform their customer service channels with natural language processing and automation.

Retail Customer Service Reference Architecture Diagram

Retail Customer Service Contact Center Reference Architecture Diagram

3. Establishing Your Cloud Foundation on AWS

When planning a cloud adoption strategy you are often faced with a number of complex decisions to stand up and scale a production-ready cloud environment. This whitepaper guides you through building and evolving your AWS Cloud environment based on a set of definitions, use cases, guidance, and automations.

4. Hybrid Networking Lens of the Well-Architected Framework

This new lens is intended for those in technology roles, such as chief technology officers (CTOs), architects, developers, and operations team members. It provides AWS best practices and strategies to use when designing hybrid networking architectures. If you’re looking to build hybrid networking architectures to integrate your on-premises data center and AWS operations to support a broad spectrum of use cases, this lens will help set you up for success. It outlines three areas to consider when designing hybrid network connectivity for your workload: data layer, monitoring and configuration management, and security.

5. AWS Virtual Waiting Room

This Solutions Implementation helps buffer incoming user requests to your website during large bursts of traffic. It creates a cloud infrastructure designed to temporarily offload incoming traffic to your website, and it provides options to customize and integrate a virtual waiting room. The waiting room acts as a holding area for visitors to your website and allows traffic to pass through when there is enough capacity.

Looking for more new and updated content from this year? Check out the other posts in the Top 5 series!

Looking for more architecture content? AWS Architecture Center provides reference architecture diagrams, vetted architecture solutions, Well-Architected best practices, patterns, icons, and more!

How Meshify Built an Insurance-focused IoT Solution on AWS

2022-01-03 Grant Fisher

Post Syndicated from Grant Fisher original https://aws.amazon.com/blogs/architecture/how-meshify-built-an-insurance-focused-iot-solution-on-aws/

The ability to analyze your Internet of Things (IoT) data can help you prevent loss, improve safety, boost productivity, and even develop an entirely new business model. This data is even more valuable, with the ever-increasing number of connected devices. Companies use Amazon Web Services (AWS) IoT services to build innovative solutions, including secure edge device connectivity, ingestion, storage, and IoT data analytics.

This post describes Meshify’s IoT sensor solution, built on AWS, that helps businesses and organizations prevent property damage and avoid loss for the property-casualty insurance industry. The solution uses real-time data insights, which result in fewer claims, better customer experience, and innovative new insurance products.

Through low-power, long-range IoT sensors, and dedicated applications, Meshify can notify customers of potential problems like rapid temperature decreases that could result in freeze damage, or rising humidity levels that could lead to mold. These risks can then be averted, instead of leading to costly damage that can impact small businesses and the insurer’s bottom line.

Architecture building blocks

The three building blocks of this technical architecture are the edge portfolio, data ingestion, and data processing and analytics, shown in Figure 1.

Figure 1. Building blocks of Meshify’s technical architecture

I. Edge portfolio (EP)

Starting with the edge sensors, the Meshify edge portfolio covers two types of sensors:

LoRaWAN (Low power, long range WAN) sensor suite. This sensor provides the long connectivity range (> 1000 feet) and extended battery life (~ 5 years) needed for enterprise environments.
Cellular-based sensors. This sensor is a narrow band/LTE-M device that operates at LTE-M band 2/4/12 radio frequency and uses edge intelligence to conserve battery life.

II. Data ingestion (DI)

For the LoRaWAN solution, aggregated sensor data at the Meshify gateway is sent to AWS using AWS IoT Core and Meshify’s REST service endpoints. AWS IoT Core is a managed cloud platform that lets IoT devices easily and securely connect using multiple protocols like HTTP, MQTT, and WebSockets. It expands its protocol coverage through a new fully managed feature called AWS IoT Core for LoRaWAN. This gives Meshify the ability to connect LoRaWAN wireless devices with the AWS Cloud. AWS IoT Core for LoRaWAN delivers a LoRaWAN network server (LNS) that provides gateway management using the Configuration and Update Server (CUPS) and Firmware Updates Over-The-Air (FUOTA) capabilities.

III. Data processing and analytics (DPA)

Initial processing of the data is done at the ingestion layer, using Meshify REST API endpoints and the Rules Engine of AWS IoT Core. Meshify applies filtering logic to route relevant events to Amazon Managed Streaming for Apache Kafka (Amazon MSK). Amazon MSK is an AWS streaming data service that manages Apache Kafka infrastructure and operations, streamlining the process of running Apache Kafka applications on AWS.

Meshify’s applications then consume the events from Amazon MSK per the configured topic subscription. They enrich and correlate the events with the records with a managed service, Amazon Relational Database Service (RDS). These applications run as scalable containers on another managed service, Amazon Elastic Kubernetes Service (EKS), which runs container applications.

Bringing it all together – technical workflow

In Figure 2, we illustrate the technical workflow from the ingestion of field events to their processing, enrichment, and persistence. Finally, we use these events to power risk avoidance decision-making.

Figure 2. Technical workflow for Meshify IoT architecture

After installation, Meshify-designed LoRa sensors transmit information to the cloud through Meshify’s gateways. LoRaWAN capabilities create connectivity between the sensors and the gateways. They establish a low power, wide area network protocol that securely transmits data over a long distance, through walls and floors of even the largest buildings.
The Meshify Gateway is a redundant edge system, capable of sending sensor data from various sensors to the Meshify cloud environment. Once the LoRa sensor information is received by the Meshify Gateway, it converts the incoming radio frequency (RF) signals, which support faster transfer rate to Meshify’s cloud environment.
Data from the Meshify Gateway and sensors is initially processed at Meshify’s AWS IoT Core and REST service endpoints. These destinations for IoT streaming data help with the initial intake and introduce field data to the Meshify cloud environment. The initial ingestion points can scale automatically based upon the volume of sensor data received. This enables rapid scaling and ease of implementation.
After the data has entered the Meshify cloud environment, Meshify uses Amazon EKS and Amazon MSK to process the incoming data stream. Amazon MSK producer and consumer applications within the EKS systems enrich the data streams for the end users and systems to consume.
Producer applications running on EKS send processed events to the Amazon MSK service. These events include storing and retrieval of raw data, enriched data, and system-level data.
Consumer applications hosted on the EKS pods receive events per the subscribed Amazon MSK topic. Web, mobile, and analytic applications enrich and use these data streams to display data to end users, business teams, and systems operations.
Processed events are persisted in Amazon RDS. The databases are used for reporting, machine learning, and other analytics and processing services.

Building a scalable IoT solution

Meshify first began work on the Meshify sensors and hosted platform in 2012. In the ensuing decade, Meshify has successfully created a platform to auto-scale upon demand with steady, predictable performance. This gave Meshify both the ability to use only the resources needed, and still have the capacity to handle unexpected voluminous data.

As the platform scaled, so did the volume of sensor data, operations and diagnostics data, and metadata from installations and deployments. Building an end-to-end data pipeline that integrates these different data sources and delivers co-related insights at low latency was time well spent.

Conclusion

In this post, we’ve shown how Meshify is using AWS services to power their suite of IoT sensors, software, and data platforms. Meshify’s most important architectural enhancements have involved the introduction of managed services, notably AWS IoT Core for LoRaWAN and Amazon MSK. These improvements have primarily focused on the data ingestion, data processing, and analytics stages.

Meshify continues to power the data revolution at the intersection of IoT and insurance at the edge, using AWS. Looking ahead, Meshify and HSB are excited at the prospect of scaling the relationship with AWS from cloud computing to the world of edge devices.

Learn more about how emerging startups and large enterprises are using AWS IoT services to build differentiated products.

Meshify is an IoT technology company and subsidiary of HSB, based in Austin, TX. Meshify builds pioneering sensor hardware, software, and data analytics solutions that protect businesses from property and equipment damage.

2021-12-24 Patrick Gryczka

Post Syndicated from Patrick Gryczka original https://aws.amazon.com/blogs/architecture/how-femr-delivers-cryptographically-secure-and-verifiable-emr-medical-data-with-amazon-qldb/

This post was co-written by Team fEMR’s President & Co-founder, Sarah Draugelis; CTO, Andy Mastie; Core Team Engineer & Fennel Labs Co-founder, Sean Batzel; Patrick Gryczka, AWS Solutions Architect; Mithil Prasad, AWS Senior Customer Solutions Manager.

Team fEMR is a non-profit organization that created a free, open-source Electronic Medical Records system for transient medical teams. Their system has helped bring aid and drive data driven communication in disaster relief scenarios and low resource settings around the world since 2014. In the past year, Team fEMR integrated Amazon Quantum Ledger Database (QLDB) as their HIPAA compliant database solution to address their need for data integrity and verifiability.

When delivering aid to at risk communities and within challenging social and political environments, data veracity is a critical issue. Patients need to trust that their data is confidential and following an appropriate chain of ownership. Aid organizations meanwhile need to trust the demographic data provided to them to appropriately understand the scope of disasters and verify the usage of funds. Amazon QLDB is backed by an immutable append-only journal. This journal is verifiable, making it easier for Team fEMR to engage in external audits and deliver a trusted and verifiable EMR solution. The teams that use the new fEMR system are typically working or volunteering in post-disaster environments, refugee camps, or in other mobile clinics that offer pre-hospital care.

In this blog post, we explore how Team fEMR leveraged Amazon QLDB and other AWS Managed Services to enable their relief efforts.

Background

Before the use of an electronic record keeping system, these teams often had to use paper records, which would easily be lost or destroyed. The new fEMR system allows the clinician to look up the patient’s history of illness, in order to provide the patient with a seamless continuity of care between clinic visits. Additionally, the collection of health data on a more macroscopic level allows researchers and data scientists to monitor for disease. This is- an especially important aspect of mobile medicine in a pandemic world.

The fEMR system has been deployed worldwide since 2014. In the original design, the system functioned solely in an on-premises environment. Clinicians were able to attach their own devices to the system, and have access to the EMR functionality and medical records. While the need for a standalone solution continues to exist for environments without data coverage and connectivity, demand for fEMR has increased rapidly and outpaced deployment capabilities as well as hardware availability. To solve for real-time deployment and scalability needs, Team fEMR migrated fEMR to the cloud and developed a HIPAA-compliant and secure architecture using Amazon QLDB.

As part of their cloud adoption strategy, Team fEMR decided to procure more managed services, to automate operational tasks and to optimize their resources.

Architecture showing how How fEMR Delivers Cryptographically Secure and Verifiable EMR Medical Data with Amazon QLDB

Figure 1 – Architecture showing How fEMR Delivers Cryptographically Secure and Verifiable EMR Medical Data with Amazon QLDB

The team built the preceding architecture using a combination of the following AWS managed services:

1. Amazon Quantum Ledger Database (QLDB) – By using QLDB, Team fEMR were able to build an unfalsifiable, HIPAA compliant, and cryptographically verifiable record of medical histories, as well as for clinic pharmacy usage and stocking.

2. AWS Elastic Beanstalk – Team fEMR uses Elastic Beanstalk to deploy and run their Django front end and application logic. It allows their agile development team to focus on development and feature delivery, by offloading the operational work of deployment, patching, and scaling.

3. Amazon Relational Database Service (RDS) – Team fEMR uses RDS for ad-hoc search, reporting, and analytics, whereas QLDB is not optimized for the specific requirement.

4. Amazon ElastiCache – Team fEMR uses Amazon ElastiCache to cache user session data to provide near real-time response times.

Data Security considerations

Data Security was a key consideration when building the fEMR EHR solution. End users of the solution are often working in environments with at-risk populations. That is, patients who may be fleeing persecution from their home country, or may be at risk due to discriminatory treatment. It is therefore imperative to secure their data. QLDB as a data storage mechanism provides a cryptographically secure history of all changes to data. This has the benefit of improved visibility and auditability and is invaluable in situations where medical data needs to be as tamper-evident as possible.

Using Auto Scaling to minimize operational effort

When Team fEMR engages with disaster relief, they deal with ambiguity around both when events may occur and at what scale their effects may be felt. By leveraging managed services like QLDB, RDS, and Elastic Beanstalk, Team fEMR was able to minimize the time their technical team spent on systems operations. Instead, they can focus optimizing and improving their technology architectures.

Use of Infrastructure as Code to enable fast global deployment

With Infrastructure as Code, Team fEMR was able to create repeatable deployments. They utilized AWS CloudFormation to deploy their Elasticache, RDS, QLDB, and Elastic Beanstalk environment. Elastic Beanstalk was used to further automate the deployment of infrastructure for their Django stack. Repeatability of deployment enables the team to have the flexibility they need to deploy in certain regions due to geographic and data sovereignty requirements.

Optimizing the architecture

The team found that simultaneous writing into two databases could cause inconsistencies if a write succeeds in one database and fails in the other. In addition, it puts the burden of identifying errors and rolling back updates on the application. Therefore, an improvement planned for this architecture is to stream successful transactions from Amazon QLDB to Amazon RDS using Amazon Kinesis Data Streams. This service provides them a way to replicate data from Amazon QLDB to Amazon RDS, and any other databases or dashboards. QLDB remains as as their System of Record.

Figure 2 – Optimized Architecture using Amazon Kinesis Data Streams

Conclusion

As a result of migrating their system to the cloud, Team fEMR was able to deliver their EMR system with less operational overhead and instead focus on bringing their solution to the communities that need it. By using Amazon QLDB, Team fEMR was able to make their solution easier to audit and enabled more trust in their work with at-risk populations.

To learn more about Team fEMR, you can read about their efforts on their organization’s website, and explore their Open Source contributions on GitHub.

For hands on experience with Amazon QLDB you can reference our QLDB Workshops and explore our QLDB Developer Guide.

Overview of solution

Customer IVR experience

Solution walkthrough

Providing a personalized experience for callers

Transaction tracking

Making payments via the payment portal

Generating a chat transcript for agents

Conclusion

Architect for lean, efficient, and durable devices

High-level sustainable IoT architecture

Introducing AWS IoT Core and AWS IoT Greengrass to your architecture

Optimize your IoT devices for leanness and efficiency with AWS IoT Core

Ensure device durability and longevity with AWS IoT Greengrass

Conclusion

Related information

Overview of our conceptual financial crime discovery solution

Using the AWS Cloud to scale up a graph database

Financial crimes rules

Generating data

Walkthrough of the financial investigator workflow

Prerequisites

Rules

Run the job and check your results

Visualization

Cleaning up

Conclusion

References

Further reading

Background: What is an RLN?

A queue – process – queue architectural model

Fine tuning RLN system components

Successful throughput testing

Looking forward

Pulse prototype requirements

Enter AWS serverless

Overview of Pulse’s solution architecture

Ingestion module

Events module

Message bus

Other modules

Summary and next steps

See you next time!

Failover plan dependencies and considerations

Static stability

Testing your disaster recovery plan

Chaos engineering

Conclusion

Designing network connectivity that is reliable, secure, and highly performant

Architecting for high performance and resiliency

Conclusion

Ribbon Identity Hub

Data plane (call-transaction)

Control plane

Conclusion

Affordability verification and data categorisation in digital journeys

Deploying and publishing Experian CaaS ML models to Amazon SageMaker

Security and compliance using CaaS

Implementing upgrades to CaaS ML models

Conclusion

Considerations with replicating data

Replicating objects and files

Copying backups

Spanning non-relational databases across Regions

Spanning relational databases across Regions

Summary

Related posts

Solution benefits for global applications

Overview of solution

Walkthrough

Prerequisites

Steps

Cleaning up

Conclusion

#10: Disaster Recovery (DR) Architecture on AWS, Part IV: Multi-site Active/Active

#9: Scaling up a Serverless Web Crawler and Search Engine

#8: Managing Asynchronous Workflows with a REST API

#7: Data Caching Across Microservices in a Serverless Architecture

#6: Disaster Recovery (DR) Architecture on AWS, Part III: Pilot Light and Warm Standby

#5: Issues to Avoid When Implementing Serverless Architecture with AWS Lambda

#4: Using Route 53 Private Hosted Zones for Cross-account Multi-region Architectures