More than 100,000 Zyxel firewalls, VPN gateways, and access point controllers contain a hardcoded admin-level backdoor account that can grant attackers root access to devices via either the SSH interface or the web administration panel.
Installing patches removes the backdoor account, which, according to Eye Control researchers, uses the “zyfwp” username and the “PrOw!aN_fXp” password.
“The plaintext password was visible in one of the binaries on the system,” the Dutch researchers said in a report published before the Christmas 2020 holiday.
Earlier this year, Apple patched one of the most breathtaking iPhone vulnerabilities ever: a memory corruption bug in the iOS kernel that gave attackers remote access to the entire device — over Wi-Fi, with no user interaction required at all. Oh, and exploits were wormable — meaning radio-proximity exploits could spread from one nearby device to another, once again, with no user interaction needed.
Beer’s attack worked by exploiting a buffer overflow bug in a driver for AWDL, an Apple-proprietary mesh networking protocol that makes things like Airdrop work. Because drivers reside in the kernel — one of the most privileged parts of any operating system — the AWDL flaw had the potential for serious hacks. And because AWDL parses Wi-Fi packets, exploits can be transmitted over the air, with no indication that anything is amiss.
Beer developed several different exploits. The most advanced one installs an implant that has full access to the user’s personal data, including emails, photos, messages, and passwords and crypto keys stored in the keychain. The attack uses a laptop, a Raspberry Pi, and some off-the-shelf Wi-Fi adapters. It takes about two minutes to install the prototype implant, but Beer said that with more work a better written exploit could deliver it in a “handful of seconds.” Exploits work only on devices that are within Wi-Fi range of the attacker.
There is no evidence that this vulnerability was ever used in the wild.
The issue is with a protocol called Cross-Transport Key Derivation (or CTKD, for short). When, say, an iPhone is getting ready to pair up with Bluetooth-powered device, CTKD’s role is to set up two separate authentication keys for that phone: one for a “Bluetooth Low Energy” device, and one for a device using what’s known as the “Basic Rate/Enhanced Data Rate” standard. Different devices require different amounts of data — and battery power — from a phone. Being able to toggle between the standards needed for Bluetooth devices that take a ton of data (like a Chromecast), and those that require a bit less (like a smartwatch) is more efficient. Incidentally, it might also be less secure.
According to the researchers, if a phone supports both of those standards but doesn’t require some sort of authentication or permission on the user’s end, a hackery sort who’s within Bluetooth range can use its CTKD connection to derive its own competing key. With that connection, according to the researchers, this sort of erzatz authentication can also allow bad actors to weaken the encryption that these keys use in the first place — which can open its owner up to more attacks further down the road, or perform “man in the middle” style attacks that snoop on unprotected data being sent by the phone’s apps and services.
Patches are not immediately available at the time of writing. The only way to protect against BLURtooth attacks is to control the environment in which Bluetooth devices are paired, in order to prevent man-in-the-middle attacks, or pairings with rogue devices carried out via social engineering (tricking the human operator).
However, patches are expected to be available at one point. When they’ll be, they’ll most likely be integrated as firmware or operating system updates for Bluetooth capable devices.
The timeline for these updates is, for the moment, unclear, as device vendors and OS makers usually work on different timelines, and some may not prioritize security patches as others. The number of vulnerable devices is also unclear and hard to quantify.
Many Bluetooth devices can’t be patched.
Final note: this seems to be another example of simultaneous discovery:
According to the Bluetooth SIG, the BLURtooth attack was discovered independently by two groups of academics from the École Polytechnique Fédérale de Lausanne (EPFL) and Purdue University.
Last year, ZecOps discovered two iPhone zero-day exploits. They will be patched in the next iOS release:
Avraham declined to disclose many details about who the targets were, and did not say whether they lost any data as a result of the attacks, but said “we were a bit surprised about who was targeted.” He said some of the targets were an executive from a telephone carrier in Japan, a “VIP” from Germany, managed security service providers from Saudi Arabia and Israel, people who work for a Fortune 500 company in North America, and an executive from a Swiss company.
On the other hand, this is not as polished a hack as others, as it relies on sending an oversized email, which may get blocked by certain email providers. Moreover, Avraham said it only works on the default Apple Mail app, and not on Gmail or Outlook, for example.
The vulnerability exists in Wi-Fi chips made by Cypress Semiconductor and Broadcom, the latter a chipmaker Cypress acquired in 2016. The affected devices include iPhones, iPads, Macs, Amazon Echos and Kindles, Android devices, and Wi-Fi routers from Asus and Huawei, as well as the Raspberry Pi 3. Eset, the security company that discovered the vulnerability, said the flaw primarily affects Cypress’ and Broadcom’s FullMAC WLAN chips, which are used in billions of devices. Eset has named the vulnerability Kr00k, and it is tracked as CVE-2019-15126.
Manufacturers have made patches available for most or all of the affected devices, but it’s not clear how many devices have installed the patches. Of greatest concern are vulnerable wireless routers, which often go unpatched indefinitely.
That’s the real problem. Many of these devices won’t get patched — ever.
The Zoom conferencing app hasavulnerability that allows someone to remotely take over the computer’s camera.
It’s a bad vulnerability, made worse by the fact that it remains even if you uninstall the Zoom app:
This vulnerability allows any website to forcibly join a user to a Zoom call, with their video camera activated, without the user’s permission.
On top of this, this vulnerability would have allowed any webpage to DOS (Denial of Service) a Mac by repeatedly joining a user to an invalid call.
Additionally, if you’ve ever installed the Zoom client and then uninstalled it, you still have a localhost web server on your machine that will happily re-install the Zoom client for you, without requiring any user interaction on your behalf besides visiting a webpage. This re-install ‘feature’ continues to work to this day.
Zoom didn’t take the vulnerability seriously:
This vulnerability was originally responsibly disclosed on March 26, 2019. This initial report included a proposed description of a ‘quick fix’ Zoom could have implemented by simply changing their server logic. It took Zoom 10 days to confirm the vulnerability. The first actual meeting about how the vulnerability would be patched occurred on June 11th, 2019, only 18 days before the end of the 90-day public disclosure deadline. During this meeting, the details of the vulnerability were confirmed and Zoom’s planned solution was discussed. However, I was very easily able to spot and describe bypasses in their planned fix. At this point, Zoom was left with 18 days to resolve the vulnerability. On June 24th after 90 days of waiting, the last day before the public disclosure deadline, I discovered that Zoom had only implemented the ‘quick fix’ solution originally suggested.
This is why we disclose vulnerabilities. Now, finally, Zoom is taking this seriously and fixing it for real.
Zcash just fixed a vulnerability that would have allowed “infinite counterfeit” Zcash.
Like all the other blockchain vulnerabilities and updates, this demonstrates the ridiculousness of the notion that code can replace people, that trust can be encompassed in the protocols, or that human governance is not ncessary.
The bug underscores the primary risk posed by IoT devices and connected appliances. Because they are commonly built by bolting on network connectivity to existing appliances, many IoT devices have little in the way of built-in network security.
Even when security measures are added to the devices, the third-party hardware used to make the appliances “smart” can itself contain security flaws or bad configurations that leave the device vulnerable.
“IoT devices are frequently overlooked from a security perspective; this may be because many are used for seemingly innocuous purposes such as simple home automation,” the McAfee researchers wrote.
“However, these devices run operating systems and require just as much protection as desktop computers.”
I’ll bet you anything that the plug cannot be patched, and that the vulnerability will remain until people throw them away.
Last week, researchersdisclosed vulnerabilities in a large number of encrypted e-mail clients: specifically, those that use OpenPGP and S/MIME, including Thunderbird and AppleMail. These are seriousvulnerabilities: An attacker who can alter mail sent to a vulnerable client can trick that client into sending a copy of the plaintext to a web server controlled by that attacker. The story of these vulnerabilities and the tale of how they were disclosed illustrate some important lessons about security vulnerabilities in general and e-mail security in particular.
But first, if you use PGP or S/MIME to encrypt e-mail, you need to check the list on this page and see if you are vulnerable. If you are, check with the vendor to see if they’ve fixed the vulnerability. (Note that some early patches turned out not to fix the vulnerability.) If not, stop using the encrypted e-mail program entirely until it’s fixed. Or, if you know how to do it, turn off your e-mail client’s ability to process HTML e-mail or — even better — stop decrypting e-mails from within the client. There’s even more complex advice for more sophisticated users, but if you’re one of those, you don’t need me to explain this to you.
Consider your encrypted e-mail insecure until this is fixed.
All software contains security vulnerabilities, and one of the primary ways we all improve our security is by researchers discovering those vulnerabilities and vendors patching them. It’s a weird system: Corporate researchers are motivated by publicity, academic researchers by publication credentials, and just about everyone by individual fame and the small bug-bounties paid by some vendors.
Software vendors, on the other hand, are motivated to fix vulnerabilities by the threat of public disclosure. Without the threat of eventual publication, vendors are likely to ignore researchers and delay patching. This happened a lot in the 1990s, and even today, vendors often use legal tactics to try to block publication. It makes sense; they look bad when their products are pronounced insecure.
Over the past few years, researchers have started to choreograph vulnerability announcements to make a big press splash. Clever names — the e-mail vulnerability is called “Efail” — websites, and cute logos are now common. Key reporters are given advance information about the vulnerabilities. Sometimes advance teasers are released. Vendors are now part of this process, trying to announce their patches at the same time the vulnerabilities are announced.
This simultaneous announcement is best for security. While it’s always possible that some organization — either government or criminal — has independently discovered and is using the vulnerability before the researchers go public, use of the vulnerability is essentially guaranteed after the announcement. The time period between announcement and patching is the most dangerous, and everyone except would-be attackers wants to minimize it.
Things get much more complicated when multiple vendors are involved. In this case, Efail isn’t a vulnerability in a particular product; it’s a vulnerability in a standard that is used in dozens of different products. As such, the researchers had to ensure both that everyone knew about the vulnerability in time to fix it and that no one leaked the vulnerability to the public during that time. As you can imagine, that’s close to impossible.
Efail was discovered sometime last year, and the researchers alerted dozens of different companies between last October and March. Some companies took the news more seriously than others. Most patched. Amazingly, news about the vulnerability didn’t leak until the day before the scheduled announcement date. Two days before the scheduled release, the researchers unveiled a teaser — honestly, a really bad idea — which resulted in details leaking.
All of this speaks to the difficulty of coordinating vulnerability disclosure when it involves a large number of companies or — even more problematic — communities without clear ownership. And that’s what we have with OpenPGP. It’s even worse when the bug involves the interaction between different parts of a system. In this case, there’s nothing wrong with PGP or S/MIME in and of themselves. Rather, the vulnerability occurs because of the way many e-mail programs handle encrypted e-mail. GnuPG, an implementation of OpenPGP, decided that the bug wasn’t its fault and did nothing about it. This is arguably true, but irrelevant. They should fix it.
Expect more of these kinds of problems in the future. The Internet is shifting from a set of systems we deliberately use — our phones and computers — to a fully immersive Internet-of-things world that we live in 24/7. And like this e-mail vulnerability, vulnerabilities will emerge through the interactions of different systems. Sometimes it will be obvious who should fix the problem. Sometimes it won’t be. Sometimes it’ll be two secure systems that, when they interact in a particular way, cause an insecurity. In April, I wrote about a vulnerability that arose because Google and Netflix make different assumptions about e-mail addresses. I don’t even know who to blame for that one.
It gets even worse. Our system of disclosure and patching assumes that vendors have the expertise and ability to patch their systems, but that simply isn’t true for many of the embedded and low-cost Internet of things software packages. They’re designed at a much lower cost, often by offshore teams that come together, create the software, and then disband; as a result, there simply isn’t anyone left around to receive vulnerability alerts from researchers and write patches. Even worse, many of these devices aren’t patchable at all. Right now, if you own a digital video recorder that’s vulnerable to being recruited for a botnet — remember Mirai from 2016? — the only way to patch it is to throw it away and buy a new one.
Patching is starting to fail, which means that we’re losing the best mechanism we have for improving software security at exactly the same time that software is gaining autonomy and physical agency. Many researchers and organizations, including myself, have proposed government regulations enforcing minimal security standards for Internet-of-things devices, including standards around vulnerability disclosure and patching. This would be expensive, but it’s hard to see any other viable alternative.
Getting back to e-mail, the truth is that it’s incredibly difficult to secure well. Not because the cryptography is hard, but because we expect e-mail to do so many things. We use it for correspondence, for conversations, for scheduling, and for record-keeping. I regularly search my 20-year e-mail archive. The PGP and S/MIME security protocols are outdated, needlessly complicated and have been difficult to properly use the whole time. If we could start again, we would design something better and more user friendlybut the huge number of legacy applications that use the existing standards mean that we can’t. I tell people that if they want to communicate securely with someone, to use one of the secure messaging systems: Signal, Off-the-Record, or — if having one of those two on your system is itself suspicious — WhatsApp. Of course they’re not perfect, as last week’s announcement of a vulnerability (patched within hours) in Signal illustrates. And they’re not as flexible as e-mail, but that makes them easier to secure.
The EU’s General Data Protection Regulation (GDPR) describes data processor and data controller roles, and some customers and AWS Partner Network (APN) partners are asking how this affects the long-established AWS Shared Responsibility Model. I wanted to take some time to help folks understand shared responsibilities for us and for our customers in context of the GDPR.
How does the AWS Shared Responsibility Model change under GDPR? The short answer – it doesn’t. AWS is responsible for securing the underlying infrastructure that supports the cloud and the services provided; while customers and APN partners, acting either as data controllers or data processors, are responsible for any personal data they put in the cloud. The shared responsibility model illustrates the various responsibilities of AWS and our customers and APN partners, and the same separation of responsibility applies under the GDPR.
AWS responsibilities as a data processor
The GDPR does introduce specific regulation and responsibilities regarding data controllers and processors. When any AWS customer uses our services to process personal data, the controller is usually the AWS customer (and sometimes it is the AWS customer’s customer). However, in all of these cases, AWS is always the data processor in relation to this activity. This is because the customer is directing the processing of data through its interaction with the AWS service controls, and AWS is only executing customer directions. As a data processor, AWS is responsible for protecting the global infrastructure that runs all of our services. Controllers using AWS maintain control over data hosted on this infrastructure, including the security configuration controls for handling end-user content and personal data. Protecting this infrastructure, is our number one priority, and we invest heavily in third-party auditors to test our security controls and make any issues they find available to our customer base through AWS Artifact. Our ISO 27018 report is a good example, as it tests security controls that focus on protection of personal data in particular.
AWS has an increased responsibility for our managed services. Examples of managed services include Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon Elastic MapReduce, and Amazon WorkSpaces. These services provide the scalability and flexibility of cloud-based resources with less operational overhead because we handle basic security tasks like guest operating system (OS) and database patching, firewall configuration, and disaster recovery. For most managed services, you only configure logical access controls and protect account credentials, while maintaining control and responsibility of any personal data.
Customer and APN partner responsibilities as data controllers — and how AWS Services can help
Our customers can act as data controllers or data processors within their AWS environment. As a data controller, the services you use may determine how you configure those services to help meet your GDPR compliance needs. For example, AWS Services that are classified as Infrastructure as a Service (IaaS), such as Amazon EC2, Amazon VPC, and Amazon S3, are under your control and require you to perform all routine security configuration and management that would be necessary no matter where the servers were located. With Amazon EC2 instances, you are responsible for managing: guest OS (including updates and security patches), application software or utilities installed on the instances, and the configuration of the AWS-provided firewall (called a security group).
To help you realize data protection by design principles under the GDPR when using our infrastructure, we recommend you protect AWS account credentials and set up individual user accounts with Amazon Identity and Access Management (IAM) so that each user is only given the permissions necessary to fulfill their job duties. We also recommend using multi-factor authentication (MFA) with each account, requiring the use of SSL/TLS to communicate with AWS resources, setting up API/user activity logging with AWS CloudTrail, and using AWS encryption solutions, along with all default security controls within AWS Services. You can also use advanced managed security services, such as Amazon Macie, which assists in discovering and securing personal data stored in Amazon S3.
For more information, you can download the AWS Security Best Practices whitepaper or visit the AWS Security Resources or GDPR Center webpages. In addition to our solutions and services, AWS APN partners can provide hundreds of tools and features to help you meet your security objectives, ranging from network security and configuration management to access control and data encryption.
Researchers havedisclosed a massive vulnerability in the VingCard eletronic lock system, used in hotel rooms around the world:
With a $300 Proxmark RFID card reading and writing tool, any expired keycard pulled from the trash of a target hotel, and a set of cryptographic tricks developed over close to 15 years of on-and-off analysis of the codes Vingcard electronically writes to its keycards, they found a method to vastly narrow down a hotel’s possible master key code. They can use that handheld Proxmark device to cycle through all the remaining possible codes on any lock at the hotel, identify the correct one in about 20 tries, and then write that master code to a card that gives the hacker free reign to roam any room in the building. The whole process takes about a minute.
The two researchers say that their attack works only on Vingcard’s previous-generation Vision locks, not the company’s newer Visionline product. But they estimate that it nonetheless affects 140,000 hotels in more than 160 countries around the world; the researchers say that Vingcard’s Swedish parent company, Assa Abloy, admitted to them that the problem affects millions of locks in total. When WIRED reached out to Assa Abloy, however, the company put the total number of vulnerable locks somewhat lower, between 500,000 and a million.
Patching is a nightmare. It requires updating the firmware on every lock individually.
And the researchers speculate whether or not others knew of this hack:
The F-Secure researchers admit they don’t know if their Vinguard attack has occurred in the real world. But the American firm LSI, which trains law enforcement agencies in bypassing locks, advertises Vingcard’s products among those it promises to teach students to unlock. And the F-Secure researchers point to a 2010 assassination of a Palestinian Hamas official in a Dubai hotel, widely believed to have been carried out by the Israeli intelligence agency Mossad. The assassins in that case seemingly used a vulnerability in Vingcard locks to enter their target’s room, albeit one that required re-programming the lock. “Most probably Mossad has a capability to do something like this,” Tuominen says.
By Mohit Goenka, Gnanavel Shanmugam, and Lance Welsh
At Yahoo Mail, we’re constantly striving to upgrade our product experience. We do this not only by adding new features based on our members’ feedback, but also by providing the best technical solutions to power the most engaging experiences. As such, we’ve recently introduced a number of novel and unique revisions to the way in which we use Redux that have resulted in significant stability and performance improvements. Developers may find our methods useful in achieving similar results in their apps.
Improvements to product metrics
when checking for new emails – 20%
when reading emails – 30%
when sending emails – 20%
10% improvement in page load performance
40% improvement in frame rendering time
We have also reduced API calls by approximately 20%.
How we use Redux in Yahoo Mail
Redux architecture is reliant on one large store that represents the application state. In a Redux cycle, action creators dispatch actions to change the state of the store. React Components then respond to those state changes. We’ve made some modifications on top of this architecture that are atypical in the React-Redux community.
For instance, when fetching data over the network, the traditional methodology is to use Thunk middleware. Yahoo Mail fetches data over the network from our API. Thunks would create an unnecessary and undesirable dependency between the action creators and our API. If and when the API changes, the action creators must then also change. To keep these concerns separate we dispatch the action payload from the action creator to store them in the Redux state for later processing by “action syncers”. Action syncers use the payload information from the store to make requests to the API and process responses. In other words, the action syncers form an API layer by interacting with the store. An additional benefit to keeping the concerns separate is that the API layer can change as the backend changes, thereby preventing such changes from bubbling back up into the action creators and components. This also allowed us to optimize the API calls by batching, deduping, and processing the requests only when the network is available. We applied similar strategies for handling other side effects like route handling and instrumentation. Overall, action syncers helped us to reduce our API calls by ~20% and bring down API errors by 20-30%.
Another change to the normal Redux architecture was made to avoid unnecessary props. The React-Redux community has learned to avoid passing unnecessary props from high-level components through multiple layers down to lower-level components (prop drilling) for rendering. We have introduced action enhancers middleware to avoid passing additional unnecessary props that are purely used when dispatching actions. Action enhancers add data to the action payload so that data does not have to come from the component when dispatching the action. This avoids the component from having to receive that data through props and has improved frame rendering by ~40%. The use of action enhancers also avoids writing utility functions to add commonly-used data to each action from action creators.
In our new architecture, the store reducers accept the dispatched action via action enhancers to update the state. The store then updates the UI, completing the action cycle. Action syncers then initiate the call to the backend APIs to synchronize local changes.
Our novel use of Redux in Yahoo Mail has led to significant user-facing benefits through a more performant application. It has also reduced development cycles for new features due to its simplified architecture. We’re excited to share our work with the community and would love to hear from anyone interested in learning more.
Today, our customers use AWS CloudHSM to meet corporate, contractual and regulatory compliance requirements for data security by using dedicated Hardware Security Module (HSM) instances within the AWS cloud. CloudHSM delivers all the benefits of traditional HSMs including secure generation, storage, and management of cryptographic keys used for data encryption that are controlled and accessible only by you.
As a managed service, it automates time-consuming administrative tasks such as hardware provisioning, software patching, high availability, backups and scaling for your sensitive and regulated workloads in a cost-effective manner. Backup and restore functionality is the core building block enabling scalability, reliability and high availability in CloudHSM.
You should consider using AWS CloudHSM if you require:
Keys stored in dedicated, third-party validated hardware security modules under your exclusive control
FIPS 140-2 compliance
Integration with applications using PKCS#11, Java JCE, or Microsoft CNG interfaces
Healthcare applications subject to HIPAA regulations
Streaming video solutions subject to contractual DRM requirements
We recently released a whitepaper, “Security of CloudHSM Backups” that provides in-depth information on how backups are protected in all three phases of the CloudHSM backup lifecycle process: Creation, Archive, and Restore.
About the Author
Balaji Iyer is a senior consultant in the Professional Services team at Amazon Web Services. In this role, he has helped several customers successfully navigate their journey to AWS. His specialties include architecting and implementing highly-scalable distributed systems, operational security, large scale migrations, and leading strategic AWS initiatives.
Applying technology to healthcare data has the potential to produce many exciting and important outcomes. The analysis produced from healthcare data can empower clinicians to improve the health of individuals and populations by enabling them to make better decisions that enhance the care they provide.
The Observational Health Data Sciences and Informatics (OHDSI, pronounced “Odyssey”) program and community is working toward this goal by producing data standards and open-source solutions to store and analyze observational health data. Using the OHDSI tools, you can visualize the health of your entire population. You can build cohorts of patients, analyze incidence rates for various conditions, and estimate the effect of treatments on patients with certain conditions. You can also model health outcome predictions using machine learning algorithms.
One of the challenges often faced when working with big data tools is the expense of the infrastructure required to run them. Another challenge is the learning curve to implement and begin using these tools. Amazon Web Services has enabled us to address many of the classic IT challenges by making enterprise class infrastructure and technology available in an affordable, elastic, and automated way. This blog post demonstrates how to combine some of the OHDSI projects (Atlas, Achilles, WebAPI, and the OMOP Common Data Model) with AWS technologies. By doing so, you can quickly and inexpensively implement a health data science and informatics environment.
Shown following is just one example of the population health analysis that is possible with the OHDSI tools. This visualization shows the prevalence of various drugs within the given population of people. This information helps researchers and clinicians discover trends and make better informed decisions about patient health.
OHDSI application architecture on AWS
Before deploying an application on AWS that transmits, processes, or stores protected health information (PHI) or personally identifiable information (PII), address your organization’s compliance concerns. Make sure that you have worked with your internal compliance and legal team to ensure compliance with the laws and regulations that govern your organization. To understand how you can use AWS services as a part of your overall compliance program, see the AWS HIPAA Compliance whitepaper. With that said, we paid careful attention to the HIPAA control set during the design of this solution.
This blog post presents a complete OHDSI application environment, including a data warehouse with sample data. It has the following features:
Following, you can see a block diagram of how the OHDSI tools map to the services provided by AWS.
Atlas is the web application that researchers interact with to perform analysis. Atlas interacts with the underlying databases through a web services application named WebAPI. In this example, both Atlas and WebAPI are deployed and managed by AWS Elastic Beanstalk. Elastic Beanstalk is an easy-to-use service for deploying and scaling web applications. Simply upload the Atlas and WebAPI code and Elastic Beanstalk automatically handles the deployment. It covers everything from capacity provisioning, load balancing, autoscaling, and high availability, to application health monitoring. Using a feature of Elastic Beanstalk called ebextensions, the Atlas and WebAPI servers are customized to use an encrypted storage volume for the middleware application logs.
Atlas stores the state of the various patient cohorts that are analyzed in a dedicated database separate from your observational health data. This database is provided by Amazon Aurora with PostgreSQL compatibility.
Amazon Aurora is a relational database built for the cloud that combines the performance and availability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. It provides cost-efficient and resizable capacity while automating time-consuming administration tasks such as hardware provisioning, database setup, patching, and backups. It is configured for high availability and uses encryption at rest for the database and backups, and encryption in flight for the JDBC connections.
All of your observational health data is stored inside the OHDSI Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). This model also stores useful vocabulary tables that help to translate values from various data sources (like EHR systems and claims data).
The OMOP CDM schema is deployed onto Amazon Redshift. Amazon Redshift is a fast, fully managed data warehouse that allows you to run complex analytic queries against petabytes of structured data. It uses using sophisticated query optimization, columnar storage on high-performance local disks, and massively parallel query execution. You can also resize an Amazon Redshift cluster as your requirements for it change.
The solution in this blog post automatically loads de-identified sample data of 1,000 people from the CMS 2008–2010 Data Entrepreneurs’ Synthetic Public Use File (DE-SynPUF). The data has helpful formatting from LTS Computing LLC. Vocabulary data from the OHDSI Athena project is also loaded into the OMOP CDM, and a results set is computed by OHDSI Achilles.
Following is a detailed technical diagram showing the configuration of the architecture to be deployed.
Deploying OHDSI on AWS
Everything just described is automatically deployed by using an AWS CloudFormation template. Using this template, you can quickly get started with the OHDSI project. The CloudFormation templates for this deployment as well as all of the supporting scripts and source code can be found in the AWS Labs GitHub repo.
From your AWS account, open the CloudFormation Management Console and choose Create Stack. From there, copy and paste the following URL in the Specify an Amazon S3 template URL box, and choose Next.
On the next screen, you provide a Stack Name (this can be anything you like) and a few other parameters for your OHDSI environment.
You use the DatabasePassword parameter to set the password for the master user account of the Amazon Redshift and Aurora databases.
You use the EBEndpoint name to generate a unique URL for Atlas to access the OHDSI environment. It is http://EBEndpoint.AWS-Region.elasticbeanstalk.com, where EBEndpoint.AWS-Region indicates the Elastic Beanstalk endpoint and AWS Region. You can configure this URL through Elastic Beanstalk if you want to change it in the future.
You use the KPair option to choose one of your existing Amazon EC2 key pairs to use with the instances that Elastic Beanstalk deploys. By doing this, you can gain administrative access to these instances in the future if you need to. If you don’t already have an Amazon EC2 key pair, you can generate one for free. You do this by going to the Key Pairs section of the EC2 console and choosing Generate Key Pair.
Finally, you use the UserIPRange parameter to specify a CIDR IP address range from which to access your OHDSI environment. By default, your OHDSI environment is accessible over the public internet. Use UserIPRange to limit access over the Internet to a single IP address or a range of IP addresses that represent users you want to have access. Through additional configuration, you can also make your OHDSI environment completely private and accessible only through a VPN or AWS Direct Connect private circuit.
When you’ve provided all Parameters, choose Next.
On the next screen, you can provide some other optional information like tags at your discretion, or just choose Next.
On the next screen, you can review what will be deployed. At the bottom of the screen, there is a check box for you to acknowledge that AWS CloudFormation might create IAM resources with custom names. This is correct; the template being deployed creates four custom roles that give permission for the AWS services involved to communicate with each other. Details of these permissions are inside the CloudFormation template referenced in the URL given in the first step. Check the box acknowledging this and choose Next.
You can watch as CloudFormation builds out your OHDSI architecture. A CloudFormation deployment is called a stack. The parent stack creates two child stacks, one containing the VPC and IAM roles and another created by Elastic Beanstalk with the Atlas and WebAPI servers. When all three stacks have reached the green CREATE_COMPLETE status, as shown in the screenshot following, then the OHDSI architecture has been deployed.
There is still some work going on behind the scenes, though. To watch the progress, browse to the Amazon Redshift section of your AWS Management Console and choose the Amazon Redshift cluster that was created for your OHDSI architecture. After you do so, you can observe the Loads and Queries tabs.
First, on the Loads tab, you can see the CMS De-SynPUF sample data and Athena vocabulary data being loaded into the OMOP Common Data Model. After you see the VOCABULARY table reach the COMPLETED status (as shown following), all of the sample and vocabulary data has been loaded.
After the data loads, the Achilles computation starts. On the Queries tab, you can watch Achilles running queries against your database to build out the Results schema. Achilles runs a large number of queries, and the entire process can take quite some time (about 20 minutes for the sample data we’ve loaded). Eventually, no new queries show up in the Queries tab, which shows that the Achilles computation is completed. The entire process from the time you executed the CloudFormation template until the Achilles computation is completed usually takes about an hour and 15 minutes.
At this point, you can browse to the Elastic Beanstalk section of the AWS Management Console. There, you can choose the OHDSI Application and Environment (green box) that was deployed by the CloudFormation template. At the top of the dashboard, as shown following, you see a link to a URL. This URL matches the name you provided in the EBEndpoint parameter of the CloudFormation template. Choose this URL, and you can start using Atlas to explore the CMS DE-SynPUF sample data!
Cost of deploying this environment
It used to be common to see healthcare data analytics environments deployed in an on-premises data center with expensive data warehouse appliances and virtualized environments. The cloud era has democratized the availability of the infrastructure required to do this type of data analysis, so that now it is within reach of even small organizations. This environment can expand to analyze petabyte-scale health data, and you only pay for what you need. See an estimated breakdown of the monthly cost components for this environment as deployed on the AWS Solution Calculator.
It’s also worth noting that this environment does not have to be run all of the time. If you are only performing analyses periodically, you can terminate the environment when you are finished and restore it from the database backups when you want to continue working. This would reduce the cost of operation even further.
Now that you have a fully functional OHDSI environment with sample data, you can use this to explore and learn the toolset and its capabilities. After learning with the sample data, you can begin gaining insights by analyzing your own organization’s health data. You can do this using an extract, transform, load (ETL) process from one or more of your health data sources.
James Wiggins is a senior healthcare solutions architect at AWS. He is passionate about using technology to help organizations positively impact world health. He also loves spending time with his wife and three children.
Artificial intelligence technologies have the potential to upend the longstanding advantage that attack has over defense on the Internet. This has to do with the relative strengths and weaknesses of people and computers, how those all interplay in Internet security, and where AI technologies might change things.
You can divide Internet security tasks into two sets: what humans do well and what computers do well. Traditionally, computers excel at speed, scale, and scope. They can launch attacks in milliseconds and infect millions of computers. They can scan computer code to look for particular kinds of vulnerabilities, and data packets to identify particular kinds of attacks.
Humans, conversely, excel at thinking and reasoning. They can look at the data and distinguish a real attack from a false alarm, understand the attack as it’s happening, and respond to it. They can find new sorts of vulnerabilities in systems. Humans are creative and adaptive, and can understand context.
Computers — so far, at least — are bad at what humans do well. They’re not creative or adaptive. They don’t understand context. They can behave irrationally because of those things.
Humans are slow, and get bored at repetitive tasks. They’re terrible at big data analysis. They use cognitive shortcuts, and can only keep a few data points in their head at a time. They can also behave irrationally because of those things.
AI will allow computers to take over Internet security tasks from humans, and then do them faster and at scale. Here are possible AI capabilities:
Discovering new vulnerabilities — and, more importantly, new types of vulnerabilities in systems, both by the offense to exploit and by the defense to patch, and then automatically exploiting or patching them.
Reacting and adapting to an adversary’s actions, again both on the offense and defense sides. This includes reasoning about those actions and what they mean in the context of the attack and the environment.
Abstracting lessons from individual incidents, generalizing them across systems and networks, and applying those lessons to increase attack and defense effectiveness elsewhere.
Identifying strategic and tactical trends from large datasets and using those trends to adapt attack and defense tactics.
That’s an incomplete list. I don’t think anyone can predict what AI technologies will be capable of. But it’s not unreasonable to look at what humans do today and imagine a future where AIs are doing the same things, only at computer speeds, scale, and scope.
Both attack and defense will benefit from AI technologies, but I believe that AI has the capability to tip the scales more toward defense. There will be better offensive and defensive AI techniques. But here’s the thing: defense is currently in a worse position than offense precisely because of the human components. Present-day attacks pit the relative advantages of computers and humans against the relative weaknesses of computers and humans. Computers moving into what are traditionally human areas will rebalance that equation.
Roy Amara famously said that we overestimate the short-term effects of new technologies, but underestimate their long-term effects. AI is notoriously hard to predict, so many of the details I speculate about are likely to be wrong — and AI is likely to introduce new asymmetries that we can’t foresee. But AI is the most promising technology I’ve seen for bringing defense up to par with offense. For Internet security, that will change everything.
***This blog is authored by Mike Okner of Monsanto, an AWS customer. It originally appeared on the Monsanto company blog. Minor edits were made to the original post.***
Recently, I was looking to create a status page app to monitor a few important internal services. I wanted this app to be as lightweight, reliable, and hassle-free as possible, so using a “serverless” architecture that doesn’t require any patching or other maintenance was quite appealing.
I also don’t deploy anything in a production AWS environment outside of some sort of template (usually CloudFormation) as a rule. I don’t want to have to come back to something I created ad hoc in the console after 6 months and try to recall exactly how I architected all of the resources. I’ll inevitably forget something and create more problems before solving the original one. So building the status page in a template was a requirement.
The Design I settled on a design using two Lambda functions, both written in Python 3.6.
The first Lambda function makes requests out to a list of important services and writes their current status to a DynamoDB table. This function is executed once per minute via CloudWatch Event Rule.
The second Lambda function reads each service’s status & uptime information from DynamoDB and renders a Jinja template. This function is behind an API Gateway that has been configured to return text/html instead of its default application/json Content-Type.
The CloudFormation Template AWS provides a Serverless Application Model template transformer to streamline the templating of Lambda + API Gateway designs, but it assumes (like everything else about the API Gateway) that you’re actually serving an API that returns JSON content. So, unfortunately, it won’t work for this use-case because we want to return HTML content. Instead, we’ll have to enumerate every resource like usual.
The Skeleton We’ll be using YAML for the template in this example. I find it easier to read than JSON, but you can easily convert between the two with a converter if you disagree.
The CheckerLambda is the actual Lambda function. The Code section is a local path to a ZIP file containing the code and its dependencies. I’m using CloudFormation’s packaging feature to automatically push the deployable to S3.
The CheckerLambdaRole is the IAM role the Lambda will assume which grants it access to DynamoDB in addition to the usual Lambda logging permissions.
The CheckerLambdaTimer is the CloudWatch Events Rule that triggers the checker to run once per minute.
The CheckerLambdaTimerPermission grants CloudWatch the ability to invoke the checker Lambda function on its interval.
The Web Page Gateway The API Gateway handles incoming requests for the web page, invokes the Lambda, and then returns the Lambda’s results as HTML content. Its template looks like:
There’s a lot going on here, but the real meat is in the PageGatewayMethod section. There are a couple properties that deviate from the default which is why we couldn’t use the SAM transformer.
First, we’re passing request headers through to the Lambda in theRequestTemplates section. I’m doing this so I can validate incoming auth headers. The API Gateway can do some types of auth, but I found it easier to check auth myself in the Lambda function since the Gateway is designed to handle API calls and not browser requests.
Next, note that in the IntegrationResponses section we’re defining the Content-Type header to be ‘text/html’ (with single-quotes) and defining the ResponseTemplate to be $input.path(‘$’). This is what makes the request render as a HTML page in your browser instead of just raw text.
Due to the StageName and PathPart values in the other sections, your actual page will be accessible at https://someId.execute-api.region.amazonaws.com/Prod/page. I have the page behind an existing reverse-proxy and give it a saner URL for end-users. The reverse proxy also attaches the auth header I mentioned above. If that header isn’t present, the Lambda will render an error page instead so the proxy can’t be bypassed.
The Web Page Rendering Lambda This Lambda is invoked by calls to the API Gateway and looks like:
The WebRenderLambda and WebRenderLambdaRole should look familiar.
The WebRenderLambdaGatewayPermission is similar to the Status Checker’s CloudWatch permission, only this time it allows the API Gateway to invoke this Lambda.
The DynamoDB Table This one is straightforward.
# DynamoDB table
- AttributeName: name
- KeyType: HASH
The Deployment We’ve made it this far defining every resource in a template that we can check in to version control, so we might as well script the deployment as well rather than manually manage the CloudFormation Stack via the AWS web console.
Since I’m using the packaging feature, I first run:
$ aws cloudformation package \
--template-file template.yaml \
--s3-bucket <some-bucket-name> \
Uploading to 34cd6e82c5e8205f9b35e71afd9e1548 1922559 / 1922559.0 (100.00%) Successfully packaged artifacts and wrote output template to file template-packaged.yaml.
Then to deploy the template (whether new or modified), I run:
$ aws cloudformation deploy \
--region '<aws-region>' \
--template-file template-packaged.yaml \
--stack-name '<some-name>' \
Waiting for changeset to be created.. Waiting for stack create/update to complete Successfully created/updated stack - <some-name>
And that’s it! You’ve just created a dynamic web page that will never require you to SSH anywhere, patch a server, recover from a disaster after Amazon terminates your unhealthy EC2, or any other number of pitfalls that are now the problem of some ops person at AWS. And you can reproduce deployments and make changes with confidence because everything is defined in the template and can be tracked in version control.
Kumar Venkateswar, Product Manager on the AWS ML Platforms Team, shares details on the announcement of Auto Scaling with Amazon SageMaker.
With Amazon SageMaker, thousands of customers have been able to easily build, train and deploy their machine learning (ML) models. Today, we’re making it even easier to manage production ML models, with Auto Scaling for Amazon SageMaker. Instead of having to manually manage the number of instances to match the scale that you need for your inferences, you can now have SageMaker automatically scale the number of instances based on an AWS Auto Scaling Policy.
SageMaker has made managing the ML process easier for many customers. We’ve seen customers take advantage of managed Jupyter notebooks and managed distributed training. We’ve seen customers deploying their models to SageMaker hosting for inferences, as they integrate machine learning with their applications. SageMaker makes this easy – you don’t have to think about patching the operating system (OS) or frameworks on your inference hosts, and you don’t have to configure inference hosts across Availability Zones. You just deploy your models to SageMaker, and it handles the rest.
Until now, you have needed to specify the number and type of instances per endpoint (or production variant) to provide the scale that you need for your inferences. If your inference volume changes, you can change the number and/or type of instances that back each endpoint to accommodate that change, without incurring any downtime. In addition to making it easy to change provisioning, customers have asked us how we can make managing capacity for SageMaker even easier.
With Auto Scaling for Amazon SageMaker, in the SageMaker console, the AWS Auto Scaling API, and the AWS SDK, this becomes much easier. Now, instead of having to closely monitor inference volume, and change the endpoint configuration in response, customers can configure a scaling policy to be used by AWS Auto Scaling. Auto Scaling adjusts the number of instances up or down in response to actual workloads, determined by using Amazon CloudWatch metrics and target values defined in the policy. In this way, customers can automatically adjust their inference capacity to maintain predictable performance at a low cost. You simply specify the target inference throughput per instance and provide upper and lower bounds for the number of instances for each production variant. SageMaker will then monitor throughput per instance using Amazon CloudWatch alarms, and then it will adjust provisioned capacity up or down as needed.
After you configure the endpoint with Auto Scaling, SageMaker will continue to monitor your deployed models to automatically adjust the instance count. SageMaker will keep throughput within desired levels, in response to changes in application traffic. This makes it easier to manage models in production, and it can help reduce the cost of deployed models, as you no longer have to provision sufficient capacity in order to manage your peak load. Instead, you configure the limits to accommodate your minimum expected traffic and the maximum peak, and Amazon SageMaker will work within those limits to minimize cost.
How do you get started? Open the SageMaker console. For existing endpoints, you first access the endpoint to modify the settings.
Then, scroll to the Endpoint runtime settings section, select the variant, and choose Configure auto scaling.
First, configure the minimum and maximum number of instances.
Next, choose the throughput per instance at which you want to add an additional instance, given previous load testing.
You can optionally set cool down periods for scaling in or out, to avoid oscillation during periods of wide fluctuation in workload. If not, SageMaker will assume default values.
And that’s it! You now have an endpoint that will automatically scale with increasing inferences.
You pay for the capacity used at regular SageMaker pay-as-you-go pricing, so you no longer have to pay for unused capacity during relative idle periods!
Kumar Venkateswar is a Product Manager in the AWS ML Platforms team, which includes Amazon SageMaker, Amazon Machine Learning, and the AWS Deep Learning AMIs. When not working, Kumar plays the violin and Magic: The Gathering.
This post was written in partnership with Intuit to share learnings, best practices, and recommendations for running an Apache Kafka cluster on AWS. Thanks to Vaishak Suresh and his colleagues at Intuit for their contribution and support.
Intuit, in their own words: Intuit, a leading enterprise customer for AWS, is a creator of business and financial management solutions. For more information on how Intuit partners with AWS, see our previous blog post, Real-time Stream Processing Using Apache Spark Streaming and Apache Kafka on AWS. Apache Kafka is an open-source, distributed streaming platform that enables you to build real-time streaming applications.
The best practices described in this post are based on our experience in running and operating large-scale Kafka clusters on AWS for more than two years. Our intent for this post is to help AWS customers who are currently running Kafka on AWS, and also customers who are considering migrating on-premises Kafka deployments to AWS.
Running your Kafka deployment on Amazon EC2 provides a high performance, scalable solution for ingesting streaming data. AWS offers many different instance types and storage option combinations for Kafka deployments. However, given the number of possible deployment topologies, it’s not always trivial to select the most appropriate strategy suitable for your use case.
In this blog post, we cover the following aspects of running Kafka clusters on AWS:
Deployment considerations and patterns
Backup and restore
Note: While implementing Kafka clusters in a production environment, make sure also to consider factors like your number of messages, message size, monitoring, failure handling, and any operational issues.
Deployment considerations and patterns
In this section, we discuss various deployment options available for Kafka on AWS, along with pros and cons of each option. A successful deployment starts with thoughtful consideration of these options. Considering availability, consistency, and operational overhead of the deployment helps when choosing the right option.
Single AWS Region, Three Availability Zones, All Active
One typical deployment pattern (all active) is in a single AWS Region with three Availability Zones (AZs). One Kafka cluster is deployed in each AZ along with Apache ZooKeeper and Kafka producer and consumer instances as shown in the illustration following.
In this pattern, this is the Kafka cluster deployment:
Kafka producers and Kafka cluster are deployed on each AZ.
Data is distributed evenly across three Kafka clusters by using Elastic Load Balancer.
Kafka consumers aggregate data from all three Kafka clusters.
Kafka cluster failover occurs this way:
Mark down all Kafka producers
Debug and restack Kafka
Restart Kafka producers
Following are the pros and cons of this pattern.
Can sustain the failure of two AZs
No message loss during failover
Very high operational overhead:
All changes need to be deployed three times, one for each Kafka cluster
Maintaining and monitoring three Kafka clusters
Maintaining and monitoring three consumer clusters
A restart is required for patching and upgrading brokers in a Kafka cluster. In this approach, a rolling upgrade is done separately for each cluster.
Single Region, Three Availability Zones, Active-Standby
Another typical deployment pattern (active-standby) is in a single AWS Region with a single Kafka cluster and Kafka brokers and Zookeepers distributed across three AZs. Another similar Kafka cluster acts as a standby as shown in the illustration following. You can use Kafka mirroring with MirrorMaker to replicate messages between any two clusters.
In this pattern, this is the Kafka cluster deployment:
Kafka producers are deployed on all three AZs.
Only one Kafka cluster is deployed across three AZs (active).
ZooKeeper instances are deployed on each AZ.
Brokers are spread evenly across all three AZs.
Kafka consumers can be deployed across all three AZs.
Standby Kafka producers and a Multi-AZ Kafka cluster are part of the deployment.
Kafka cluster failover occurs this way:
Switch traffic to standby Kafka producers cluster and Kafka cluster.
Restart consumers to consume from standby Kafka cluster.
Following are the pros and cons of this pattern.
Less operational overhead when compared to the first option
Only one Kafka cluster to manage and consume data from
Can handle single AZ failures without activating a standby Kafka cluster
Added latency due to cross-AZ data transfer among Kafka brokers
For Kafka versions before 0.10, replicas for topic partitions have to be assigned so they’re distributed to the brokers on different AZs (rack-awareness)
The cluster can become unavailable in case of a network glitch, where ZooKeeper does not see Kafka brokers
Possibility of in-transit message loss during failover
Intuit recommends using a single Kafka cluster in one AWS Region, with brokers distributing across three AZs (single region, three AZs). This approach offers stronger fault tolerance than otherwise, because a failed AZ won’t cause Kafka downtime.
There are two storage options for file storage in Amazon EC2:
Ephemeral storage is local to the Amazon EC2 instance. It can provide high IOPS based on the instance type. On the other hand, Amazon EBS volumes offer higher resiliency and you can configure IOPS based on your storage needs. EBS volumes also offer some distinct advantages in terms of recovery time. Your choice of storage is closely related to the type of workload supported by your Kafka cluster.
Kafka provides built-in fault tolerance by replicating data partitions across a configurable number of instances. If a broker fails, you can recover it by fetching all the data from other brokers in the cluster that host the other replicas. Depending on the size of the data transfer, it can affect recovery process and network traffic. These in turn eventually affect the cluster’s performance.
The following table contrasts the benefits of using an instance store versus using EBS for storage.
Instance storage is recommended for large- and medium-sized Kafka clusters. For a large cluster, read/write traffic is distributed across a high number of brokers, so the loss of a broker has less of an impact. However, for smaller clusters, a quick recovery for the failed node is important, but a failed broker takes longer and requires more network traffic for a smaller Kafka cluster.
Storage-optimized instances like h1, i3, and d2 are an ideal choice for distributed applications like Kafka.
The primary advantage of using EBS in a Kafka deployment is that it significantly reduces data-transfer traffic when a broker fails or must be replaced. The replacement broker joins the cluster much faster.
Data stored on EBS is persisted in case of an instance failure or termination. The broker’s data stored on an EBS volume remains intact, and you can mount the EBS volume to a new EC2 instance. Most of the replicated data for the replacement broker is already available in the EBS volume and need not be copied over the network from another broker. Only the changes made after the original broker failure need to be transferred across the network. That makes this process much faster.
Intuit chose EBS because of their frequent instance restacking requirements and also other benefits provided by EBS.
Generally, Kafka deployments use a replication factor of three. EBS offers replication within their service, so Intuit chose a replication factor of two instead of three.
The choice of instance types is generally driven by the type of storage required for your streaming applications on a Kafka cluster. If your application requires ephemeral storage, h1, i3, and d2 instances are your best option.
The network plays a very important role in a distributed system like Kafka. A fast and reliable network ensures that nodes can communicate with each other easily. The available network throughput controls the maximum amount of traffic that Kafka can handle. Network throughput, combined with disk storage, is often the governing factor for cluster sizing.
If you expect your cluster to receive high read/write traffic, select an instance type that offers 10-Gb/s performance.
In addition, choose an option that keeps interbroker network traffic on the private subnet, because this approach allows clients to connect to the brokers. Communication between brokers and clients uses the same network interface and port. For more details, see the documentation about IP addressing for EC2 instances.
If you are deploying in more than one AWS Region, you can connect the two VPCs in the two AWS Regions using cross-region VPC peering. However, be aware of the networking costs associated with cross-AZ deployments.
Kafka has a history of not being backward compatible, but its support of backward compatibility is getting better. During a Kafka upgrade, you should keep your producer and consumer clients on a version equal to or lower than the version you are upgrading from. After the upgrade is finished, you can start using a new protocol version and any new features it supports. There are three upgrade approaches available, discussed following.
Rolling or in-place upgrade
In a rolling or in-place upgrade scenario, upgrade one Kafka broker at a time. Take into consideration the recommendations for doing rolling restarts to avoid downtime for end users.
If you can afford the downtime, you can take your entire cluster down, upgrade each Kafka broker, and then restart the cluster.
Intuit followed the blue/green deployment model for their workloads, as described following.
If you can afford to create a separate Kafka cluster and upgrade it, we highly recommend the blue/green upgrade scenario. In this scenario, we recommend that you keep your clusters up-to-date with the latest Kafka version. For additional details on Kafka version upgrades or more details, see the Kafka upgrade documentation.
The following illustration shows a blue/green upgrade.
In this scenario, the upgrade plan works like this:
Create a new Kafka cluster on AWS.
Create a new Kafka producers stack to point to the new Kafka cluster.
Create topics on the new Kafka cluster.
Test the green deployment end to end (sanity check).
Using Amazon Route 53, change the new Kafka producers stack on AWS to point to the new green Kafka environment that you have created.
The roll-back plan works like this:
Switch Amazon Route 53 to the old Kafka producers stack on AWS to point to the old Kafka environment.
You can tune Kafka performance in multiple dimensions. Following are some best practices for performance tuning.
These are some general performance tuning techniques:
If throughput is less than network capacity, try the following:
Add more threads
Increase batch size
Add more producer instances
Add more partitions
To improve latency when acks =-1, increase your num.replica.fetches value.
For cross-AZ data transfer, tune your buffer settings for sockets and for OS TCP.
Make sure that num.io.threads is greater than the number of disks dedicated for Kafka.
Adjust num.network.threads based on the number of producers plus the number of consumers plus the replication factor.
Your message size affects your network bandwidth. To get higher performance from a Kafka cluster, select an instance type that offers 10 Gb/s performance.
For Java and JVM tuning, try the following:
Minimize GC pauses by using the Oracle JDK, which uses the new G1 garbage-first collector.
Try to keep the Kafka heap size below 4 GB.
Knowing whether a Kafka cluster is working correctly in a production environment is critical. Sometimes, just knowing that the cluster is up is enough, but Kafka applications have many moving parts to monitor. In fact, it can easily become confusing to understand what’s important to watch and what you can set aside. Items to monitor range from simple metrics about the overall rate of traffic, to producers, consumers, brokers, controller, ZooKeeper, topics, partitions, messages, and so on.
For monitoring, Intuit used several tools, including Newrelec, Wavefront, Amazon CloudWatch, and AWS CloudTrail. Our recommended monitoring approach follows.
For system metrics, we recommend that you monitor:
File handle usage
Disk I/O performance
For producers, we recommend that you monitor:
For consumers, we recommend that you monitor:
Like most distributed systems, Kafka provides the mechanisms to transfer data with relatively high security across the components involved. Depending on your setup, security might involve different services such as encryption, Kerberos, Transport Layer Security (TLS) certificates, and advanced access control list (ACL) setup in brokers and ZooKeeper. The following tells you more about the Intuit approach. For details on Kafka security not covered in this section, see the Kafka documentation.
Kafka uses TLS for client and internode communications.
Authentication of connections to brokers from clients (producers and consumers) to other brokers and tools uses either Secure Sockets Layer (SSL) or Simple Authentication and Security Layer (SASL).
Kafka supports Kerberos authentication. If you already have a Kerberos server, you can add Kafka to your current configuration.
In Kafka, authorization is pluggable and integration with external authorization services is supported.
Backup and restore
The type of storage used in your deployment dictates your backup and restore strategy.
The best way to back up a Kafka cluster based on instance storage is to set up a second cluster and replicate messages using MirrorMaker. Kafka’s mirroring feature makes it possible to maintain a replica of an existing Kafka cluster. Depending on your setup and requirements, your backup cluster might be in the same AWS Region as your main cluster or in a different one.
For EBS-based deployments, you can enable automatic snapshots of EBS volumes to back up volumes. You can easily create new EBS volumes from these snapshots to restore. We recommend storing backup files in Amazon S3.
For more information on how to back up in Kafka, see the Kafka documentation.
In this post, we discussed several patterns for running Kafka in the AWS Cloud. AWS also provides an alternative managed solution with Amazon Kinesis Data Streams, there are no servers to manage or scaling cliffs to worry about, you can scale the size of your streaming pipeline in seconds without downtime, data replication across availability zones is automatic, you benefit from security out of the box, Kinesis Data Streams is tightly integrated with a wide variety of AWS services like Lambda, Redshift, Elasticsearch and it supports open source frameworks like Storm, Spark, Flink, and more. You may refer to kafka-kinesis connector.
If you have questions or suggestions, please comment below.
Prasad Alle is a Senior Big Data Consultant with AWS Professional Services. He spends his time leading and building scalable, reliable Big data, Machine learning, Artificial Intelligence and IoT solutions for AWS Enterprise and Strategic customers. His interests extend to various technologies such as Advanced Edge Computing, Machine learning at Edge. In his spare time, he enjoys spending time with his family.
The collective thoughts of the interwebz
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.