Tag Archives: utility

Manage Kubernetes Clusters on AWS Using CoreOS Tectonic

Post Syndicated from Arun Gupta original https://aws.amazon.com/blogs/compute/kubernetes-clusters-aws-coreos-tectonic/

There are multiple ways to run a Kubernetes cluster on Amazon Web Services (AWS). The first post in this series explained how to manage a Kubernetes cluster on AWS using kops. This second post explains how to manage a Kubernetes cluster on AWS using CoreOS Tectonic.

Tectonic overview

Tectonic delivers the most current upstream version of Kubernetes with additional features. It is a commercial offering from CoreOS and adds the following features over the upstream:

  • Installer
    Comes with a graphical installer that installs a highly available Kubernetes cluster. Alternatively, the cluster can be installed using AWS CloudFormation templates or Terraform scripts.
  • Operators
    An operator is an application-specific controller that extends the Kubernetes API to create, configure, and manage instances of complex stateful applications on behalf of a Kubernetes user. This release includes an etcd operator for rolling upgrades and a Prometheus operator for monitoring capabilities.
  • Console
    A web console provides a full view of applications running in the cluster. It also allows you to deploy applications to the cluster and start the rolling upgrade of the cluster.
  • Monitoring
    Node CPU and memory metrics are powered by the Prometheus operator. The graphs are available in the console. A large set of preconfigured Prometheus alerts are also available.
  • Security
    Tectonic ensures that cluster is always up to date with the most recent patches/fixes. Tectonic clusters also enable role-based access control (RBAC). Different roles can be mapped to an LDAP service.
  • Support
    CoreOS provides commercial support for clusters created using Tectonic.

Tectonic can be installed on AWS using a GUI installer or Terraform scripts. The installer prompts you for the information needed to boot the Kubernetes cluster, such as AWS access and secret key, number of master and worker nodes, and instance size for the master and worker nodes. The cluster can be created after all the options are specified. Alternatively, Terraform assets can be downloaded and the cluster can be created later. This post shows using the installer.

CoreOS License and Pull Secret

Even though Tectonic is a commercial offering, a cluster for up to 10 nodes can be created by creating a free account at Get Tectonic for Kubernetes. After signup, a CoreOS License and Pull Secret files are provided on your CoreOS account page. Download these files as they are needed by the installer to boot the cluster.

IAM user permission

The IAM user to create the Kubernetes cluster must have access to the following services and features:

  • Amazon Route 53
  • Amazon EC2
  • Elastic Load Balancing
  • Amazon S3
  • Amazon VPC
  • Security groups

Use the aws-policy policy to grant the required permissions for the IAM user.

DNS configuration

A subdomain is required to create the cluster, and it must be registered as a public Route 53 hosted zone. The zone is used to host and expose the console web application. It is also used as the static namespace for the Kubernetes API server. This allows kubectl to be able to talk directly with the master.

The domain may be registered using Route 53. Alternatively, a domain may be registered at a third-party registrar. This post uses a kubernetes-aws.io domain registered at a third-party registrar and a tectonic subdomain within it.

Generate a Route 53 hosted zone using the AWS CLI. Download jq to run this command:

ID=$(uuidgen) && \
aws route53 create-hosted-zone \
--name tectonic.kubernetes-aws.io \
--caller-reference $ID \
| jq .DelegationSet.NameServers

The command shows an output such as the following:

[
  "ns-1924.awsdns-48.co.uk",
  "ns-501.awsdns-62.com",
  "ns-1259.awsdns-29.org",
  "ns-749.awsdns-29.net"
]

Create NS records for the domain with your registrar. Make sure that the NS records can be resolved using a utility like dig web interface. A sample output would look like the following:

The bottom of the screenshot shows NS records configured for the subdomain.

Download and run the Tectonic installer

Download the Tectonic installer (version 1.7.1) and extract it. The latest installer can always be found at coreos.com/tectonic. Start the installer:

./tectonic/tectonic-installer/$PLATFORM/installer

Replace $PLATFORM with either darwin or linux. The installer opens your default browser and prompts you to select the cloud provider. Choose Amazon Web Services as the platform. Choose Next Step.

Specify the Access Key ID and Secret Access Key for the IAM role that you created earlier. This allows the installer to create resources required for the Kubernetes cluster. This also gives the installer full access to your AWS account. Alternatively, to protect the integrity of your main AWS credentials, use a temporary session token to generate temporary credentials.

You also need to choose a region in which to install the cluster. For the purpose of this post, I chose a region close to where I live, Northern California. Choose Next Step.

Give your cluster a name. This name is part of the static namespace for the master and the address of the console.

To enable in-place update to the Kubernetes cluster, select the checkbox next to Automated Updates. It also enables update to the etcd and Prometheus operators. This feature may become a default in future releases.

Choose Upload “tectonic-license.txt” and upload the previously downloaded license file.

Choose Upload “config.json” and upload the previously downloaded pull secret file. Choose Next Step.

Let the installer generate a CA certificate and key. In this case, the browser may not recognize this certificate, which I discuss later in the post. Alternatively, you can provide a CA certificate and a key in PEM format issued by an authorized certificate authority. Choose Next Step.

Use the SSH key for the region specified earlier. You also have an option to generate a new key. This allows you to later connect using SSH into the Amazon EC2 instances provisioned by the cluster. Here is the command that can be used to log in:

ssh –i <key> [email protected]<ec2-instance-ip>

Choose Next Step.

Define the number and instance type of master and worker nodes. In this case, create a 6 nodes cluster. Make sure that the worker nodes have enough processing power and memory to run the containers.

An etcd cluster is used as persistent storage for all of Kubernetes API objects. This cluster is required for the Kubernetes cluster to operate. There are three ways to use the etcd cluster as part of the Tectonic installer:

  • (Default) Provision the cluster using EC2 instances. Additional EC2 instances are used in this case.
  • Use an alpha support for cluster provisioning using the etcd operator. The etcd operator is used for automated operations of the etcd master nodes for the cluster itself, in addition to for etcd instances that are created for application usage. The etcd cluster is provisioned within the Tectonic installer.
  • Bring your own pre-provisioned etcd cluster.

Use the first option in this case.

For more information about choosing the appropriate instance type, see the etcd hardware recommendation. Choose Next Step.

Specify the networking options. The installer can create a new public VPC or use a pre-existing public or private VPC. Make sure that the VPC requirements are met for an existing VPC.

Give a DNS name for the cluster. Choose the domain for which the Route 53 hosted zone was configured earlier, such as tectonic.kubernetes-aws.io. Multiple clusters may be created under a single domain. The cluster name and the DNS name would typically match each other.

To select the CIDR range, choose Show Advanced Settings. You can also choose the Availability Zones for the master and worker nodes. By default, the master and worker nodes are spread across multiple Availability Zones in the chosen region. This makes the cluster highly available.

Leave the other values as default. Choose Next Step.

Specify an email address and password to be used as credentials to log in to the console. Choose Next Step.

At any point during the installation, you can choose Save progress. This allows you to save configurations specified in the installer. This configuration file can then be used to restore progress in the installer at a later point.

To start the cluster installation, choose Submit. At another time, you can download the Terraform assets by choosing Manually boot. This allows you to boot the cluster later.

The logs from the Terraform scripts are shown in the installer. When the installation is complete, the console shows that the Terraform scripts were successfully applied, the domain name was resolved successfully, and that the console has started. The domain works successfully if the DNS resolution worked earlier, and it’s the address where the console is accessible.

Choose Download assets to download assets related to your cluster. It contains your generated CA, kubectl configuration file, and the Terraform state. This download is an important step as it allows you to delete the cluster later.

Choose Next Step for the final installation screen. It allows you to access the Tectonic console, gives you instructions about how to configure kubectl to manage this cluster, and finally deploys an application using kubectl.

Choose Go to my Tectonic Console. In our case, it is also accessible at http://cluster.tectonic.kubernetes-aws.io/.

As I mentioned earlier, the browser does not recognize the self-generated CA certificate. Choose Advanced and connect to the console. Enter the login credentials specified earlier in the installer and choose Login.

The Kubernetes upstream and console version are shown under Software Details. Cluster health shows All systems go and it means that the API server and the backend API can be reached.

To view different Kubernetes resources in the cluster choose, the resource in the left navigation bar. For example, all deployments can be seen by choosing Deployments.

By default, resources in the all namespace are shown. Other namespaces may be chosen by clicking on a menu item on the top of the screen. Different administration tasks such as managing the namespaces, getting list of the nodes and RBAC can be configured as well.

Download and run Kubectl

Kubectl is required to manage the Kubernetes cluster. The latest version of kubectl can be downloaded using the following command:

curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/darwin/amd64/kubectl

It can also be conveniently installed using the Homebrew package manager. To find and access a cluster, Kubectl needs a kubeconfig file. By default, this configuration file is at ~/.kube/config. This file is created when a Kubernetes cluster is created from your machine. However, in this case, download this file from the console.

In the console, choose admin, My Account, Download Configuration and follow the steps to download the kubectl configuration file. Move this file to ~/.kube/config. If kubectl has already been used on your machine before, then this file already exists. Make sure to take a backup of that file first.

Now you can run the commands to view the list of deployments:

~ $ kubectl get deployments --all-namespaces
NAMESPACE         NAME                                    DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
kube-system       etcd-operator                           1         1         1            1           43m
kube-system       heapster                                1         1         1            1           40m
kube-system       kube-controller-manager                 3         3         3            3           43m
kube-system       kube-dns                                1         1         1            1           43m
kube-system       kube-scheduler                          3         3         3            3           43m
tectonic-system   container-linux-update-operator         1         1         1            1           40m
tectonic-system   default-http-backend                    1         1         1            1           40m
tectonic-system   kube-state-metrics                      1         1         1            1           40m
tectonic-system   kube-version-operator                   1         1         1            1           40m
tectonic-system   prometheus-operator                     1         1         1            1           40m
tectonic-system   tectonic-channel-operator               1         1         1            1           40m
tectonic-system   tectonic-console                        2         2         2            2           40m
tectonic-system   tectonic-identity                       2         2         2            2           40m
tectonic-system   tectonic-ingress-controller             1         1         1            1           40m
tectonic-system   tectonic-monitoring-auth-alertmanager   1         1         1            1           40m
tectonic-system   tectonic-monitoring-auth-prometheus     1         1         1            1           40m
tectonic-system   tectonic-prometheus-operator            1         1         1            1           40m
tectonic-system   tectonic-stats-emitter                  1         1         1            1           40m

This output is similar to the one shown in the console earlier. Now, this kubectl can be used to manage your resources.

Upgrade the Kubernetes cluster

Tectonic allows the in-place upgrade of the cluster. This is an experimental feature as of this release. The clusters can be updated either automatically, or with manual approval.

To perform the update, choose Administration, Cluster Settings. If an earlier Tectonic installer, version 1.6.2 in this case, is used to install the cluster, then this screen would look like the following:

Choose Check for Updates. If any updates are available, choose Start Upgrade. After the upgrade is completed, the screen is refreshed.

This is an experimental feature in this release and so should only be used on clusters that can be easily replaced. This feature may become a fully supported in a future release. For more information about the upgrade process, see Upgrading Tectonic & Kubernetes.

Delete the Kubernetes cluster

Typically, the Kubernetes cluster is a long-running cluster to serve your applications. After its purpose is served, you may delete it. It is important to delete the cluster as this ensures that all resources created by the cluster are appropriately cleaned up.

The easiest way to delete the cluster is using the assets downloaded in the last step of the installer. Extract the downloaded zip file. This creates a directory like <cluster-name>_TIMESTAMP. In that directory, give the following command to delete the cluster:

TERRAFORM_CONFIG=$(pwd)/.terraformrc terraform destroy --force

This destroys the cluster and all associated resources.

You may have forgotten to download the assets. There is a copy of the assets in the directory tectonic/tectonic-installer/darwin/clusters. In this directory, another directory with the name <cluster-name>_TIMESTAMP contains your assets.

Conclusion

This post explained how to manage Kubernetes clusters using the CoreOS Tectonic graphical installer.  For more details, see Graphical Installer with AWS. If the installation does not succeed, see the helpful Troubleshooting tips. After the cluster is created, see the Tectonic tutorials to learn how to deploy, scale, version, and delete an application.

Future posts in this series will explain other ways of creating and running a Kubernetes cluster on AWS.

Arun

Strategies for Backing Up Windows Computers

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/strategies-for-backing-up-windows-computers/

Windows 7, Windows 8, Windows 10 logos

There’s a little company called Apple making big announcements this week, but about 45% of you are on Windows machines, so we thought it would be a good idea to devote a blog post today to Windows users and the options they have for backing up Windows computers.

We’ll be talking about the various options for backing up Windows desktop OS’s 7, 8, and 10, and Windows servers. We’ve written previously about this topic in How to Back Up Windows, and Computer Backup Options, but we’ll be covering some new topics and ways to combine strategies in this post. So, if you’re a Windows user looking for shelter from all the Apple hoopla, welcome to our Apple Announcement Day Windows Backup Day post.

Windows laptop

First, Let’s Talk About What We Mean by Backup

This might seem to our readers like an unneeded appetizer on the way to the main course of our post, but we at Backblaze know that people often mean very different things when they use backup and related terms. Let’s start by defining what we mean when we say backup, cloud storage, sync, and archive.

Backup
A backup is an active copy of the system or files that you are using. It is distinguished from an archive, which is the storing of data that is no longer in active use. Backups fall into two main categories: file and image. File backup software will back up whichever files you designate by either letting you include files you wish backed up or by excluding files you don’t want backed up, or both. An image backup, sometimes called a disaster recovery backup or a system clone, is useful if you need to recreate your system on a new drive or computer.
The first backup generally will be a full backup of all files. After that, the backup will be incremental, meaning that only files that have been changed since the full backup will be added. Often, the software will keep changed versions of the files for some period of time, so you can maintain a number of previous revisions of your files in case you wish to return to something in an earlier version of your file.
The destination for your backup could be another drive on your computer, an attached drive, a network-attached drive (NAS), or the cloud.
Cloud Storage
Cloud storage vendors supply data storage just as a utility company supplies power, gas, or water. Cloud storage can be used for data backups, but it can also be used for data archives, application data, records, or libraries of photos, videos, and other media.
You contract with the service for storing any type of data, and the storage location is available to you via the internet. Cloud storage providers generally charge by some combination of data ingress, egress, and the amount of data stored.
Sync
File sync is useful for files that you wish to have access to from different places or computers, or for files that you wish to share with others. While sync has its uses, it has limitations for keeping files safe and how much it could cost you to store large amounts of data. As opposed to backup, which keeps revision of files, sync is designed to keep two or more locations exactly the same. Sync costs are based on how much data you sync and can get expensive for large amounts of data.
Archive
A data archive is for data that is no longer in active use but needs to be saved, and may or may not ever be retrieved again. In old-style storage parlance, it is called cold storage. An archive could be stored with a cloud storage provider, or put on a hard drive or flash drive that you disconnect and put in the closet, or mail to your brother in Idaho.

What’s the Best Strategy for Backing Up?

Now that we’ve got our terminology clear, let’s talk backup strategies for Windows.

At Backblaze, we advocate the 3-2-1 strategy for safeguarding your data, which means that you should maintain three copies of any valuable data — two copies stored locally and one stored remotely. I follow this strategy at home by working on the active data on my Windows 10 desktop computer (copy one), which is backed up to a Drobo RAID device attached via USB (copy two), and backing up the desktop to Backblaze’s Personal Backup in the cloud (copy three). I also keep an image of my primary disk on a separate drive and frequently update it using Windows 10’s image tool.

I use Dropbox for sharing specific files I am working on that I might wish to have access to when I am traveling or on another computer. Once my subscription with Dropbox expires, I’ll use the latest release of Backblaze that has individual file preview with sharing built-in.

Before you decide which backup strategy will work best for your situation, you’ll need to ask yourself a number of questions. These questions include where you wish to store your backups, whether you wish to supply your own storage media, whether the backups will be manual or automatic, and whether limited or unlimited data storage will work best for you.

Strategy 1 — Back Up to a Local or Attached Drive

The first copy of the data you are working on is often on your desktop or laptop. You can create a second copy of your data on another drive or directory on your computer, or copy the data to a drive directly attached to your computer, such as via USB.

external hard drive and RAID NAS devices

Windows has built-in tools for both file and image level backup. Depending on which version of Windows you use, these tools are called Backup and Restore, File History, or Image. These tools enable you to set a schedule for automatic backups, which ensures that it is done regularly. You also have the choice to use Windows Explorer (aka File Explorer) to manually copy files to another location. Some external disk drives and USB Flash Drives come with their own backup software, and other backup utilities are available for free or for purchase.

Windows Explorer File History screenshot

This is a supply-your-own media solution, meaning that you need to have a hard disk or other medium available of sufficient size to hold all your backup data. When a disk becomes full, you’ll need to add a disk or swap out the full disk to continue your backups.

We’ve written previously on this strategy at Should I use an external drive for backup?

Strategy 2 — Back Up to a Local Area Network (LAN)

Computers, servers, and network-attached-storage (NAS) on your local network all can be used for backing up data. Microsoft’s built-in backup tools can be used for this job, as can any utility that supports network protocols such as NFS or SMB/CIFS, which are common protocols that allow shared access to files on a network for Windows and other operatings systems. There are many third-party applications available as well that provide extensive options for managing and scheduling backups and restoring data when needed.

NAS cloud

Multiple computers can be backed up to a single network-shared computer, server, or NAS, which also could then be backed up to the cloud, which rounds out a nice backup strategy, because it covers both local and remote copies of your data. System images of multiple computers on the LAN can be included in these backups if desired.

Again, you are managing the backup media on the local network, so you’ll need to be sure you have sufficient room on the destination drives to store all your backup data.

Strategy 3 — Back Up to Detached Drive at Another Location

You may have have read our recent blog post, Getting Data Archives Out of Your Closet, in which we discuss the practice of filling hard drives and storing them in a closet. Of course, to satisfy the off-site backup guideline, these drives would need to be stored in a closet that’s in a different geographical location than your main computer. If you’re willing to do all the work of copying the data to drives and transporting them to another location, this is a viable option.

stack of hard drives

The only limitation to the amount of backup data is the number of hard drives you are willing to purchase — and maybe the size of your closet.

Strategy 4 — Back Up to the Cloud

Backing up to the cloud has become a popular option for a number of reasons. Internet speeds have made moving large amounts of data possible, and not having to worry about supplying the storage media simplifies choices for users. Additionally, cloud vendors implement features such as data protection, deduplication, and encryption as part of their services that make cloud storage reliable, secure, and efficient. Unlimited cloud storage for data from a single computer is a popular option.

A backup vendor likely will provide a software client that runs on your computer and backs up your data to the cloud in the background while you’re doing other things, such as Backblaze Personal Backup, which has clients for Windows computers, Macintosh computers, and mobile apps for both iOS and Android. For restores, Backblaze users can download one or all of their files for free from anywhere in the world. Optionally, a 128 GB flash drive or 4 TB drive can be overnighted to the customer, with a refund available if the drive is returned.

Storage Pod in the cloud

Backblaze B2 Cloud Storage is an option for those who need capabilities beyond Backblaze’s Personal Backup. B2 provides cloud storage that is priced based on the amount of data the customer uses, and is suitable for long-term data storage. B2 supports integrations with NAS devices, as well as Windows, Macintosh, and Linux computers and servers.

Services such as BackBlaze B2 are often called Cloud Object Storage or IaaS (Infrastructure as a Service), because they provide a complete solution for storing all types of data in partnership with vendors who integrate various solutions for working with B2. B2 has its own API (Application Programming Interface) and CLI (Command-line Interface) to work with B2, but B2 becomes even more powerful when paired with any one of a number of other solutions for data storage and management provided by third parties who offer both hardware and software solutions.

Backing Up Windows Servers

Windows Servers are popular workstations for some users, and provide needed network services for others. They also can be used to store backups from other computers on the network. They, in turn, can be backed up to attached drives or the cloud. While our Personal Backup client doesn’t support Windows servers, our B2 Cloud Storage has a number of integrations with vendors who supply software or hardware for storing data both locally and on B2. We’ve written a number of blog posts and articles that address these solutions, including How to Back Up your Windows Server with B2 and CloudBerry.

Sometimes the Best Strategy is to Mix and Match

The great thing about computers, software, and networks is that there is an endless number of ways to combine them. Our users and hardware and software partners are ingenious in configuring solutions that save data locally, copy it to an attached or network drive, and then store it to the cloud.

image of cloud backup

Among our B2 partners, Synology, CloudBerry Archiware, QNAP, Morro Data, and GoodSync have integrations that allow their NAS devices to store and retrieve data to and from B2 Cloud Storage. For a drag-and-drop experience on the desktop, take a look at CyberDuck, MountainDuck, and Dropshare, which provide users with an easy and interactive way to store and use data in B2.

If you’d like to explore more options for combining software, hardware, and cloud solutions, we invite you to browse the integrations for our many B2 partners.

Have Questions?

Windows versions, tools, and backup terminology all can be confusing, and we know how hard it can be to make sense of all of it. If there’s something we haven’t addressed here, or if you have a question or contribution, please let us know in the comments.

And happy Windows Backup Day! (Just don’t tell Apple.)

The post Strategies for Backing Up Windows Computers appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

New Network Load Balancer – Effortless Scaling to Millions of Requests per Second

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-network-load-balancer-effortless-scaling-to-millions-of-requests-per-second/

Elastic Load Balancing (ELB)) has been an important part of AWS since 2009, when it was launched as part of a three-pack that also included Auto Scaling and Amazon CloudWatch. Since that time we have added many features, and also introduced the Application Load Balancer. Designed to support application-level, content-based routing to applications that run in containers, Application Load Balancers pair well with microservices, streaming, and real-time workloads.

Over the years, our customers have used ELB to support web sites and applications that run at almost any scale — from simple sites running on a T2 instance or two, all the way up to complex applications that run on large fleets of higher-end instances and handle massive amounts of traffic. Behind the scenes, ELB monitors traffic and automatically scales to meet demand. This process, which includes a generous buffer of headroom, has become quicker and more responsive over the years and works well even for our customers who use ELB to support live broadcasts, “flash” sales, and holidays. However, in some situations such as instantaneous fail-over between regions, or extremely spiky workloads, we have worked with our customers to pre-provision ELBs in anticipation of a traffic surge.

New Network Load Balancer
Today we are introducing the new Network Load Balancer (NLB). It is designed to handle tens of millions of requests per second while maintaining high throughput at ultra low latency, with no effort on your part. The Network Load Balancer is API-compatible with the Application Load Balancer, including full programmatic control of Target Groups and Targets. Here are some of the most important features:

Static IP Addresses – Each Network Load Balancer provides a single IP address for each VPC subnet in its purview. If you have targets in a subnet in us-west-2a and other targets in a subnet in us-west-2c, NLB will create and manage two IP addresses (one per subnet); connections to that IP address will spread traffic across the instances in the subnet. You can also specify an existing Elastic IP for each subnet for even greater control. With full control over your IP addresses, Network Load Balancer can be used in situations where IP addresses need to be hard-coded into DNS records, customer firewall rules, and so forth.

Zonality – The IP-per-subnet feature reduces latency with improved performance, improves availability through isolation and fault tolerance and makes the use of Network Load Balancers transparent to your client applications. Network Load Balancers also attempt to route a series of requests from a particular source to targets in a single subnet while still allowing automatic failover.

Source Address Preservation – With Network Load Balancer, the original source IP address and source ports for the incoming connections remain unmodified, so application software need not support X-Forwarded-For, proxy protocol, or other workarounds. This also means that normal firewall rules, including VPC Security Groups, can be used on targets.

Long-running Connections – NLB handles connections with built-in fault tolerance, and can handle connections that are open for months or years, making them a great fit for IoT, gaming, and messaging applications.

Failover – Powered by Route 53 health checks, NLB supports failover between IP addresses within and across regions.

Creating a Network Load Balancer
I can create a Network Load Balancer opening up the EC2 Console, selecting Load Balancers, and clicking on Create Load Balancer:

I choose Network Load Balancer and click on Create, then enter the details. I can choose an Elastic IP address for each subnet in the target VPC and I can tag the Network Load Balancer:

Then I click on Configure Routing and create a new target group. I enter a name, and then choose the protocol and port. I can also set up health checks that go to the traffic port or to the alternate of my choice:

Then I click on Register Targets and the EC2 instances that will receive traffic, and click on Add to registered:

I make sure that everything looks good and then click on Create:

The state of my new Load Balancer is provisioning, switching to active within a minute or so:

For testing purposes, I simply grab the DNS name of the Load Balancer from the console (in practice I would use Amazon Route 53 and a more friendly name):

Then I sent it a ton of traffic (I intended to let it run for just a second or two but got distracted and it created a huge number of processes, so this was a happy accident):

$ while true;
> do
>   wget http://nlb-1-6386cc6bf24701af.elb.us-west-2.amazonaws.com/phpinfo2.php &
> done

A more disciplined test would use a tool like Bees with Machine Guns, of course!

I took a quick break to let some traffic flow and then checked the CloudWatch metrics for my Load Balancer, finding that it was able to handle the sudden onslaught of traffic with ease:

I also looked at my EC2 instances to see how they were faring under the load (really well, it turns out):

It turns out that my colleagues did run a more disciplined test than I did. They set up a Network Load Balancer and backed it with an Auto Scaled fleet of EC2 instances. They set up a second fleet composed of hundreds of EC2 instances, each running Bees with Machine Guns and configured to generate traffic with highly variable request and response sizes. Beginning at 1.5 million requests per second, they quickly turned the dial all the way up, reaching over 3 million requests per second and 30 Gbps of aggregate bandwidth before maxing out their test resources.

Choosing a Load Balancer
As always, you should consider the needs of your application when you choose a load balancer. Here are some guidelines:

Network Load Balancer (NLB) – Ideal for load balancing of TCP traffic, NLB is capable of handling millions of requests per second while maintaining ultra-low latencies. NLB is optimized to handle sudden and volatile traffic patterns while using a single static IP address per Availability Zone.

Application Load Balancer (ALB) – Ideal for advanced load balancing of HTTP and HTTPS traffic, ALB provides advanced request routing that supports modern application architectures, including microservices and container-based applications.

Classic Load Balancer (CLB) – Ideal for applications that were built within the EC2-Classic network.

For a side-by-side feature comparison, see the Elastic Load Balancer Details table.

If you are currently using a Classic Load Balancer and would like to migrate to a Network Load Balancer, take a look at our new Load Balancer Copy Utility. This Python tool will help you to create a Network Load Balancer with the same configuration as an existing Classic Load Balancer. It can also register your existing EC2 instances with the new load balancer.

Pricing & Availability
Like the Application Load Balancer, pricing is based on Load Balancer Capacity Units, or LCUs. Billing is $0.006 per LCU, based on the highest value seen across the following dimensions:

  • Bandwidth – 1 GB per LCU.
  • New Connections – 800 per LCU.
  • Active Connections – 100,000 per LCU.

Most applications are bandwidth-bound and should see a cost reduction (for load balancing) of about 25% when compared to Application or Classic Load Balancers.

Network Load Balancers are available today in all AWS commercial regions except China (Beijing), supported by AWS CloudFormation, Auto Scaling, and Amazon ECS.

Jeff;

 

Disabling Intel Hyper-Threading Technology on Amazon EC2 Windows Instances

Post Syndicated from Brian Beach original https://aws.amazon.com/blogs/compute/disabling-intel-hyper-threading-technology-on-amazon-ec2-windows-instances/

In a prior post, Disabling Intel Hyper-Threading on Amazon Linux, I investigated how the Linux kernel enumerates CPUs. I also discussed the options to disable Intel Hyper-Threading (HT Technology) in Amazon Linux running on Amazon EC2.

In this post, I do the same for Microsoft Windows Server 2016 running on EC2 instances. I begin with a quick review of HT Technology and the reasons you might want to disable it. I also recommend that you take a moment to review the prior post for a more thorough foundation.

HT Technology

HT Technology makes a single physical processor appear as multiple logical processors. Each core in an Intel Xeon processor has two threads of execution. Most of the time, these threads can progress independently; one thread executing while the other is waiting on a relatively slow operation (for example, reading from memory) to occur. However, the two threads do share resources and occasionally one thread is forced to wait while the other is executing.

There a few unique situations where disabling HT Technology can improve performance. One example is high performance computing (HPC) workloads that rely heavily on floating point operations. In these rare cases, it can be advantageous to disable HT Technology. However, these cases are rare, and for the overwhelming majority of workloads you should leave it enabled. I recommend that you test with and without HT Technology enabled, and only disable threads if you are sure it will improve performance.

Exploring HT Technology on Microsoft Windows

Here’s how Microsoft Windows enumerates CPUs. As before, I am running these examples on an m4.2xlarge. I also chose to run Windows Server 2016, but you can walk through these exercises on any version of Windows. Remember that the m4.2xlarge has eight vCPUs, and each vCPU is a thread of an Intel Xeon core. Therefore, the m4.2xlarge has four cores, each of which run two threads, resulting in eight vCPUs.

Windows does not have a built-in utility to examine CPU configuration, but you can download the Sysinternals coreinfo utility from Microsoft’s website. This utility provides useful information about the system CPU and memory topology. For this walkthrough, you enumerate the individual CPUs, which you can do by running coreinfo -c. For example:

C:\Users\Administrator >coreinfo -c

Coreinfo v3.31 - Dump information on system CPU and memory topology
Copyright (C) 2008-2014 Mark Russinovich
Sysinternals - www.sysinternals.com

Logical to Physical Processor Map:
**------ Physical Processor 0 (Hyperthreaded)
--**---- Physical Processor 1 (Hyperthreaded)
----**-- Physical Processor 2 (Hyperthreaded)
------** Physical Processor 3 (Hyperthreaded)

As you can see from the screenshot, the coreinfo utility displays a table where each row is a physical core and each column is a logical CPU. In other words, the two asterisks on the first line indicate that CPU 0 and CPU 1 are the two threads in the first physical core. Therefore, my m4.2xlarge has for four physical processors and each processor has two threads resulting in eight total CPUs, just as expected.

It is interesting to note that Windows Server 2016 enumerates CPUs in a different order than Linux. Remember from the prior post that Linux enumerated the first thread in each core, followed by the second thread in each core. You can see from the output earlier that Windows Server 2016, enumerates both threads in the first core, then both threads in the second core, and so on. The diagram below shows the relationship of CPUs to cores and threads in both operating systems.

In the Linux post, I disabled CPUs 4–6, leaving one thread per core, and effectively disabling HT Technology. You can see from the diagram that you must disable the odd-numbered threads (that is, 1, 3, 5, and 7) to achieve the same result in Windows. Here’s how to do that.

Disabling HT Technology on Microsoft Windows

In Linux, you can globally disable CPUs dynamically. In Windows, there is no direct equivalent that I could find, but there are a few alternatives.

First, you can disable CPUs using the msconfig.exe tool. If you choose Boot, Advanced Options, you have the option to set the number of processors. In the example below, I limit my m4.2xlarge to four CPUs. Restart for this change to take effect.

Unfortunately, Windows does not disable hyperthreaded CPUs first and then real cores, as Linux does. As you can see in the following output, coreinfo reports that my c4.2xlarge has two real cores and four hyperthreads, after rebooting. Msconfig.exe is useful for disabling cores, but it does not allow you to disable HT Technology.

Note: If you have been following along, you can re-enable all your CPUs by unselecting the Number of processors check box and rebooting your system.

 

C:\Users\Administrator >coreinfo -c

Coreinfo v3.31 - Dump information on system CPU and memory topology
Copyright (C) 2008-2014 Mark Russinovich
Sysinternals - www.sysinternals.com

Logical to Physical Processor Map:
**-- Physical Processor 0 (Hyperthreaded)
--** Physical Processor 1 (Hyperthreaded)

While you cannot disable HT Technology systemwide, Windows does allow you to associate a particular process with one or more CPUs. Microsoft calls this, “processor affinity”. To see an example, use the following steps.

  1. Launch an instance of Notepad.
  2. Open Windows Task Manager and choose Processes.
  3. Open the context (right click) menu on notepad.exe and choose Set Affinity….

This brings up the Processor Affinity dialog box.

As you can see, all the CPUs are allowed to run this instance of notepad.exe. You can uncheck a few CPUs to exclude them. Windows is smart enough to allow any scheduled operations to continue to completion on disabled CPUs. It then saves its state at the next scheduling event, and resumes those operations on another CPU. To ensure that only one thread in each core is able to run a process, you uncheck every other core. This effectively disables HT Technology for this process. For example:

Of course, this can be tedious when you have a large number of cores. Remember that the x1.32xlarge has 128 CPUs. Luckily, you can set the affinity of a running process from PowerShell using the Get-Process cmdlet. For example:

PS C:\&gt; (Get-Process -Name 'notepad').ProcessorAffinity = 0x55;

The ProcessorAffinity attribute takes a bitmask in hexadecimal format. 0x55 in hex is equivalent to 01010101 in binary. Think of the binary encoding as 1=enabled and 0=disabled. This is slightly confusing, but we work left to right so that CPU 0 is the rightmost bit and CPU 7 is the leftmost bit. Therefore, 01010101 means that the first thread in each CPU is enabled just as it was in the diagram earlier.

The calculator built into Windows includes a “programmer view” that helps you convert from hexadecimal to binary. In addition, the ProcessorAffinity attribute is a 64-bit number. Therefore, you can only configure the processor affinity on systems up to 64 CPUs. At the moment, only the x1.32xlarge has more than 64 vCPUs.

In the preceding examples, you changed the processor affinity of a running process. Sometimes, you want to start a process with the affinity already configured. You can do this using the start command. The start command includes an affinity flag that takes a hexadecimal number like the PowerShell example earlier.

C:\Users\Administrator&gt;start /affinity 55 notepad.exe

It is interesting to note that a child process inherits the affinity from its parent. For example, the following commands create a batch file that launches Notepad, and starts the batch file with the affinity set. If you examine the instance of Notepad launched by the batch file, you see that the affinity has been applied to as well.

C:\Users\Administrator&gt;echo notepad.exe > test.bat
C:\Users\Administrator&gt;start /affinity 55 test.bat

This means that you can set the affinity of your task scheduler and any tasks that the scheduler starts inherits the affinity. So, you can disable every other thread when you launch the scheduler and effectively disable HT Technology for all of the tasks as well. Be sure to test this point, however, as some schedulers override the normal inheritance behavior and explicitly set processor affinity when starting a child process.

Conclusion

While the Windows operating system does not allow you to disable logical CPUs, you can set processor affinity on individual processes. You also learned that Windows Server 2016 enumerates CPUs in a different order than Linux. Therefore, you can effectively disable HT Technology by restricting a process to every other CPU. Finally, you learned how to set affinity of both new and running processes using Task Manager, PowerShell, and the start command.

Note: this technical approach has nothing to do with control over software licensing, or licensing rights, which are sometimes linked to the number of “CPUs” or “cores.” For licensing purposes, those are legal terms, not technical terms. This post did not cover anything about software licensing or licensing rights.

If you have questions or suggestions, please comment below.

Foxtel Targets 128 Torrent & Streaming Domains For Blocking Down Under

Post Syndicated from Andy original https://torrentfreak.com/foxtel-targets-128-torrent-streaming-domains-for-blocking-down-under-170808/

In 2015, Australia passed controversial legislation which allows ‘pirate’ sites located on servers overseas to be blocked at the ISP level.

“These offshore sites are not operated by noble spirits fighting for the freedom of the internet, they are run by criminals who profit from stealing other people’s creative endeavors,” commented then Foxtel chief executive Richard Freudenstein.

Before, during and after its introduction, Foxtel has positioned itself as a keen supporter of the resulting Section 115a of the Copyright Act. And in December 2016, with the law firmly in place, it celebrated success after obtaining a blocking injunction against The Pirate Bay, Torrentz, TorrentHound and isoHunt.

In May, Foxtel filed a new application, demanding that almost 50 local ISPs block what was believed to be a significant number of ‘pirate’ sites not covered by last year’s order.

Today the broadcasting giant was back in Federal Court, Sydney, to have this second application heard under Section 115a. It was revealed that the application contains 128 domains, each linked to movie and TV piracy.

According to ComputerWorld, the key sites targeted are as follows: YesMovies, Vumoo, LosMovies, CartoonHD, Putlocker, Watch Series 1, Watch Series 2, Project Free TV 1, Project Free TV 2, Watch Episodes, Watch Episode Series, Watch TV Series, The Dare Telly, Putlocker9.is, Putlocker9.to, Torlock and 1337x.

The Foxtel application targets both torrent and streaming sites but given the sample above, it seems that the latter is currently receiving the most attention. Streaming sites are appearing at a rapid rate and can even be automated to some extent, so this battle could become extremely drawn out.

Indeed, Justice Burley, who presided over the case this morning, described the website-blocking process (which necessarily includes targeting mirrors, proxies and replacement domains) as akin to “whack-a-mole”.

“Foxtel sees utility in orders of this nature,” counsel for Foxtel commented in response. “It’s important to block these sites.”

In presenting its application, Foxtel conducted live demonstrations of Yes Movies, Watch Series, 1337x, and Putlocker. It focused on the Australian prison drama series Wentworth, which has been running on Foxtel since 2013, but also featured tests of Game of Thrones.

Justice Burley told the court that since he’s a fan of the series, a spoiler-free piracy presentation would be appreciated. If the hearing had taken place a few days earlier, spoilers may have been possible. Last week, the latest episode of the show leaked onto the Internet from an Indian source before its official release.

Justice Burley’s decision will be handed down at a later date, but it’s unlikely there will be any serious problems with Foxtel’s application. After objecting to many aspects of blocking applications in the past, Australia’s ISPs no longer appear during these hearings. They are now paid AU$50 per domain blocked by companies such as Foxtel and play little more than a technical role in the process.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Create Multiple Builds from the Same Source Using Different AWS CodeBuild Build Specification Files

Post Syndicated from Prakash Palanisamy original https://aws.amazon.com/blogs/devops/create-multiple-builds-from-the-same-source-using-different-aws-codebuild-build-specification-files/

In June 2017, AWS CodeBuild announced you can now specify an alternate build specification file name or location in an AWS CodeBuild project.

In this post, I’ll show you how to use different build specification files in the same repository to create different builds. You’ll find the source code for this post in our GitHub repo.

Requirements

The AWS CLI must be installed and configured.

Solution Overview

I have created a C program (cbsamplelib.c) that will be used to create a shared library and another utility program (cbsampleutil.c) to use that library. I’ll use a Makefile to compile these files.

I need to put this sample application in RPM and DEB packages so end users can easily deploy them. I have created a build specification file for RPM. It will use make to compile this code and the RPM specification file (cbsample.rpmspec) configured in the build specification to create the RPM package. Similarly, I have created a build specification file for DEB. It will create the DEB package based on the control specification file (cbsample.control) configured in this build specification.

RPM Build Project:

The following build specification file (buildspec-rpm.yml) uses build specification version 0.2. As described in the documentation, this version has different syntax for environment variables. This build specification includes multiple phases:

  • As part of the install phase, the required packages is installed using yum.
  • During the pre_build phase, the required directories are created and the required files, including the RPM build specification file, are copied to the appropriate location.
  • During the build phase, the code is compiled, and then the RPM package is created based on the RPM specification.

As defined in the artifact section, the RPM file will be uploaded as a build artifact.

version: 0.2

env:
  variables:
    build_version: "0.1"

phases:
  install:
    commands:
      - yum install rpm-build make gcc glibc -y
  pre_build:
    commands:
      - curr_working_dir=`pwd`
      - mkdir -p ./{RPMS,SRPMS,BUILD,SOURCES,SPECS,tmp}
      - filename="cbsample-$build_version"
      - echo $filename
      - mkdir -p $filename
      - cp ./*.c ./*.h Makefile $filename
      - tar -zcvf /root/$filename.tar.gz $filename
      - cp /root/$filename.tar.gz ./SOURCES/
      - cp cbsample.rpmspec ./SPECS/
  build:
    commands:
      - echo "Triggering RPM build"
      - rpmbuild --define "_topdir `pwd`" -ba SPECS/cbsample.rpmspec
      - cd $curr_working_dir

artifacts:
  files:
    - RPMS/x86_64/cbsample*.rpm
  discard-paths: yes

Using cb-centos-project.json as a reference, create the input JSON file for the CLI command. This project uses an AWS CodeCommit repository named codebuild-multispec and a file named buildspec-rpm.yml as the build specification file. To create the RPM package, we need to specify a custom image name. I’m using the latest CentOS 7 image available in the Docker Hub. I’m using a role named CodeBuildServiceRole. It contains permissions similar to those defined in CodeBuildServiceRole.json. (You need to change the resource fields in the policy, as appropriate.)

{
    "name": "rpm-build-project",
    "description": "Project which will build RPM from the source.",
    "source": {
        "type": "CODECOMMIT",
        "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec",
        "buildspec": "buildspec-rpm.yml"
    },
    "artifacts": {
        "type": "S3",
        "location": "codebuild-demo-artifact-repository"
    },
    "environment": {
        "type": "LINUX_CONTAINER",
        "image": "centos:7",
        "computeType": "BUILD_GENERAL1_SMALL"
    },
    "serviceRole": "arn:aws:iam::012345678912:role/service-role/CodeBuildServiceRole",
    "timeoutInMinutes": 15,
    "encryptionKey": "arn:aws:kms:eu-west-1:012345678912:alias/aws/s3",
    "tags": [
        {
            "key": "Name",
            "value": "RPM Demo Build"
        }
    ]
}

After the cli-input-json file is ready, execute the following command to create the build project.

$ aws codebuild create-project --name CodeBuild-RPM-Demo --cli-input-json file://cb-centos-project.json

{
    "project": {
        "name": "CodeBuild-RPM-Demo", 
        "serviceRole": "arn:aws:iam::012345678912:role/service-role/CodeBuildServiceRole", 
        "tags": [
            {
                "value": "RPM Demo Build", 
                "key": "Name"
            }
        ], 
        "artifacts": {
            "namespaceType": "NONE", 
            "packaging": "NONE", 
            "type": "S3", 
            "location": "codebuild-demo-artifact-repository", 
            "name": "CodeBuild-RPM-Demo"
        }, 
        "lastModified": 1500559811.13, 
        "timeoutInMinutes": 15, 
        "created": 1500559811.13, 
        "environment": {
            "computeType": "BUILD_GENERAL1_SMALL", 
            "privilegedMode": false, 
            "image": "centos:7", 
            "type": "LINUX_CONTAINER", 
            "environmentVariables": []
        }, 
        "source": {
            "buildspec": "buildspec-rpm.yml", 
            "type": "CODECOMMIT", 
            "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec"
        }, 
        "encryptionKey": "arn:aws:kms:eu-west-1:012345678912:alias/aws/s3", 
        "arn": "arn:aws:codebuild:eu-west-1:012345678912:project/CodeBuild-RPM-Demo", 
        "description": "Project which will build RPM from the source."
    }
}

When the project is created, run the following command to start the build. After the build has started, get the build ID. You can use the build ID to get the status of the build.

$ aws codebuild start-build --project-name CodeBuild-RPM-Demo
{
    "build": {
        "buildComplete": false, 
        "initiator": "prakash", 
        "artifacts": {
            "location": "arn:aws:s3:::codebuild-demo-artifact-repository/CodeBuild-RPM-Demo"
        }, 
        "projectName": "CodeBuild-RPM-Demo", 
        "timeoutInMinutes": 15, 
        "buildStatus": "IN_PROGRESS", 
        "environment": {
            "computeType": "BUILD_GENERAL1_SMALL", 
            "privilegedMode": false, 
            "image": "centos:7", 
            "type": "LINUX_CONTAINER", 
            "environmentVariables": []
        }, 
        "source": {
            "buildspec": "buildspec-rpm.yml", 
            "type": "CODECOMMIT", 
            "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec"
        }, 
        "currentPhase": "SUBMITTED", 
        "startTime": 1500560156.761, 
        "id": "CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc", 
        "arn": "arn:aws:codebuild:eu-west-1: 012345678912:build/CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc"
    }
}

$ aws codebuild list-builds-for-project --project-name CodeBuild-RPM-Demo
{
    "ids": [
        "CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc"
    ]
}

$ aws codebuild batch-get-builds --ids CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc
{
    "buildsNotFound": [], 
    "builds": [
        {
            "buildComplete": true, 
            "phases": [
                {
                    "phaseStatus": "SUCCEEDED", 
                    "endTime": 1500560157.164, 
                    "phaseType": "SUBMITTED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560156.761
                }, 
                {
                    "contexts": [], 
                    "phaseType": "PROVISIONING", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 24, 
                    "startTime": 1500560157.164, 
                    "endTime": 1500560182.066
                }, 
                {
                    "contexts": [], 
                    "phaseType": "DOWNLOAD_SOURCE", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 15, 
                    "startTime": 1500560182.066, 
                    "endTime": 1500560197.906
                }, 
                {
                    "contexts": [], 
                    "phaseType": "INSTALL", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 19, 
                    "startTime": 1500560197.906, 
                    "endTime": 1500560217.515
                }, 
                {
                    "contexts": [], 
                    "phaseType": "PRE_BUILD", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560217.515, 
                    "endTime": 1500560217.662
                }, 
                {
                    "contexts": [], 
                    "phaseType": "BUILD", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560217.662, 
                    "endTime": 1500560217.995
                }, 
                {
                    "contexts": [], 
                    "phaseType": "POST_BUILD", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560217.995, 
                    "endTime": 1500560218.074
                }, 
                {
                    "contexts": [], 
                    "phaseType": "UPLOAD_ARTIFACTS", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 0, 
                    "startTime": 1500560218.074, 
                    "endTime": 1500560218.542
                }, 
                {
                    "contexts": [], 
                    "phaseType": "FINALIZING", 
                    "phaseStatus": "SUCCEEDED", 
                    "durationInSeconds": 4, 
                    "startTime": 1500560218.542, 
                    "endTime": 1500560223.128
                }, 
                {
                    "phaseType": "COMPLETED", 
                    "startTime": 1500560223.128
                }
            ], 
            "logs": {
                "groupName": "/aws/codebuild/CodeBuild-RPM-Demo", 
                "deepLink": "https://console.aws.amazon.com/cloudwatch/home?region=eu-west-1#logEvent:group=/aws/codebuild/CodeBuild-RPM-Demo;stream=57a36755-4d37-4b08-9c11-1468e1682abc", 
                "streamName": "57a36755-4d37-4b08-9c11-1468e1682abc"
            }, 
            "artifacts": {
                "location": "arn:aws:s3:::codebuild-demo-artifact-repository/CodeBuild-RPM-Demo"
            }, 
            "projectName": "CodeBuild-RPM-Demo", 
            "timeoutInMinutes": 15, 
            "initiator": "prakash", 
            "buildStatus": "SUCCEEDED", 
            "environment": {
                "computeType": "BUILD_GENERAL1_SMALL", 
                "privilegedMode": false, 
                "image": "centos:7", 
                "type": "LINUX_CONTAINER", 
                "environmentVariables": []
            }, 
            "source": {
                "buildspec": "buildspec-rpm.yml", 
                "type": "CODECOMMIT", 
                "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec"
            }, 
            "currentPhase": "COMPLETED", 
            "startTime": 1500560156.761, 
            "endTime": 1500560223.128, 
            "id": "CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc", 
            "arn": "arn:aws:codebuild:eu-west-1:012345678912:build/CodeBuild-RPM-Demo:57a36755-4d37-4b08-9c11-1468e1682abc"
        }
    ]
}

DEB Build Project:

In this project, we will use the build specification file named buildspec-deb.yml. Like the RPM build project, this specification includes multiple phases. Here I use a Debian control file to create the package in DEB format. After a successful build, the DEB package will be uploaded as build artifact.

version: 0.2

env:
  variables:
    build_version: "0.1"

phases:
  install:
    commands:
      - apt-get install gcc make -y
  pre_build:
    commands:
      - mkdir -p ./cbsample-$build_version/DEBIAN
      - mkdir -p ./cbsample-$build_version/usr/lib
      - mkdir -p ./cbsample-$build_version/usr/include
      - mkdir -p ./cbsample-$build_version/usr/bin
      - cp -f cbsample.control ./cbsample-$build_version/DEBIAN/control
  build:
    commands:
      - echo "Building the application"
      - make
      - cp libcbsamplelib.so ./cbsample-$build_version/usr/lib
      - cp cbsamplelib.h ./cbsample-$build_version/usr/include
      - cp cbsampleutil ./cbsample-$build_version/usr/bin
      - chmod +x ./cbsample-$build_version/usr/bin/cbsampleutil
      - dpkg-deb --build ./cbsample-$build_version

artifacts:
  files:
    - cbsample-*.deb

Here we use cb-ubuntu-project.json as a reference to create the CLI input JSON file. This project uses the same AWS CodeCommit repository (codebuild-multispec) but a different buildspec file in the same repository (buildspec-deb.yml). We use the default CodeBuild image to create the DEB package. We use the same IAM role (CodeBuildServiceRole).

{
    "name": "deb-build-project",
    "description": "Project which will build DEB from the source.",
    "source": {
        "type": "CODECOMMIT",
        "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec",
        "buildspec": "buildspec-deb.yml"
    },
    "artifacts": {
        "type": "S3",
        "location": "codebuild-demo-artifact-repository"
    },
    "environment": {
        "type": "LINUX_CONTAINER",
        "image": "aws/codebuild/ubuntu-base:14.04",
        "computeType": "BUILD_GENERAL1_SMALL"
    },
    "serviceRole": "arn:aws:iam::012345678912:role/service-role/CodeBuildServiceRole",
    "timeoutInMinutes": 15,
    "encryptionKey": "arn:aws:kms:eu-west-1:012345678912:alias/aws/s3",
    "tags": [
        {
            "key": "Name",
            "value": "Debian Demo Build"
        }
    ]
}

Using the CLI input JSON file, create the project, start the build, and check the status of the project.

$ aws codebuild create-project --name CodeBuild-DEB-Demo --cli-input-json file://cb-ubuntu-project.json

$ aws codebuild list-builds-for-project --project-name CodeBuild-DEB-Demo

$ aws codebuild batch-get-builds --ids CodeBuild-DEB-Demo:e535c4b0-7067-4fbe-8060-9bb9de203789

After successful completion of the RPM and DEB builds, check the S3 bucket configured in the artifacts section for the build packages. Build projects will create a directory in the name of the build project and copy the artifacts inside it.

$ aws s3 ls s3://codebuild-demo-artifact-repository/CodeBuild-RPM-Demo/
2017-07-20 16:16:59       8108 cbsample-0.1-1.el7.centos.x86_64.rpm

$ aws s3 ls s3://codebuild-demo-artifact-repository/CodeBuild-DEB-Demo/
2017-07-20 16:37:22       5420 cbsample-0.1.deb

Override Buildspec During Build Start:

It’s also possible to override the build specification file of an existing project when starting a build. If we want to create the libs RPM package instead of the whole RPM, we will use the build specification file named buildspec-libs-rpm.yml. This build specification file is similar to the earlier RPM build. The only difference is that it uses a different RPM specification file to create libs RPM.

version: 0.2

env:
  variables:
    build_version: "0.1"

phases:
  install:
    commands:
      - yum install rpm-build make gcc glibc -y
  pre_build:
    commands:
      - curr_working_dir=`pwd`
      - mkdir -p ./{RPMS,SRPMS,BUILD,SOURCES,SPECS,tmp}
      - filename="cbsample-libs-$build_version"
      - echo $filename
      - mkdir -p $filename
      - cp ./*.c ./*.h Makefile $filename
      - tar -zcvf /root/$filename.tar.gz $filename
      - cp /root/$filename.tar.gz ./SOURCES/
      - cp cbsample-libs.rpmspec ./SPECS/
  build:
    commands:
      - echo "Triggering RPM build"
      - rpmbuild --define "_topdir `pwd`" -ba SPECS/cbsample-libs.rpmspec
      - cd $curr_working_dir

artifacts:
  files:
    - RPMS/x86_64/cbsample-libs*.rpm
  discard-paths: yes

Using the same RPM build project that we created earlier, start a new build and set the value of the `–buildspec-override` parameter to buildspec-libs-rpm.yml .

$ aws codebuild start-build --project-name CodeBuild-RPM-Demo --buildspec-override buildspec-libs-rpm.yml
{
    "build": {
        "buildComplete": false, 
        "initiator": "prakash", 
        "artifacts": {
            "location": "arn:aws:s3:::codebuild-demo-artifact-repository/CodeBuild-RPM-Demo"
        }, 
        "projectName": "CodeBuild-RPM-Demo", 
        "timeoutInMinutes": 15, 
        "buildStatus": "IN_PROGRESS", 
        "environment": {
            "computeType": "BUILD_GENERAL1_SMALL", 
            "privilegedMode": false, 
            "image": "centos:7", 
            "type": "LINUX_CONTAINER", 
            "environmentVariables": []
        }, 
        "source": {
            "buildspec": "buildspec-libs-rpm.yml", 
            "type": "CODECOMMIT", 
            "location": "https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/codebuild-multispec"
        }, 
        "currentPhase": "SUBMITTED", 
        "startTime": 1500562366.239, 
        "id": "CodeBuild-RPM-Demo:82d05f8a-b161-401c-82f0-83cb41eba567", 
        "arn": "arn:aws:codebuild:eu-west-1:012345678912:build/CodeBuild-RPM-Demo:82d05f8a-b161-401c-82f0-83cb41eba567"
    }
}

After the build is completed successfully, check to see if the package appears in the artifact S3 bucket under the CodeBuild-RPM-Demo build project folder.

$ aws s3 ls s3://codebuild-demo-artifact-repository/CodeBuild-RPM-Demo/
2017-07-20 16:16:59       8108 cbsample-0.1-1.el7.centos.x86_64.rpm
2017-07-20 16:53:54       5320 cbsample-libs-0.1-1.el7.centos.x86_64.rpm

Conclusion

In this post, I have shown you how multiple buildspec files in the same source repository can be used to run multiple AWS CodeBuild build projects. I have also shown you how to provide a different buildspec file when starting the build.

For more information about AWS CodeBuild, see the AWS CodeBuild documentation. You can get started with AWS CodeBuild by using this step by step guide.


About the author

Prakash Palanisamy is a Solutions Architect for Amazon Web Services. When he is not working on Serverless, DevOps or Alexa, he will be solving problems in Project Euler. He also enjoys watching educational documentaries.

Avoiding TPM PCR fragility using Secure Boot

Post Syndicated from Matthew Garrett original http://mjg59.dreamwidth.org/48897.html

In measured boot, each component of the boot process is “measured” (ie, hashed and that hash recorded) in a register in the Trusted Platform Module (TPM) build into the system. The TPM has several different registers (Platform Configuration Registers, or PCRs) which are typically used for different purposes – for instance, PCR0 contains measurements of various system firmware components, PCR2 contains any option ROMs, PCR4 contains information about the partition table and the bootloader. The allocation of these is defined by the PC Client working group of the Trusted Computing Group. However, once the boot loader takes over, we’re outside the spec[1].

One important thing to note here is that the TPM doesn’t actually have any ability to directly interfere with the boot process. If you try to boot modified code on a system, the TPM will contain different measurements but boot will still succeed. What the TPM can do is refuse to hand over secrets unless the measurements are correct. This allows for configurations where your disk encryption key can be stored in the TPM and then handed over automatically if the measurements are unaltered. If anybody interferes with your boot process then the measurements will be different, the TPM will refuse to hand over the key, your disk will remain encrypted and whoever’s trying to compromise your machine will be sad.

The problem here is that a lot of things can affect the measurements. Upgrading your bootloader or kernel will do so. At that point if you reboot your disk fails to unlock and you become unhappy. To get around this your update system needs to notice that a new component is about to be installed, generate the new expected hashes and re-seal the secret to the TPM using the new hashes. If there are several different points in the update where this can happen, this can quite easily go wrong. And if it goes wrong, you’re back to being unhappy.

Is there a way to improve this? Surprisingly, the answer is “yes” and the people to thank are Microsoft. Appendix A of a basically entirely unrelated spec defines a mechanism for storing the UEFI Secure Boot policy and used keys in PCR 7 of the TPM. The idea here is that you trust your OS vendor (since otherwise they could just backdoor your system anyway), so anything signed by your OS vendor is acceptable. If someone tries to boot something signed by a different vendor then PCR 7 will be different. If someone disables secure boot, PCR 7 will be different. If you upgrade your bootloader or kernel, PCR 7 will be the same. This simplifies things significantly.

I’ve put together a (not well-tested) patchset for Shim that adds support for including Shim’s measurements in PCR 7. In conjunction with appropriate firmware, it should then be straightforward to seal secrets to PCR 7 and not worry about things breaking over system updates. This makes tying things like disk encryption keys to the TPM much more reasonable.

However, there’s still one pretty major problem, which is that the initramfs (ie, the component responsible for setting up the disk encryption in the first place) isn’t signed and isn’t included in PCR 7[2]. An attacker can simply modify it to stash any TPM-backed secrets or mount the encrypted filesystem and then drop to a root prompt. This, uh, reduces the utility of the entire exercise.

The simplest solution to this that I’ve come up with depends on how Linux implements initramfs files. In its simplest form, an initramfs is just a cpio archive. In its slightly more complicated form, it’s a compressed cpio archive. And in its peak form of evolution, it’s a series of compressed cpio archives concatenated together. As the kernel reads each one in turn, it extracts it over the previous ones. That means that any files in the final archive will overwrite files of the same name in previous archives.

My proposal is to generate a small initramfs whose sole job is to get secrets from the TPM and stash them in the kernel keyring, and then measure an additional value into PCR 7 in order to ensure that the secrets can’t be obtained again. Later disk encryption setup will then be able to set up dm-crypt using the secret already stored within the kernel. This small initramfs will be built into the signed kernel image, and the bootloader will be responsible for appending it to the end of any user-provided initramfs. This means that the TPM will only grant access to the secrets while trustworthy code is running – once the secret is in the kernel it will only be available for in-kernel use, and once PCR 7 has been modified the TPM won’t give it to anyone else. A similar approach for some kernel command-line arguments (the kernel, module-init-tools and systemd all interpret the kernel command line left-to-right, with later arguments overriding earlier ones) would make it possible to ensure that certain kernel configuration options (such as the iommu) weren’t overridable by an attacker.

There’s obviously a few things that have to be done here (standardise how to embed such an initramfs in the kernel image, ensure that luks knows how to use the kernel keyring, teach all relevant bootloaders how to handle these images), but overall this should make it practical to use PCR 7 as a mechanism for supporting TPM-backed disk encryption secrets on Linux without introducing a hug support burden in the process.

[1] The patchset I’ve posted to add measured boot support to Grub use PCRs 8 and 9 to measure various components during the boot process, but other bootloaders may have different policies.

[2] This is because most Linux systems generate the initramfs locally rather than shipping it pre-built. It may also get rebuilt on various userspace updates, even if the kernel hasn’t changed. Including it in PCR 7 would entirely break the fragility guarantees and defeat the point of all of this.

comment count unavailable comments

Synchronizing Amazon S3 Buckets Using AWS Step Functions

Post Syndicated from Andy Katz original https://aws.amazon.com/blogs/compute/synchronizing-amazon-s3-buckets-using-aws-step-functions/

Constantin Gonzalez is a Principal Solutions Architect at AWS

In my free time, I run a small blog that uses Amazon S3 to host static content and Amazon CloudFront to distribute it world-wide. I use a home-grown, static website generator to create and upload my blog content onto S3.

My blog uses two S3 buckets: one for staging and testing, and one for production. As a website owner, I want to update the production bucket with all changes from the staging bucket in a reliable and efficient way, without having to create and populate a new bucket from scratch. Therefore, to synchronize files between these two buckets, I use AWS Lambda and AWS Step Functions.

In this post, I show how you can use Step Functions to build a scalable synchronization engine for S3 buckets and learn some common patterns for designing Step Functions state machines while you do so.

Step Functions overview

Step Functions makes it easy to coordinate the components of distributed applications and microservices using visual workflows. Building applications from individual components that each perform a discrete function lets you scale and change applications quickly.

While this particular example focuses on synchronizing objects between two S3 buckets, it can be generalized to any other use case that involves coordinated processing of any number of objects in S3 buckets, or other, similar data processing patterns.

Bucket replication options

Before I dive into the details on how this particular example works, take a look at some alternatives for copying or replicating data between two Amazon S3 buckets:

  • The AWS CLI provides customers with a powerful aws s3 sync command that can synchronize the contents of one bucket with another.
  • S3DistCP is a powerful tool for users of Amazon EMR that can efficiently load, save, or copy large amounts of data between S3 buckets and HDFS.
  • The S3 cross-region replication functionality enables automatic, asynchronous copying of objects across buckets in different AWS regions.

In this use case, you are looking for a slightly different bucket synchronization solution that:

  • Works within the same region
  • Is more scalable than a CLI approach running on a single machine
  • Doesn’t require managing any servers
  • Uses a more finely grained cost model than the hourly based Amazon EMR approach

You need a scalable, serverless, and customizable bucket synchronization utility.

Solution architecture

Your solution needs to do three things:

  1. Copy all objects from a source bucket into a destination bucket, but leave out objects that are already present, for efficiency.
  2. Delete all "orphaned" objects from the destination bucket that aren’t present on the source bucket, because you don’t want obsolete objects lying around.
  3. Keep track of all objects for #1 and #2, regardless of how many objects there are.

In the beginning, you read in the source and destination buckets as parameters and perform basic parameter validation. Then, you operate two separate, independent loops, one for copying missing objects and one for deleting obsolete objects. Each loop is a sequence of Step Functions states that read in chunks of S3 object lists and use the continuation token to decide in a choice state whether to continue the loop or not.

This solution is based on the following architecture that uses Step Functions, Lambda, and two S3 buckets:

As you can see, this setup involves no servers, just two main building blocks:

  • Step Functions manages the overall flow of synchronizing the objects from the source bucket with the destination bucket.
  • A set of Lambda functions carry out the individual steps necessary to perform the work, such as validating input, getting lists of objects from source and destination buckets, copying or deleting objects in batches, and so on.

To understand the synchronization flow in more detail, look at the Step Functions state machine diagram for this example.

Walkthrough

Here’s a detailed discussion of how this works.

To follow along, use the code in the sync-buckets-state-machine GitHub repo. The code comes with a ready-to-run deployment script in Python that takes care of all the IAM roles, policies, Lambda functions, and of course the Step Functions state machine deployment using AWS CloudFormation, as well as instructions on how to use it.

Fine print: Use at your own risk

Before I start, here are some disclaimers:

  • Educational purposes only.

    The following example and code are intended for educational purposes only. Make sure that you customize, test, and review it on your own before using any of this in production.

  • S3 object deletion.

    In particular, using the code included below may delete objects on S3 in order to perform synchronization. Make sure that you have backups of your data. In particular, consider using the Amazon S3 Versioning feature to protect yourself against unintended data modification or deletion.

Step Functions execution starts with an initial set of parameters that contain the source and destination bucket names in JSON:

{
    "source":       "my-source-bucket-name",
    "destination":  "my-destination-bucket-name"
}

Armed with this data, Step Functions execution proceeds as follows.

Step 1: Detect the bucket region

First, you need to know the regions where your buckets reside. In this case, take advantage of the Step Functions Parallel state. This allows you to use a Lambda function get_bucket_location.py inside two different, parallel branches of task states:

  • FindRegionForSourceBucket
  • FindRegionForDestinationBucket

Each task state receives one bucket name as an input parameter, then detects the region corresponding to "their" bucket. The output of these functions is collected in a result array containing one element per parallel function.

Step 2: Combine the parallel states

The output of a parallel state is a list with all the individual branches’ outputs. To combine them into a single structure, use a Lambda function called combine_dicts.py in its own CombineRegionOutputs task state. The function combines the two outputs from step 1 into a single JSON dict that provides you with the necessary region information for each bucket.

Step 3: Validate the input

In this walkthrough, you only support buckets that reside in the same region, so you need to decide if the input is valid or if the user has given you two buckets in different regions. To find out, use a Lambda function called validate_input.py in the ValidateInput task state that tests if the two regions from the previous step are equal. The output is a Boolean.

Step 4: Branch the workflow

Use another type of Step Functions state, a Choice state, which branches into a Failure state if the comparison in step 3 yields false, or proceeds with the remaining steps if the comparison was successful.

Step 5: Execute in parallel

The actual work is happening in another Parallel state. Both branches of this state are very similar to each other and they re-use some of the Lambda function code.

Each parallel branch implements a looping pattern across the following steps:

  1. Use a Pass state to inject either the string value "source" (InjectSourceBucket) or "destination" (InjectDestinationBucket) into the listBucket attribute of the state document.

    The next step uses either the source or the destination bucket, depending on the branch, while executing the same, generic Lambda function. You don’t need two Lambda functions that differ only slightly. This step illustrates how to use Pass states as a way of injecting constant parameters into your state machine and as a way of controlling step behavior while re-using common step execution code.

  2. The next step UpdateSourceKeyList/UpdateDestinationKeyList lists objects in the given bucket.

    Remember that the previous step injected either "source" or "destination" into the state document’s listBucket attribute. This step uses the same list_bucket.py Lambda function to list objects in an S3 bucket. The listBucket attribute of its input decides which bucket to list. In the left branch of the main parallel state, use the list of source objects to work through copying missing objects. The right branch uses the list of destination objects, to check if they have a corresponding object in the source bucket and eliminate any orphaned objects. Orphans don’t have a source object of the same S3 key.

  3. This step performs the actual work. In the left branch, the CopySourceKeys step uses the copy_keys.py Lambda function to go through the list of source objects provided by the previous step, then copies any missing object into the destination bucket. Its sister step in the other branch, DeleteOrphanedKeys, uses its destination bucket key list to test whether each object from the destination bucket has a corresponding source object, then deletes any orphaned objects.

  4. The S3 ListObjects API action is designed to be scalable across many objects in a bucket. Therefore, it returns object lists in chunks of configurable size, along with a continuation token. If the API result has a continuation token, it means that there are more objects in this list. You can work from token to token to continue getting object list chunks, until you get no more continuation tokens.

By breaking down large amounts of work into chunks, you can make sure each chunk is completed within the timeframe allocated for the Lambda function, and within the maximum input/output data size for a Step Functions state.

This approach comes with a slight tradeoff: the more objects you process at one time in a given chunk, the faster you are done. There’s less overhead for managing individual chunks. On the other hand, if you process too many objects within the same chunk, you risk going over time and space limits of the processing Lambda function or the Step Functions state so the work cannot be completed.

In this particular case, use a Lambda function that maximizes the number of objects listed from the S3 bucket that can be stored in the input/output state data. This is currently up to 32,768 bytes, assuming (based on some experimentation) that the execution of the COPY/DELETE requests in the processing states can always complete in time.

A more sophisticated approach would use the Step Functions retry/catch state attributes to account for any time limits encountered and adjust the list size accordingly through some list site adjusting.

Step 6: Test for completion

Because the presence of a continuation token in the S3 ListObjects output signals that you are not done processing all objects yet, use a Choice state to test for its presence. If a continuation token exists, it branches into the UpdateSourceKeyList step, which uses the token to get to the next chunk of objects. If there is no token, you’re done. The state machine then branches into the FinishCopyBranch/FinishDeleteBranch state.

By using Choice states like this, you can create loops exactly like the old times, when you didn’t have for statements and used branches in assembly code instead!

Step 7: Success!

Finally, you’re done, and can step into your final Success state.

Lessons learned

When implementing this use case with Step Functions and Lambda, I learned the following things:

  • Sometimes, it is necessary to manipulate the JSON state of a Step Functions state machine with just a few lines of code that hardly seem to warrant their own Lambda function. This is ok, and the cost is actually pretty low given Lambda’s 100 millisecond billing granularity. The upside is that functions like these can be helpful to make the data more palatable for the following steps or for facilitating Choice states. An example here would be the combine_dicts.py function.
  • Pass states can be useful beyond debugging and tracing, they can be used to inject arbitrary values into your state JSON and guide generic Lambda functions into doing specific things.
  • Choice states are your friend because you can build while-loops with them. This allows you to reliably grind through large amounts of data with the patience of an engine that currently supports execution times of up to 1 year.

    Currently, there is an execution history limit of 25,000 events. Each Lambda task state execution takes up 5 events, while each choice state takes 2 events for a total of 7 events per loop. This means you can loop about 3500 times with this state machine. For even more scalability, you can split up work across multiple Step Functions executions through object key sharding or similar approaches.

  • It’s not necessary to spend a lot of time coding exception handling within your Lambda functions. You can delegate all exception handling to Step Functions and instead simplify your functions as much as possible.

  • Step Functions are great replacements for shell scripts. This could have been a shell script, but then I would have had to worry about where to execute it reliably, how to scale it if it went beyond a few thousand objects, etc. Think of Step Functions and Lambda as tools for scripting at a cloud level, beyond the boundaries of servers or containers. "Serverless" here also means "boundary-less".

Summary

This approach gives you scalability by breaking down any number of S3 objects into chunks, then using Step Functions to control logic to work through these objects in a scalable, serverless, and fully managed way.

To take a look at the code or tweak it for your own needs, use the code in the sync-buckets-state-machine GitHub repo.

To see more examples, please visit the Step Functions Getting Started page.

Enjoy!

[$] The Brave web browser

Post Syndicated from jake original https://lwn.net/Articles/725261/rss

The Brave web browser is a project from
a new company called Brave Software. It was founded by Brendan Eich, who is the
inventor of JavaScript and former developer and CTO at Mozilla; he
hopes to dramatically re-invent the advertising model of the web while
strengthening user anonymity and security. Brave’s value proposition is
that instead of being served advertisements from web sites that use the
revenue to pay their bills, users can opt to directly pay the content
providers of their choosing with cryptocurrency. Also, there is a
recognition of the
utility of targeted advertising, so users have an option of saving a local,
protected profile that can be used anonymously to obtain targeted
advertisements instead of having their online behavior tracked and sold by
a third party.

[$] What’s new in gnuplot 5.2

Post Syndicated from corbet original https://lwn.net/Articles/723818/rss

This article is a tour of some of the newest features in the gnuplot plotting utility.
Some of these features are already present in
the 5.0 release, and some are planned for the next
official release, which will be gnuplot 5.2. Highlights in the
upcoming release
include hypertext labels, more control over axes, a long-awaited ability to
add labels to contours, better lighting effects, and more; read on for the
details.

scanless – A Public Port Scan Scraper

Post Syndicated from Darknet original http://feedproxy.google.com/~r/darknethackers/~3/BzB9c8HkhZo/

scanless is a Python-based command-line utility that functions as a public port scan scraper, it can use websites that can perform port scans on your behalf. This is useful for early stages of penetration tests when you’d like to run a port scan on a host without having it originate from your IP address. Public […]

The post scanless –…

Read the full post at darknet.org.uk

Tips for Migrating to Apache HBase on Amazon S3 from HDFS

Post Syndicated from Bruno Faria original https://aws.amazon.com/blogs/big-data/tips-for-migrating-to-apache-hbase-on-amazon-s3-from-hdfs/

Starting with Amazon EMR 5.2.0, you have the option to run Apache HBase on Amazon S3. Running HBase on S3 gives you several added benefits, including lower costs, data durability, and easier scalability.

HBase provides several options that you can use to migrate and back up HBase tables. The steps to migrate to HBase on S3 are similar to the steps for HBase on the Apache Hadoop Distributed File System (HDFS). However, the migration can be easier if you are aware of some minor differences and a few “gotchas.”

In this post, I describe how to use some of the common HBase migration options to get started with HBase on S3.

HBase migration options

Selecting the right migration method and tools is an important step in ensuring a successful HBase table migration. However, choosing the right ones is not always an easy task.

The following HBase feature and utilities help you migrate to HBase on S3:

  • snapshots
  • Export and Import
  • CopyTable

The following diagram summarizes the steps for each option.

Various factors determine the HBase migration method that you use. For example, EMR offers HBase version 1.2.3 as the earliest version that you can run on S3. Therefore, the HBase version that you’re migrating from can be an important factor in helping you decide. For more information about HBase versions and compatibility, see the HBase version number and compatibility documentation in the Apache HBase Reference Guide.

If you’re migrating from an older version of HBase (for example, HBase 0.94), you should test your application to make sure it’s compatible with newer HBase API versions. You don’t want to spend several hours migrating a large table only to find out that your application and API have issues with a different HBase version.

The good news is that HBase provides utilities that you can use to migrate only part of a table. This lets you test your existing HBase applications without having to fully migrate entire HBase tables. For example, you can use the Export, Import, or CopyTable utilities to migrate a small part of your table to HBase on S3. After you confirm that your application works with newer HBase versions, you can proceed with migrating the entire table using HBase snapshots.

Option 1: Migrate to HBase on S3 using snapshots

You can create table backups easily by using HBase snapshots. HBase also provides the ExportSnapshot utility, which lets you export snapshots to a different location, like S3. In this section, I discuss how you can combine snapshots with ExportSnapshot to migrate tables to HBase on S3.

For details about how you can use HBase snapshots to perform table backups, see Using HBase Snapshots in the Amazon EMR Release Guide and HBase Snapshots in the Apache HBase Reference Guide. These resources provide additional settings and configurations that you can use with snapshots and ExportSnapshot.

The following example shows how to use snapshots to migrate HBase tables to HBase on S3.

Note: Earlier HBase versions, like HBase 0.94, have a different snapshot structure than HBase 1.x, which is what you’re migrating to. If you’re migrating from HBase 0.94 using snapshots, you get a TableInfoMissingException error when you try to restore the table. For details about migrating from HBase 0.94 using snapshots, see the Migrating from HBase 0.94 section.

  1. From the source HBase cluster, create a snapshot of your table:
    $ echo "snapshot '<table_name>', '<snapshot_name>'" | hbase shell

  2. Export the snapshot to an S3 bucket:
    $ hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot <snapshot_name> -copy-to s3://<HBase_on_S3_root_dir>/

    For the -copy-to parameter in the ExportSnapshot utility, specify the S3 location that you are using for the HBase root directory of your EMR cluster. If your cluster is already up and running, you can find its S3 hbase.rootdir value by viewing the cluster’s Configurations in the EMR console, or by using the AWS CLI. Here’s the command to find that value:

    $ aws emr describe-cluster --cluster-id <cluster_id> | grep hbase.rootdir

  3. Launch an EMR cluster that uses the S3 storage option with HBase (skip this step if you already have one up and running). For detailed steps, see Creating a Cluster with HBase Using the Console in the Amazon EMR Release Guide. When launching the cluster, ensure that the HBase root directory is set to the same S3 location as your exported snapshots (that is, the location used in the -copy-to parameter in the previous step).
  4. Restore or clone the HBase table from that snapshot.
    • To restore the table and keep the same table name as the source table, use restore_snapshot:
      $ echo "restore_snapshot '<SNAPSHOT_NAME>'"| hbase shell

    • To restore the table into a different table name, use clone_snapshot:
      $ echo "clone_snapshot '<snapshot_name>', '<table_name>'" | hbase shell

Migrating from HBase 0.94 using snapshots

If you’re migrating from HBase version 0.94 using the snapshot method, you get an error if you try to restore from the snapshot. This is because the structure of a snapshot in HBase 0.94 is different from the snapshot structure in HBase 1.x.

The following steps show how to fix an HBase 0.94 snapshot so that it can be restored to an HBase on S3 table.

  1. Complete steps 1—3 in the previous example to create and export a snapshot.
  2. From your destination cluster, follow these steps to repair the snapshot:
    • Use s3-dist-cp to copy the snapshot data (archive) directory into a new directory. The archive directory contains your snapshot data. Depending on your table size, it might be large. Use s3-dist-cp to make this step faster:
      $ s3-dist-cp --src s3://<HBase_on_S3_root_dir>/.archive/<table_name> --dest s3://<HBase_on_S3_root_dir>/archive/data/default/<table_name>

    • Create and fix the snapshot descriptor file:
      $ hdfs dfs -mkdir s3://<HBase_on_S3_root_dir>/.hbase-snapshot/<snapshot_name>/.tabledesc
      
      $ hdfs dfs -mv s3://<HBase_on_S3_root_dir>/.hbase-snapshot/<snapshot_name>/.tableinfo.<*> s3://<HBase_on_S3_root_dir>/.hbase-snapshot/<snapshot_name>/.tabledesc

  3. Restore the snapshot:
    $ echo "restore_snapshot '<snapshot_name>'" | hbase shell

Option 2: Migrate to HBase on S3 using Export and Import

As I discussed in the earlier sections, HBase snapshots and ExportSnapshot are great options for migrating tables. But sometimes you want to migrate only part of a table, so you need a different tool. In this section, I describe how to use the HBase Export and Import utilities.

The steps to migrate a table to HBase on S3 using Export and Import is not much different from the steps provided in the HBase documentation. In those docs, you can also find detailed information, including how you can use them to migrate part of a table.

The following steps show how you can use Export and Import to migrate a table to HBase on S3.

  1. From your source cluster, export the HBase table:
    $ hbase org.apache.hadoop.hbase.mapreduce.Export <table_name> s3://<table_s3_backup>/<location>/

  2. In the destination cluster, create the target table into which to import data. Ensure that the column families in the target table are identical to the exported/source table’s column families.
  3. From the destination cluster, import the table using the Import utility:
    $ hbase org.apache.hadoop.hbase.mapreduce.Import '<table_name>' s3://<table_s3_backup>/<location>/

HBase snapshots are usually the recommended method to migrate HBase tables. However, the Export and Import utilities can be useful for test use cases in which you migrate only a small part of your table and test your application. It’s also handy if you’re migrating from an HBase cluster that does not have the HBase snapshots feature.

Option 3: Migrate to HBase on S3 using CopyTable

Similar to the Export and Import utilities, CopyTable is an HBase utility that you can use to copy part of HBase tables. However, keep in mind that CopyTable doesn’t work if you’re copying or migrating tables between HBase versions that are not wire compatible (for example, copying from HBase 0.94 to HBase 1.x).

For more information and examples, see CopyTable in the HBase documentation.

Conclusion

In this post, I demonstrated how you can use common HBase backup utilities to migrate your tables easily to HBase on S3. By using HBase snapshots, you can migrate entire tables to HBase on S3. To test HBase on S3 by migrating or copying only part of your tables, you can use the HBase Export, Import, or CopyTable utilities.

If you have questions or suggestions, please comment below.

 


About the Author

Bruno Faria is an EMR Solution Architect with AWS. He works with our customers to provide them architectural guidance for running complex applications on Amazon EMR. In his spare time, he enjoys spending time with his family and learning about new big data solutions.

 


Related

Low-Latency Access on Trillions of Records: FINRA’s Architecture Using Apache HBase on Amazon EMR with Amazon S3

 

 

 

 

 

 

AWS Big Data Blog Month in Review: March 2017

Post Syndicated from Derek Young original https://aws.amazon.com/blogs/big-data/aws-big-data-blog-month-in-review-march-2017/

Another month of big data solutions on the Big Data Blog. Please take a look at our summaries below and learn, comment, and share. Thank you for reading!

Analyze Security, Compliance, and Operational Activity Using AWS CloudTrail and Amazon Athena
In this blog post, walk through how to set up and use the recently released Amazon Athena CloudTrail SerDe to query CloudTrail log files for EC2 security group modifications, console sign-in activity, and operational account activity.  

Big Updates to the Big Data on AWS Training Course!
AWS offers a range of training resources to help you advance your knowledge with practical skills so you can get more out of the cloud. We’ve updated Big Data on AWS, a three-day, instructor-led training course to keep pace with the latest AWS big data innovations. This course allows you to hear big data best practices from an expert, get answers to your questions in person, and get hands-on practice using AWS big data services. 

Analyzing VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight
In this blog post, build a serverless architecture using Amazon Kinesis Firehose, AWS Lambda, Amazon S3, Amazon Athena, and Amazon QuickSight to collect, store, query, and visualize flow logs. In building this solution, you also learn how to implement Athena best practices with regard to compressing and partitioning data so as to reduce query latencies and drive down query costs. 

Amazon Redshift Monitoring Now Supports End User Queries and Canaries
The serverless Amazon Redshift Monitoring utility lets you gather important performance metrics from your Redshift cluster’s system tables and persists the results in Amazon CloudWatch. You can now create your own diagnostic queries and plug-in “canaries” that monitor the runtime of your most vital end user queries. These user-defined metrics can be used to create dashboards and trigger Alarms and should improve visibility into workloads running on a Cluster.  

Running R on Amazon Athena
In this blog post, connect R/RStudio running on an Amazon EC2 instance with Athena. You’ll learn to build a simple interactive application with Athena and R. Athena can be used to store and query the underlying data for your big data applications using standard SQL, while R can be used to interactively query Athena and generate analytical insights using the powerful set of libraries that R provides. This post has been translated into Japanese. 

Top 10 Performance Tuning Tips for Amazon Athena
In this blog post, we review the top 10 tips that can improve query performance. We focus on aspects related to storing data in Amazon S3 and tuning specific to queries. Amazon Athena uses Presto to run SQL queries and hence some of the advice will work if you are running Presto on Amazon EMR. This post has been translated into Japanese. 

Big Data Resources on the AWS Knowledge Center
The AWS Knowledge Center answers the questions we receive most frequently from AWS customers. It is a resource for you that is distinct from AWS Documentation, the AWS Discussion Forums, and the AWS Support Center. It covers questions from across every AWS service. This post is an introduction to Big Data resources on the AWS Knowledge Center. 

Encrypt and Decrypt Amazon Kinesis Records Using AWS KMS
In this bog post, learn to build encryption and decryption into sample Kinesis producer and consumer applications using the Amazon Kinesis Producer Library (KPL), the Amazon Kinesis Consumer Library (KCL), AWS KMS, and the aws-encryption-sdk. The methods and the techniques used in this post to encrypt and decrypt Kinesis records can be easily replicated into your architecture.

Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.

Amazon Redshift Monitoring Now Supports End User Queries and Canaries

Post Syndicated from Ian Meyers original https://aws.amazon.com/blogs/big-data/amazon-redshift-advance-monitoring-now-supports-end-user-queries-and-canaries/

Ian Meyers is a Solutions Architecture Senior Manager with AWS

The serverless Amazon Redshift Monitoring utility lets you gather important performance metrics from your Redshift cluster’s system tables and persists the results in Amazon CloudWatch. This serverless solution leverages AWS Lambda to schedule custom SQL queries and process the results. With this utility, you can use Amazon Cloudwatch to monitor disk-based queries, WLM queue wait time, alerts, average query times, and other data. This allows you to create visualizations with CloudWatch dashboards, generate Alerts on specific values, and create Rules to react to those Alerts.

redshift_monitoring

You can now create your own diagnostic queries and plug-in “canaries” that monitor the runtime of your most vital end user queries. These user-defined metrics can be used to create dashboards and trigger Alarms and should improve visibility into workloads running on a Cluster. They might also facilitate sizing discussions.

View the README to get started.


Related

Top 10 Performance Tuning Techniques for Amazon Redshift

o_redshift_update_1