Tag Archives: vps

Progressing from tech to leadership

Post Syndicated from Michal Zalewski original http://lcamtuf.blogspot.com/2018/02/on-leadership.html

I’ve been a technical person all my life. I started doing vulnerability research in the late 1990s – and even today, when I’m not fiddling with CNC-machined robots or making furniture, I’m probably cobbling together a fuzzer or writing a book about browser protocols and APIs. In other words, I’m a geek at heart.

My career is a different story. Over the past two decades and change, I went from writing CGI scripts and setting up WAN routers for a chain of shopping malls, to doing pentests for institutional customers, to designing a series of network monitoring platforms and handling incident response for a big telco, to building and running the product security org for one of the largest companies in the world. It’s been an interesting ride – and now that I’m on the hook for the well-being of about 100 folks across more than a dozen subteams around the world, I’ve been thinking a bit about the lessons learned along the way.

Of course, I’m a bit hesitant to write such a post: sometimes, your efforts pan out not because of your approach, but despite it – and it’s possible to draw precisely the wrong conclusions from such anecdotes. Still, I’m very proud of the culture we’ve created and the caliber of folks working on our team. It happened through the work of quite a few talented tech leads and managers even before my time, but it did not happen by accident – so I figured that my observations may be useful for some, as long as they are taken with a grain of salt.

But first, let me start on a somewhat somber note: what nobody tells you is that one’s level on the leadership ladder tends to be inversely correlated with several measures of happiness. The reason is fairly simple: as you get more senior, a growing number of people will come to you expecting you to solve increasingly fuzzy and challenging problems – and you will no longer be patted on the back for doing so. This should not scare you away from such opportunities, but it definitely calls for a particular mindset: your motivation must come from within. Look beyond the fight-of-the-day; find satisfaction in seeing how far your teams have come over the years.

With that out of the way, here’s a collection of notes, loosely organized into three major themes.

The curse of a techie leader

Perhaps the most interesting observation I have is that for a person coming from a technical background, building a healthy team is first and foremost about the subtle art of letting go.

There is a natural urge to stay involved in any project you’ve started or helped improve; after all, it’s your baby: you’re familiar with all the nuts and bolts, and nobody else can do this job as well as you. But as your sphere of influence grows, this becomes a choke point: there are only so many things you could be doing at once. Just as importantly, the project-hoarding behavior robs more junior folks of the ability to take on new responsibilities and bring their own ideas to life. In other words, when done properly, delegation is not just about freeing up your plate; it’s also about empowerment and about signalling trust.

Of course, when you hand your project over to somebody else, the new owner will initially be slower and more clumsy than you; but if you pick the new leads wisely, give them the right tools and the right incentives, and don’t make them deathly afraid of messing up, they will soon excel at their new jobs – and be grateful for the opportunity.

A related affliction of many accomplished techies is the conviction that they know the answers to every question even tangentially related to their domain of expertise; that belief is coupled with a burning desire to have the last word in every debate. When practiced in moderation, this behavior is fine among peers – but for a leader, one of the most important skills to learn is knowing when to keep your mouth shut: people learn a lot better by experimenting and making small mistakes than by being schooled by their boss, and they often try to read into your passing remarks. Don’t run an authoritarian camp focused on total risk aversion or perfectly efficient resource management; just set reasonable boundaries and exit conditions for experiments so that they don’t spiral out of control – and be amazed by the results every now and then.

Death by planning

When nothing is on fire, it’s easy to get preoccupied with maintaining the status quo. If your current headcount or budget request lists all the same projects as last year’s, or if you ever find yourself ending an argument by deferring to a policy or a process document, it’s probably a sign that you’re getting complacent. In security, complacency usually ends in tears – and when it doesn’t, it leads to burnout or boredom.

In my experience, your goal should be to develop a cadre of managers or tech leads capable of coming up with clever ideas, prioritizing them among themselves, and seeing them to completion without your day-to-day involvement. In your spare time, make it your mission to challenge them to stay ahead of the curve. Ask your vendor security lead how they’d streamline their work if they had a 40% jump in the number of vendors but no extra headcount; ask your product security folks what’s the second line of defense or containment should your primary defenses fail. Help them get good ideas off the ground; set some mental success and failure criteria to be able to cut your losses if something does not pan out.

Of course, malfunctions happen even in the best-run teams; to spot trouble early on, instead of overzealous project tracking, I found it useful to encourage folks to run a data-driven org. I’d usually ask them to imagine that a brand new VP shows up in our office and, as his first order of business, asks “why do you have so many people here and how do I know they are doing the right things?”. Not everything in security can be quantified, but hard data can validate many of your assumptions – and will alert you to unseen issues early on.

When focusing on data, it’s important not to treat pie charts and spreadsheets as an art unto itself; if you run a security review process for your company, your CSAT scores are going to reach 100% if you just rubberstamp every launch request within ten minutes of receiving it. Make sure you’re asking the right questions; instead of “how satisfied are you with our process”, try “is your product better as a consequence of talking to us?”

Whenever things are not progressing as expected, it is a natural instinct to fall back to micromanagement, but it seldom truly cures the ill. It’s probable that your team disagrees with your vision or its feasibility – and that you’re either not listening to their feedback, or they don’t think you’d care. It’s good to assume that most of your employees are as smart or smarter than you; barking your orders at them more loudly or more frequently does not lead anyplace good. It’s good to listen to them and either present new facts or work with them on a plan you can all get behind.

In some circumstances, all that’s needed is honesty about the business trade-offs, so that your team feels like your “partner in crime”, not a victim of circumstance. For example, we’d tell our folks that by not falling behind on basic, unglamorous work, we earn the trust of our VPs and SVPs – and that this translates into the independence and the resources we need to pursue more ambitious ideas without being told what to do; it’s how we game the system, so to speak. Oh: leading by example is a pretty powerful tool at your disposal, too.

The human factor

I’ve come to appreciate that hiring decent folks who can get along with others is far more important than trying to recruit conference-circuit superstars. In fact, hiring superstars is a decidedly hit-and-miss affair: while certainly not a rule, there is a proportion of folks who put the maintenance of their celebrity status ahead of job responsibilities or the well-being of their peers.

For teams, one of the most powerful demotivators is a sense of unfairness and disempowerment. This is where tech-originating leaders can shine, because their teams usually feel that their bosses understand and can evaluate the merits of the work. But it also means you need to be decisive and actually solve problems for them, rather than just letting them vent. You will need to make unpopular decisions every now and then; in such cases, I think it’s important to move quickly, rather than prolonging the uncertainty – but it’s also important to sincerely listen to concerns, explain your reasoning, and be frank about the risks and trade-offs.

Whenever you see a clash of personalities on your team, you probably need to respond swiftly and decisively; being right should not justify being a bully. If you don’t react to repeated scuffles, your best people will probably start looking for other opportunities: it’s draining to put up with constant pie fights, no matter if the pies are thrown straight at you or if you just need to duck one every now and then.

More broadly, personality differences seem to be a much better predictor of conflict than any technical aspects underpinning a debate. As a boss, you need to identify such differences early on and come up with creative solutions. Sometimes, all you need is taking some badly-delivered but valid feedback and having a conversation with the other person, asking some questions that can help them reach the same conclusions without feeling that their worldview is under attack. Other times, the only path forward is making sure that some folks simply don’t run into each other for a while.

Finally, dealing with low performers is a notoriously hard but important part of the game. Especially within large companies, there is always the temptation to just let it slide: sideline a struggling person and wait for them to either get over their issues or leave. But this sends an awful message to the rest of the team; for better or worse, fairness is important to most. Simply firing the low performers is seldom the best solution, though; successful recovery cases are what sets great managers apart from the average ones.

Oh, one more thought: people in leadership roles have their allegiance divided between the company and the people who depend on them. The obligation to the company is more formal, but the impact you have on your team is longer-lasting and more intimate. When the obligations to the employer and to your team collide in some way, make sure you can make the right call; it might be one of the most consequential decisions you’ll ever make.

Amazon Lightsail Update – Launch and Manage Windows Virtual Private Servers

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-lightsail-update-launch-and-manage-windows-virtual-private-servers/

I first told you about Amazon Lightsail last year in my blog post, Amazon Lightsail – the Power of AWS, the Simplicity of a VPS. Since last year’s launch, thousands of customers have used Lightsail to get started with AWS, launching Linux-based Virtual Private Servers.

Today we are adding support for Windows-based Virtual Private Servers. You can launch a VPS that runs Windows Server 2012 R2, Windows Server 2016, or Windows Server 2016 with SQL Server 2016 Express and be up and running in minutes. You can use your VPS to build, test, and deploy .NET or Windows applications without having to set up or run any infrastructure. Backups, DNS management, and operational metrics are all accessible with a click or two.

Servers are available in five sizes, with 512 MB to 8 GB of RAM, 1 or 2 vCPUs, and up to 80 GB of SSD storage. Prices (including software licenses) start at $10 per month:

You can try out a 512 MB server for one month (up to 750 hours) at no charge.

Launching a Windows VPS
To launch a Windows VPS, log in to Lightsail, click on Create instance, and select the Microsoft Windows platform. Then click on Apps + OS if you want to run SQL Server 2016 Express, or OS Only if Windows is all you need:

If you want to use a PowerShell script to customize your instance after it launches for the first time, click on Add launch script and enter the script:

Choose your instance plan, enter a name for your instance(s), and select the quantity to be launched, then click on Create:

Your instance will be up and running within a minute or so:

Click on the instance, and then click on Connect using RDP:

This will connect using a built-in, browser-based RDP client (you can also use the IP address and the credentials with another client):
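The console walkthrough above can also be driven from the AWS CLI. Here’s a rough sketch, assuming a configured CLI; the blueprint and bundle IDs are illustrative placeholders, so list the real ones first:

# Discover the Windows blueprints and the bundle sizes on offer
aws lightsail get-blueprints --query 'blueprints[?platform==`WINDOWS`].blueprintId'
aws lightsail get-bundles

# Launch a Windows Server instance with a launch script
aws lightsail create-instances \
    --instance-names win-vps-1 \
    --availability-zone us-east-1a \
    --blueprint-id windows_server_2016 \
    --bundle-id small_win_1_0 \
    --user-data 'New-Item C:\launched.txt'

The --user-data string corresponds to the Add launch script box in the console; my understanding is that it is run as PowerShell on Windows-based instances, but treat that detail as an assumption and check the docs for your region.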

Available Today
This feature is available today in the US East (Northern Virginia), US East (Ohio), US West (Oregon), EU (London), EU (Ireland), EU (Frankfurt), Asia Pacific (Singapore), Asia Pacific (Mumbai), Asia Pacific (Sydney), and Asia Pacific (Tokyo) Regions.

Jeff;


How Aussie ecommerce stores can compete with the retail giant Amazon

Post Syndicated from chris desantis original https://www.anchor.com.au/blog/2017/08/aussie-ecommerce-stores-vs-amazon/

The powerhouse Amazon retail store is set to launch in Australia toward the end of 2018 and Aussie ecommerce retailers need to ready themselves for the competition storm ahead.

2018 may seem a while away, but getting your ecommerce site into tip-top shape and ready to compete can take time. Check out these helpful hints from the Anchor crew.

Speed kills

If you’ve ever heard the tale of the tortoise and the hare, you’ll know the moral is that “slow and steady wins the race”. This is definitely not the place for that phrase, because if your site loads as slowly as a 1995 dial-up connection, your ecommerce store will not, I repeat, will not win the race.

Site speed can be impacted by a number of factors, and the trick is getting the balance right between a site that loads at lightning speed and one that delivers engaging content to your audience. There are many ways to check the performance of your site, including Anchor’s free hosting check up or Pingdom.

Taking action on these factors can significantly boost the performance of your site.

Here’s an interesting blog from the WebCEO team about the impact of site speed on on-page conversion rates, or check out our previous blog on maximising site performance.

Show me the money

As an ecommerce store, getting credit card details as fast as possible is probably at the top of your list, but it’s important to remember that it’s an actual person that needs to hand over the details.

Consider the customer’s experience whilst checking out. Making people log in to their account before checkout can lead to abandoned carts as customers try to remember the vital details. Similarly, making a customer enter all their details before displaying shipping costs is more of an annoyance than a benefit.

Built for growth

Before you blast out a promo email to your entire database or spend up big on PPC, consider what happens when a fivefold increase in traffic all jumps onto your site at around the same time.

Will your site come screeching to a sudden halt with a 504 or 408 error message, or ride high on the wave of increased traffic? If you have fixed infrastructure such as a dedicated server, or are utilising a VPS, then consider the maximum concurrent users that your site can handle.

Consider this. Amazon.com.au will be built on the scalable cloud infrastructure of Amazon Web Services and will utilise all the microservices and data mining technology to offer customers a seamless, personalised shopping experience. How will your business compete?

Search ready

Being found online is important for any business, but for ecommerce sites, it’s essential. Gaining results from SEO practices can take time so beware of ‘quick fix guarantees’ from outsourced agencies.

Search Engine Optimisation (SEO) practices can have lasting effects. Good practices can ensure your site is found via organic search without huge advertising budgets, on the other hand ‘black hat’ practices can push your ecommerce store into search oblivion.

SEO takes discipline and focus to get right. Here are some of our favourite hints for SEO greatness from those who live and breathe SEO:

  • Optimise your site for mobile
  • Use Meta Tags wisely
  • Leverage Descriptive alt tags and image file names
  • Create content for people, not bots (keyword stuffing is a no-no!)

SEO best practices are continually evolving, but the fundamentals don’t change: create a site that is designed to give users a great experience and the content they expect to find.

Google My Business is a free service that EVERY business should take advantage of. It is a listing service where your business can provide details such as address, phone number, website, and trading hours. It’s easy to update and manage, you can add photos, a physical address (if applicable), and display shopper reviews.

Get your site ship shape

Overwhelmed by these starter tips? If you are ready to get your site into tip-top shape, get in touch. We work with awesome partners like eWave who can help create a seamless online shopping experience.



Amazon Lightsail Update – 9 More Regions and Global Console

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-lightsail-update-9-more-regions-and-global-console/

Amazon Lightsail lets you launch Virtual Private Servers on AWS with just a few clicks. With prices starting at $5 per month, Lightsail takes care of the heavy lifting and gives you a simple way to build and host applications. As I showed you in my re:Invent post (Amazon Lightsail – The Power of AWS, the Simplicity of a VPS), you can choose a configuration from a menu and launch a virtual machine preconfigured with SSD-based storage, DNS management, and a static IP address.

Since we launched in November, many customers have used Lightsail to launch Virtual Private Servers. For example, Monash University is using Amazon Lightsail to rapidly re-platform a number of CMS services in a simple and cost-effective manner. They have already migrated 50 workloads and are now thinking of creating an internal CMS service based on Lightsail to allow staff and students to create their own CMS instances in a self-service manner.

Today we are expanding Lightsail into nine more AWS Regions and launching a new, global console.

New Regions
At re:Invent we made Lightsail available in the US East (Northern Virginia) Region. Earlier this month we added support for several additional Regions in the US and Europe. Today we are launching Lightsail in four of our Asia Pacific Regions, bringing the total to ten. Here’s the full list:

  • US East (Northern Virginia)
  • US West (Oregon)
  • US East (Ohio)
  • EU (London)
  • EU (Frankfurt)
  • EU (Ireland)
  • Asia Pacific (Mumbai)
  • Asia Pacific (Tokyo)
  • Asia Pacific (Singapore)
  • Asia Pacific (Sydney)

Global Console
The updated Lightsail console makes it easy for you to create and manage resources in one or more Regions. I simply choose the desired Region when I create a new instance:

I can see all of my instances and static IP addresses on the same page, no matter what Region they are in:

And I can perform searches that span all of my resources and Regions. All of my LAMP stacks:

Or all of my resources in the EU (Ireland) Region:

I can perform a similar search on the Snapshots tab:

A new DNS zones tab lets me see my existing zones and create new ones:

Creation of SSH keypairs is now specific to a Region:

I can manage my key pairs on a Region-by-Region basis:

Static IP addresses are also specific to a particular Region:

Available Now
You can use the new Lightsail console and create resources in all ten Regions today!

Jeff;


News from the AWS Summit in Berlin – 3rd AZ & Lightsail in Frankfurt and Another Polly Voice

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/news-from-the-aws-summit-in-berlin-3rd-az-lightsail-in-frankfurt-and-another-polly-voice/

We launched the AWS Region in Frankfurt in the fall of 2014 and opened the AWS Marketplace for the Region the next year.

Our customers in Germany come in all shapes and sizes: startups, mid-market, enterprise, and public sector. These customers have made great use of the new Region, building and running applications and businesses that serve Germany, Europe, and more. They rely on the broad collection of security features, certifications, and assurances provided by AWS to help protect and secure their customer data, in accord with internal and legal requirements and regulations. Our customers in Germany also take advantage of the sales, support, and architecture resources and expertise located in Berlin, Dresden, and Munich.

The AWS Summit in Berlin is taking place today and we made some important announcements from the stage. Here’s a summary:

  • Third Availability Zone in Frankfurt
  • Amazon Lightsail in Frankfurt
  • New voice for Amazon Polly

Third Availability Zone in Frankfurt
We will be opening an additional Availability Zone (AZ) in the EU (Frankfurt) Region in mid-2017 in response to the continued growth in the use of AWS. This brings us up to 43 Availability Zones within 16 geographic Regions around the world. We are also planning to open five Availability Zones in new AWS Regions in France and China later this year (see the AWS Global Infrastructure maps for more information).

AWS customers in Germany are already making plans to take advantage of the new AZ. For example:

Siemens expects to gain additional flexibility by mirroring their services across all of the AZs. It will also allow them to store all of their data in Germany.

Zalando will do the same, mirroring their services across all of the AZs and looking ahead to moving more applications to the cloud.

Amazon Lightsail in Frankfurt
Amazon Lightsail lets you launch a virtual machine preconfigured with SSD storage, DNS management, and a static IP address in a matter of minutes (read Amazon Lightsail – The Power of AWS, the Simplicity of a VPS to learn more).

Amazon Lightsail is now available in the EU (Frankfurt) Region and you can start using it today. This allows you to use it to host applications that are required to store customer data or other sensitive information in Germany.

New Voice for Amazon Polly
Polly gives you high-quality, natural-sounding male and female speech in multiple languages. Today we are adding another German-speaking female voice to Polly, bringing the total number of voices to 48.

Like the German voice of Alexa, Vicki (the new voice) is fluent and natural. Vicki is able to fluently and intelligently pronounce the Anglicisms frequently used in German texts, including the fully inflected versions. To get started with Polly, open up the Polly Console or read the Polly Documentation.
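If you’d like to hear Vicki without opening the console, the CLI can synthesize a quick sample. A minimal sketch, assuming a configured AWS CLI; the sentence deliberately mixes in a few Anglicisms:

# Synthesize a German sample with the new Vicki voice
aws polly synthesize-speech \
    --voice-id Vicki \
    --output-format mp3 \
    --text "Das Onboarding für das Startup findet im Coworking-Space statt." \
    hallo.mp3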

I’m looking forward to hearing more about the continued growth and success of our customers in and around Germany!

Jeff;

How Important is Hosting Location? Questions to Ask Your Hosting Provider

Post Syndicated from Sarah Wilson original https://www.anchor.com.au/blog/2017/05/site-hosting-location/

The importance of the location of your hosting partner will depend on your organisation’s requirements and your business needs. For a small business focussed on a single country with fairly low traffic, starting with shared hosting or a small virtual private server is generally the most cost-effective place to host your site. Hosting your site or web application in the same geographical location as your website visitors reduces page load times and latency (the lag between requesting data and receiving it) and will greatly improve user experience. If you are running a business-critical website or ecommerce application, or have customers or visitors from various global locations, then incorrectly placing your website in an unsuitably located data centre will cost you much more than a monthly hosting fee!

Does the Location of My Hosting Provider Matter?

Location also matters when it comes to service. When selecting a hosting provider, you should be aware of their usual operating hours and ensure these fit with your company’s requirements, and timezone! Why does this matter? You can’t choose when an unexpected outage of your site happens, so if your hosting provider is based in a different time zone with limited service hours, there may be no one to help you. Here at Anchor, we’ve implemented a “follow the sun” support model around our 8am-6pm AEST operating hours. This means that wherever you are in the world, when you have a problem you can pick up the phone and our friendly support team will be on hand to help.

But Why Do I Care Where the Data Centre is?

Data centres require state of the art cooling and power in order to keep the servers and hardware in perfect condition. Redundancy in the network is also a vital part of infrastructure, i.e. if something in the network fails, there is backup infrastructure, power or cooling so it can operate as normal. The data centre that Anchor uses for our shared hosting and VPS is brand new, with 24/7 security and state of the art technologies, located in Sydney.

On the flip side of this location conundrum is how you may serve customers in locations outside of Australia, which can now be achieved via a public cloud offering such as Amazon Web Services (AWS). With AWS, your business can leverage a network of data centres around the globe so you can serve your customers and audiences wherever they are located! For example, if your business operates in Australia and you have customers in Singapore, the United Kingdom and the USA, you can now have your application deployed in all four data centres so all of your customer sessions can be routed to the closest server.

Anchor provides fully managed hosting in our own Sydney based data centre, or on public clouds such as AWS. Get in touch to find out more about our managed hosting services.


Maximising site performance: 5 key considerations

Post Syndicated from Davy Jones original https://www.anchor.com.au/blog/2017/03/maximising-site-performance-key-considerations/

The ongoing performance of your website or application is an area where ‘not my problem’ can be a recurring sentiment from all stakeholders.  It’s not just a case of getting your shiny new website or application onto the biggest, spec-ed-up, dedicated server or cloud instance that money can buy because there are many factors that can influence the performance of your website that you, yes you, need to make friends with.

The relationship between site performance and business outcomes

Websites have evolved into web applications, starting out as simple text in HTML format and growing into complex, ‘rich’ multimedia content requiring buckets of storage and computing power. Your server needs to run complex scripts and processes, and serve up content to global visitors because, let’s face it, you probably have customers everywhere (or at least have plans to achieve a global customer base). It is a truth universally acknowledged that the performance of your website is directly related to customer experience, so underestimating the impact of poor site performance will negatively affect your brand reputation, sales revenue and business outcomes, jeopardising your business’ success.

Site performance stakeholders

There is an increasing range of literature around the growing importance of optimising site performance for maximum customer experience but who is responsible for owning the customer site experience? Is it the marketing team, development team, digital agency or your hosting provider? The short answer is that all of the stakeholders can either directly or indirectly impact your site performance.

Let’s explore this shared responsibility in more detail by breaking it down into five areas that affect a website’s performance.

5 key site performance considerations

In order to truly appreciate the performance of your website or application, you must take into consideration 5 key areas that affect your website’s ability to run at maximum performance:

  1. Site Speed
  2. Reliability and availability
  3. Code Efficiency
  4. Scalability
  5. Development Methodology

1. Site Speed

Site speed is the most critical metric. We all know and have experienced the frustration of “this site is slow, it takes too long to load!”. It’s the main (and sometimes, only) metric that most people would think about when it comes to the performance of a web application.

But what does it mean for a site to be slow? Well, it usually comes down to these factors:

a. The time it takes for the server to respond to a visitor requesting a page.
b. The time it takes to download all necessary content to display the website.
c.  The time it takes for your browser to load and display all the content.

Usually, the hosting provider looks after (a), and the developers look after (b) and (c), as those points are directly related to the web application.
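If you want a quick read on (a) and (b) for your own site, curl’s built-in timing variables are a handy first check (a sketch; substitute your own URL):

# TTFB approximates (a); the gap between TTFB and total approximates (b)
curl -o /dev/null -s -w 'DNS: %{time_namelookup}s  TTFB: %{time_starttransfer}s  total: %{time_total}s\n' https://www.example.com/

Note that (c) only shows up in browser developer tools, since curl doesn’t render anything.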

2. Reliability and availability

Reliability and availability go hand-in-hand.

There’s no point in having a fast website if it’s not *reliably* fast. What do we mean by that?

Well, would you be happy if your website was only fast sometimes? If your Magento retail store is lightning fast when you are the only one using it, but becomes unresponsive during a sale, then the service isn’t performing up to scratch. The hosting provider has to provide you with a service that stays up, and can withstand the traffic going to it.

Outages are also inevitable, as 100% uptime is a myth. But with some clever infrastructure design, we can keep downtime as close to zero as possible! Here at Anchor, our services are built with availability in mind. If your service is inaccessible, then it’s not reliable.

Our multitude of hosting options on offer such as VPS, dedicated and cloud are designed specifically for your needs. Proactive and reactive support, and hands-on management means your server stays reliable and available.

We know some businesses are concerned about the very public outage of AWS in the US recently, however AWS have taken action across all regions to prevent this from occurring again. AWS’s detailed response can be found at S3 Service Disruption in the Northern Virginia (US-EAST-1) Region.

As an advanced consulting partner with Amazon Web Services (AWS), we can guide customers through the many AWS configurations that will deliver the reliability required.  Considerations include utilising multiple availability zones, read-only replicas, automatic backups, and disaster recovery options such as warm standby.  

3. Code Efficiency

Let’s talk about efficiency of a codebase, that’s the innards of the application.

The code of an application determines how hard the CPU (the brain of your computer) has to work to process all the things the application wants to be able to do. The more work your application performs, the harder the CPU has to work to keep up.

In short, you want code to be efficient, and not have to do extra, unnecessary work. Here is a quick example:

# Example 1:    2 + 2 = 4

# Example 2:    ( ( ( 1 + 5 ) / 3 ) * 1 ) + 2 = 4

The end result is the same, but the first example gets straight to the point. It’s much easier to understand and faster to process. Efficient code means the server is able to do more with the same amount of resources, and most of the time it would also be faster!

We work with many code-efficient partners who create awesome sites that drive conversions. Get in touch if you’re looking for a code-efficient developer; we’d be happy to suggest one of our tried and tested partners.

4. Scalability

Accurately predicting the spikes in traffic to your website or application is tricky business.  Over or under-provisioning of infrastructure can be costly, so ensuring that your build has the potential to scale can help your website or application to optimally perform at all times.  Scaling up involves adding more resources to the current systems. Scaling out involves adding more nodes. Both have their advantages and disadvantages. If you want to know more, feel free to talk to any member of our sales team to get started.

If you are using a public cloud infrastructure like Amazon Web Services (AWS), there are several ways that scalability can be built into your infrastructure from the start. Clusters are at the heart of scalability, and there are a number of tools that can optimise your cluster efficiency, such as Amazon CloudWatch, which can trigger scaling activities, and Elastic Load Balancing, which directs traffic to the various clusters within your auto scaling group. For developers wanting complete control over AWS resources, Elastic Beanstalk may be more appropriate.
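To give a flavour of the tooling, here’s a hedged sketch of a simple scaling policy plus the CloudWatch alarm that triggers it; the group, policy, and alarm names are placeholders:

# Add two instances to the group when the policy fires
aws autoscaling put-scaling-policy \
    --auto-scaling-group-name my-asg \
    --policy-name scale-out \
    --adjustment-type ChangeInCapacity \
    --scaling-adjustment 2

# Fire the policy when average CPU stays above 70% for two periods
# (put-scaling-policy prints the policy ARN to substitute below)
aws cloudwatch put-metric-alarm \
    --alarm-name asg-high-cpu \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=AutoScalingGroupName,Value=my-asg \
    --statistic Average --period 300 --evaluation-periods 2 \
    --threshold 70 --comparison-operator GreaterThanThreshold \
    --alarm-actions <policy-arn>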

5. Development Methodology

Development methodologies describe the process of what needs to happen in order to introduce changes to software. A commonly used methodology nowadays is the ‘DevOps’ methodology.

What is DevOps?

It’s the union of Developers and IT Operations teams working together to achieve a common goal.

How can it improve your site’s performance?

Well, DevOps is a way of working, a culture that introduces close collaboration between the two teams of Developers and IT Operations in a single workflow.   By integrating these teams the process of creating, testing and deploying software applications can be streamlined. Instead of each team working in a silo, cross-functional teams work together to efficiently solve problems to get to a stable release faster. Faster releases mean that your website or application gets updates more frequently and updating your application more frequently means you are faster to fix bugs and introduce new features. Check out this article ‘5 steps to prevent your website getting hacked‘ for more details. 

The point is the faster you can update your applications the faster it is for you to respond to any changes in your situation.  So if DevOps has the potential to speed up delivery and improve your site or application performance, why isn’t everyone doing it?

Simply put, any change can be hard. And for a DevOps approach to be effective, each team involved needs to find new ways of working harmoniously with other teams toward a common goal. It’s not just a process change that is needed, toolsets, communication and company culture also need to be addressed.

The Anchor team love putting new tools through their paces.  We love to experiment and iterate on our processes in order to find one that works with our customers. We are experienced in working with a variety of teams, and love to challenge ourselves. If you are looking for an operations team to work with your development team, get in touch.

***
If your site is running slow or you are experiencing downtime, we can run a free hosting check up on your site and highlight the ‘quick wins’ on your site to boost performance.


About that Giuliani website…

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/01/about-that-giuliani-website.html

Rumors are that Trump is making Rudy Giuliani some sort of “cyberczar” in the new administration. Therefore, many in the cybersecurity community scanned his website “www.giulianisecurity.com” to see if it was actually secure from hackers. The results have been laughable, with out-of-date software, bad encryption, unnecessary services, and so on.

But here’s the deal: it’s not his website. He just contracted with some generic web designer to put up a simple page with just some basic content. It’s there only because people expect if you have a business, you also have a website.
That website designer in turn contracted some basic VPS hosting service from Verio. It’s a service Verio exited around March of 2016, judging by the archived page.
The Verio service promised “security-hardened server software” that they “continually update and patch”. According to the security scans, this is a lie, as the software is all woefully out-of-date. According to the OS fingerprint, the FreeBSD image it uses is 10 years old. The security is exactly what you’d expect from a legacy hosting company that’s shut down some old business.
You can probably break into Giuliani’s server. I know this because other FreeBSD servers in the same data center have already been broken into, tagged by hackers, or are now serving viruses.
But that doesn’t matter. There’s nothing on Giuliani’s server worth hacking. The drama over his security, while an amazing joke, is actually meaningless. All this tells us is that Verio/NTT.net is a crappy hosting provider, not that Giuliani has done anything wrong.

Amazon Lightsail – The Power of AWS, the Simplicity of a VPS

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-lightsail-the-power-of-aws-the-simplicity-of-a-vps/

Some people like to assemble complex systems (houses, computers, or furniture) from parts. They relish the planning process, carefully researching each part and selecting those that give them the desired balance of power and flexibility. With planning out of the way, they enjoy the process of assembling the parts into a finished unit. Other people do not find this do-it-yourself (DIY) approach attractive or worthwhile, and are simply interested in getting to the results as quickly as possible without having to make too many decisions along the way.

Sound familiar?

I believe that this model applies to systems architecture and system building as well. Sometimes you want to take the time to hand-select individual AWS components (servers, storage, IP addresses, and so forth) and put them together on your own. At other times you simply need a system that is preconfigured and preassembled, and is ready to run your web applications with no system-building effort on your part.

In many cases, those seeking a preassembled system turned to a Virtual Private Server, or VPS. With a VPS, you are presented with a handful of options, each ready to run, and available to you for a predictable monthly fee.

While the VPS is a perfect getting-started vehicle, over time the environment can become constrained. At a certain point you may need to step outside the boundaries of the available plans as your needs grow, only to find that you have no options for incremental improvement, and are faced with the need to make a disruptive change. Or, you may find that your options for automated scaling or failover are limited, and that you need to set it all up yourself.

Introducing Amazon Lightsail
Today we are launching Amazon Lightsail. With a couple of clicks you can choose a configuration from a menu and launch a virtual machine preconfigured with SSD-based storage, DNS management, and a static IP address. You can launch your favorite operating system (Amazon Linux AMI, Ubuntu, CentOS, FreeBSD, or Debian), developer stack (LAMP, LEMP, MEAN, or Node.js), or application (Drupal, Joomla, Redmine, GitLab, and many others), with flat-rate pricing plans that start at $5 per month including a generous allowance for data transfer.

Here are the plans and the configurations:

You get the simplicity of a VPS, backed by the power, reliability, and security of AWS. As your needs grow, you will have the ability to smoothly step outside of the initial boundaries and connect to additional AWS database, messaging, and content distribution services.

All in all, Lightsail is the easiest way for you to get started on AWS and jumpstart your cloud projects, while giving you a smooth, clear path into the future.

A Quick Tour
Let’s take a quick tour of Amazon Lightsail! Each page of the Lightsail console includes a Quick Assist tab. You can click on it at any time in order to access context-sensitive documentation that will help you to get the most out of Lightsail:

I start at the main page. I have no instances or other resources at first:

I click on Create Instance to get moving. I choose my machine image (an App and an OS, or simply an OS), pick an instance plan, and give my instance a name, all on one page:

I can launch multiple instances, set up a configuration script, or specify an alternate SSH keypair if I’d like. I can also choose an Availability Zone. I’ll choose WordPress on the $10 plan, leave everything else as-is, and click on Create. It is up and running within seconds:

I can manage the instance by clicking on it:

My instance has a public IP address that I can open in my browser. WordPress is already installed, configured, and running:

I’ll need the WordPress password in order to finish setting it up. I click on Connect using SSH on the instance management page and I’m connected via a browser-based SSH terminal window without having to do any key management or install any browser plugins. The WordPress admin password is stored in the file bitnami_application_password in the ~bitnami directory (the image below shows a made-up password):
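From that SSH session, fetching the password is a one-liner (the path is exactly as described above):

# Print the generated WordPress admin password
cat ~bitnami/bitnami_application_password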

You can bookmark the terminal window in order to be able to access it later with just a click or two.

I can manage my instance from the menu bar:

For example, I can access the performance metrics for my instance:

And I can manage my firewall settings:

I can capture the state of my instance by taking a Snapshot:

Later, I can restore the snapshot to a fresh instance:

I can also create static IP addresses and make use of domain names:

Advanced Lightsail – APIs and VPC Peering
Before I wrap up, let’s talk about a few of the more advanced features of Amazon Lightsail – APIs and VPC Peering.

As is almost always the case with AWS, there’s a full set of APIs behind all of the console functionality that we just reviewed. Here are just a few of the more interesting functions:

  • GetBundles – Get a list of the bundles (machine configurations).
  • CreateInstances – Create one or more Lightsail instances.
  • GetInstances – Get a list of all Lightsail instances.
  • GetInstance – Get information about a specific instance.
  • CreateInstanceSnapshot – Create a snapshot of an instance.
  • CreateInstanceFromSnapshot – Create an instance from a snapshot.
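These functions map directly onto AWS CLI subcommands, which makes the snapshot workflow easy to script. A quick sketch; the instance and snapshot names are placeholders, and you’d pick a real bundle ID from the get-bundles output:

# Enumerate machine configurations, snapshot an instance, then clone it
aws lightsail get-bundles
aws lightsail create-instance-snapshot \
    --instance-name web-1 \
    --instance-snapshot-name web-1-snap
aws lightsail create-instances-from-snapshot \
    --instance-snapshot-name web-1-snap \
    --instance-names web-2 \
    --availability-zone us-east-1a \
    --bundle-id nano_1_0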

All of the Lightsail instances within an account run within a “shadow” VPC that is not visible in the AWS Management Console. If the code that you are running on your Lightsail instances needs access to other AWS resources, you can set up VPC peering between the shadow VPC and another one in your account, and create the resources therein. Click on Account (top right), scroll down to Advanced features, and check VPC peering:

You can now connect your Lightsail apps to other AWS resources that are running within a VPC.

Pricing and Availability
We are launching Amazon Lightsail today in the US East (Northern Virginia) Region, and plan to expand it to other regions in the near future.

Prices start at $5 per month.

Jeff;

Load Balancing and B2 Cloud Storage

Post Syndicated from Elliott Sims original https://www.backblaze.com/blog/load-balancing-and-b2-cloud-storage/

Load Balancer
A few months ago we announced Backblaze B2, our cloud storage product. Cloud storage presents different challenges versus cloud backup in the server environment. Load balancing is one such issue. Let’s take a look at the challenges that have come up, and our solutions.
A load balancer is a server or specialized device that distributes load among the servers that actually do work and reply to requests. In addition to allowing the work to be distributed among several servers, it also makes sure that requests only get sent to healthy servers that are prepared to handle them.
For Backblaze Personal Backup and Backblaze Business Backup products, we’ve had it easy in terms of load balancing: we wrote the client, so we can just make it smart enough to ask us which server to talk to. No separate “load balancer” needed. The Backblaze cloud backup products are also very tolerant of short outages, since the client will just upload the files slightly later.
For B2, though, the clients are web browsers or programming language libraries like libcurl. They just make a single request for a file and expect an answer immediately. This means we need a load balancer both to distribute the load and allow us to take individual servers offline to update them.
Option 1: Layer 7, full proxy
The simplest and most flexible way to do load balancing is to have a pair of hosts, one active and one standby, that accept HTTPS connections from the client and create new connections to the server, then proxy the traffic back and forth between the two. This is usually referred to as “layer 7 full proxy” load balancing.
This doesn’t generally require any special setup on the server, other than perhaps making sure it understands the x-forwarded-for header so it knows the actual client’s IP address. It does have one big downside, though: the load balancer has to have enough bandwidth to handle every request and response in both directions, and enough CPU to handle TCP and SSL in both directions. Modern processors with AES-NI – onboard AES encryption and decryption – help a lot with this, but it can still quickly become a performance bottleneck when you’re talking about transferring large files at 1Gbps or higher.
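As a concrete illustration, a minimal HAProxy-style layer 7 configuration looks something like the sketch below. This is not Backblaze’s actual config; the addresses and certificate path are placeholders, and re-encrypting to the back ends keeps traffic encrypted end to end:

frontend www
    bind :443 ssl crt /etc/haproxy/site.pem
    mode http
    option forwardfor                  # inject x-forwarded-for with the client IP
    default_backend app_servers

backend app_servers
    mode http
    balance roundrobin
    server app1 10.0.0.11:443 ssl verify none check
    server app2 10.0.0.12:443 ssl verify none check

A layer 4 variant of the same shape would use “mode tcp” and skip the TLS handling entirely, at the cost of losing x-forwarded-for, as described next.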
Option 2: Layer 4, full proxy
Another option, if you want to reduce the burden on the load balancers, is layer 4 load balancing. The load balancers accept TCP connections from the client and create a new TCP session to the server, but they proxy through the HTTPS traffic inside the TCP session without decrypting or re-encrypting it. This still requires that the load balancer have enough bandwidth to handle all your traffic, but a lot less CPU compared to layer 7. Unfortunately, it also means that your servers don’t really have a good way to see the original client’s IP address short of hijacking the TCP Options field with a proprietary extension.
Option 3: DSR
All of this is adding a lot of work layered on top of a load balancer’s basic purpose: to distribute client requests among multiple healthy back-end servers. To do this, the load balancer only needs to see the request and modify the destination at the outermost layers. No need to parse all the way to layer 7, and no need to even see the response. This is generally called DSR (Direct Server Return).
Especially when serving large files with SSL, DSR requires minimal amounts of bandwidth and CPU power on the load balancer. Because the source IP address is unchanged, the server can see the original client’s IP without even needing an x-forwarded-for header. This does have a few tradeoffs, though: it requires a fairly complex setup not only on the load balancers, but also on the individual servers. In full-proxy modes the load balancer can intercept bad responses and retry the request on a different back-end server or display a friendlier error message to the client, but since the response bypasses the load balancer in DSR mode this isn’t possible. This also makes health-checking tricky because there’s no path for responses from the back-end host to the load balancer.
After some testing, we ended up settling on DSR. Although it’s a lot more complicated to set up and maintain, it allows us to handle large amounts of traffic with minimal hardware. It also makes it easy to fulfill our goal of keeping user traffic encrypted even within our datacenter.
How does it work?
DSR load balancing requires two things:

  • A load balancer with the VIP address attached to an external NIC and ARPing, so that the rest of the network knows it “owns” the IP.
  • Two or more servers on the same layer 2 network that also have the VIP address attached to a NIC, either internal or external, but are *not* replying to ARP requests about that address. This means that no other servers on the network know that the VIP exists anywhere but on the load balancer.

A request packet will enter the network and be routed to the load balancer. Once it arrives there, the load balancer leaves the source and destination IP addresses intact and instead modifies the destination MAC address to that of a server, then puts the packet back on the network. The network switch only understands MAC addresses, so it forwards the packet on to the correct server.
[Network diagram: the load balancer rewrites only the destination MAC address; the server replies directly to the client]
When the packet arrives at the server’s network interface, it checks to make sure the destination MAC address matches its own. It does, so it accepts the packet. It then, separately, checks to see whether the destination IP address is one attached to it somehow. It is, even though the rest of the network doesn’t know it, so it accepts the packet and passes it on to the application. The application then sends a response with the VIP as the source IP address and the client as the destination IP, so it’s routed directly to the client without passing back through the load balancer.
How do I set it up?
DSR setup is very specific to each individual network setup, but we’ll try to provide enough information that this can be adapted to most cases. The simplest way is probably to just pay a vendor like F5, A10, or Kemp to handle it. You’ll still need the complex setup on the individual hosts, though, and the commercial options tend to be pretty pricey. We also tend to prefer open-source over black-box solutions, since they’re more flexible and debuggable.
HAProxy and likely other applications can do DSR, but we ended up using IPVS (formerly known as LVS). The core packet routing of IPVS is actually part of the Linux kernel, and then various user-space utilities are used for health checks and other management. For user-space management, there’s a number of other good options like Keepalived, Ldirectord, Piranha, and Google’s recently-released Seesaw. We ended up choosing Keepalived because we also wanted VRRP support for failing over between load balancers, and because it’s both simple and stable/mature.
Setting up IPVS and Keepalived
Good news! If your kernel is 2.6.10 or newer (and it almost certainly is), IPVS is already included. If /proc/net/ip_vs exists, it’s already loaded. If not, “modprobe ip_vs” will load the module. Most distributions will probably compile it as a kernel module, but your results with VPS providers may vary. At this point, you’ll probably also want to install the ipvsadm utility so you can manually inspect and modify the IPVS config.
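In shell terms, the check looks something like this:

# Is IPVS loaded? If not, load the module (needs root), then list services
test -e /proc/net/ip_vs || modprobe ip_vs
ipvsadm -L -n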
The Keepalived load-balancing config is fairly straightforward: a virtual_server section with a real_server section inside of it for each back-end server. Most of the rest depends on your specific needs, but you’ll want to set lb_kind to “DR”. You can use SSL_GET as a simple health checker, but we use “MISC_CHECK” with a custom script, which lets us stop sending new traffic to a server that’s shutting down by setting its weight to 0.
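As a rough sketch (this is not our production config; the addresses, ports, and check script are placeholders), the shape of it is:

virtual_server 203.0.113.10 443 {
    delay_loop 10
    lb_algo wlc            # weighted least-connection scheduling
    lb_kind DR             # direct server return
    protocol TCP

    real_server 10.0.0.11 443 {
        weight 100
        MISC_CHECK {
            misc_path "/usr/local/bin/check_backend.sh 10.0.0.11"
            misc_timeout 10
            misc_dynamic   # let the script's exit code adjust the weight, e.g. to 0 for draining
        }
    }

    real_server 10.0.0.12 443 {
        weight 100
        MISC_CHECK {
            misc_path "/usr/local/bin/check_backend.sh 10.0.0.12"
            misc_timeout 10
            misc_dynamic
        }
    }
}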
The host config is where things get a bit more complicated. The important part is that the VIP address is assigned to an interface, but the server isn’t sending ARP replies about it. There’s a few ways to do this that work about equally well, but we use arptables rules defined in /etc/network/interfaces:

pre-up /sbin/arptables -I INPUT -j DROP -d <VIPADDRESS>
pre-up /sbin/arptables -I OUTPUT -j mangle -s <VIPADDRESS> --mangle-ip-s <SERVERIPADDRESS>
pre-down /sbin/arptables -D INPUT -j DROP -d <VIPADDRESS>
pre-down /sbin/arptables -D OUTPUT -j mangle -s <VIPADDRESS> --mangle-ip-s <SERVERIPADDRESS>

Once the arptables rules are in place, you’ll want to add the actual address to the interface:

post-up /sbin/ip addr add <VIPADDRESS>/32 dev $LOGICAL
pre-down /sbin/ip addr del <VIPADDRESS>/32 dev $LOGICAL

If your backend server doesn’t have an actual external IP and normally talks to the outside via NAT, you will need to create a source-based route, also in the interfaces config:

post-up /sbin/ip rule add from <VIPADDRESS> lookup 200
pre-down /sbin/ip rule del from <VIPADDRESS> lookup 200
post-up /sbin/ip route add default via <GATEWAYADDRESS> dev $LOGICAL table 200
pre-down /sbin/ip route del default via <VIPADDRESS> dev $LOGICAL table 200

Finally, make sure your webserver (or other daemon) is listening on <VIPADDRESS> specifically, or on all addresses (0.0.0.0 or :::).
It’s not working!

  • First, make sure the load balancer and server are on the same layer-2 network. If they’re on different subnets, none of this will work.
  • Check the value of /proc/sys/net/ipv4/conf/*/rp_filter and make sure it’s not set to 1 anywhere.
  • Run “tcpdump -e -n host <VIPADDRESS>” on the load balancer and make sure that the requests are reaching the load balancer with a destination IP of <VIPADDRESS> and a destination MAC address belonging to the load balancer, then leaving again with the same source and destination IP but the MAC address of the back-end server.
  • Run “ipvsadm” on the load balancer and make sure IPVS is configured with your virtual server and real servers.
  • Run “tcpdump -e -n host <VIPADDRESS>” on the server and make sure that requests are arriving with a destination IP of <VIPADDRESS>, and leaving again with a source of <VIPADDRESS> and a destination of the client’s IP address.
  • On the server, run “ip route get <CLIENTADDRESS> from <VIPADDRESS>” to make sure the host has a non-NATTED return route to the outside.
  • On the server, run “ip neigh show <GATEWAYADDRESS>”, where <GATEWAYADDRESS> is the “via” IP from the previous “ip route get” command, to make sure the host knows how to reach the gateway.


New Name, New Home for the Let’s Encrypt Client

Post Syndicated from Let's Encrypt - Free SSL/TLS Certificates original https://letsencrypt.org//2016/03/09/le-client-new-home.html

Over the next few months the Let’s Encrypt client will transition to a new name (soon to be announced), and a new home at the Electronic Frontier Foundation (EFF).

The goal of Let’s Encrypt is to make turning on HTTPS as easy as possible. To accomplish that, it’s not enough to fully automate certificate issuance on the certificate authority (CA) side – we have to fully automate on the client side as well. The Let’s Encrypt client is now being used by hundreds of thousands of websites and we expect it to continue to be a popular choice for sites that are run from a single server or VPS.
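On a typical single server or VPS, a run of the client looks roughly like this (a sketch; the domain and webroot are placeholders, and flags may vary between releases):

# Obtain a certificate via the webroot challenge, without touching the web server config
./letsencrypt-auto certonly --webroot \
    -w /var/www/example -d example.com -d www.example.com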

That said, the web server ecosystem is complex, and it would be impossible for any particular client to serve everyone well. As a result, the Let’s Encrypt community has created dozens of clients to meet many diverse needs. Moving forward, we feel it would be best for Let’s Encrypt to focus on promoting a generally healthy client and protocol ecosystem and for our client to move to the EFF. This will also allow us to focus our engineering efforts on running a reliable and rapidly growing CA server infrastructure.

The Let’s Encrypt client goes further than most other clients in terms of end-to-end automation and extensibility, both in getting certificates and, in many cases, installing them. This is an important strategy since major servers don’t yet have built-in support, and we want to make sure it’s given a proper chance to thrive. The EFF has led development of the Let’s Encrypt client from the beginning, and they are well-qualified to continue pursuing this strategy.

The rename is happening for reasons that go beyond the move to the EFF. One additional reason for the rename is that we want the client to be distributable and customisable without having to create a complex process for deciding whether customized variants are appropriate for use with Let’s Encrypt trademarks. Another reason is that we want it to be clear that the client can work with any ACME-enabled CA in the future, not just Let’s Encrypt.

We expect the client to do well at the EFF and continue to be used by many people to get certificates from Let’s Encrypt.

Bugfixing KVM live migration

Post Syndicated from Michael Chapman original http://www.anchor.com.au/blog/2015/11/bugfixing-kvm-live-migration/

Here at Anchor we really love our virtualization. Our virtualization platform of choice, KVM, lets us provide a variety of different VPS products to meet our customers’ requirements.
Our KVM hosting platform has evolved considerably over the six years it’s been in operation, and we’re always looking at ways we can improve it. One important aspect of this process of continual improvement, and one I am heavily involved in, is the testing of software upgrades before they are rolled out. This post describes a recent problem encountered during this testing, the analysis that led to discovering its cause, and how we have fixed it. Strap yourself in, this might get technical.

The bug’s first sightings
Until now, we have built most of our KVM hosts on Red Hat Enterprise Linux 6 — it’s fast, stable, and supported for a long time. Since the release of RHEL 7 a year ago, we have been looking at using it as well, perhaps even to eventually replace all our existing RHEL 6 hypervisors.
Of course, a big change like this can’t be made without a huge amount of testing. One set of tests is to check that “live migration” of virtual machines works correctly, both between RHEL 7 hypervisors and from RHEL 6 to RHEL 7 and back again.
Live migration is a rather complex affair. Before I describe live migration, however, I ought to explain a bit about how KVM works. KVM is itself just a Linux kernel module. It provides access to the underlying hardware’s virtualization extensions, which allows guests to run at near-native speeds without emulation. However, we need to provide our guests with a set of “virtual hardware” — things like a certain number of virtual CPUs, some RAM, some disk space, and any virtual network connections the guest might need. This virtual hardware is provided by software called QEMU.
When live migrating a guest, it is QEMU that performs all the heavy lifting:

QEMU synchronizes any non-shared storage for the guest (the synchronization is maintained for the duration of the migration).
QEMU synchronizes the virtual RAM for the guest across the two hypervisors (again for the duration of the migration). But remember, this is a live migration, which means the guest could be continually changing the contents of RAM and disk, so…
QEMU waits for the amount of “out-of-sync” data to fall below a certain threshold, at which point it pauses the guest (i.e. it turns off the in-kernel KVM component for the guest).
QEMU synchronizes the remaining out-of-sync data, then resumes the guest on the new hypervisor.

Since the guest is only paused while synchronizing a small amount of out-of-sync RAM (and an even smaller amount of disk), we can limit the impact of the migration upon the guest’s operation. We’ve tuned things so that most migrations can be performed with the guest paused for no longer than a second.
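For illustration, on a libvirt-managed hypervisor a live migration of a guest with non-shared storage can be kicked off with something along these lines (the guest name and destination URI are invented for this example):

# virsh migrate --live --copy-storage-all guest01 qemu+ssh://dest-hypervisor.example.com/system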
So this is where our testing encountered a problem. We had successfully tested live migrations between RHEL 7 hypervisors, as well as from those running RHEL 6 to those running RHEL 7. But when we tried to migrate a guest from a RHEL 7 hypervisor to a RHEL 6 one, something went wrong: the guest remained paused after the migration! What could be the problem?
Some initial diagnosis
The first step in diagnosing any problem is to gather as much information as you can. We have a log file for each of our QEMU processes. Looking at the log file for the QEMU process “receiving” the live migration (i.e. on the target hypervisor) I found this:
KVM: entry failed, hardware error 0x80000021

If you’re running a guest on an Intel machine without unrestricted mode
support, the failure can be most likely due to the guest entering an invalid
state for Intel VT. For example, the guest maybe running in big real mode
which is not supported on less recent Intel processors.

RAX=ffffffff8101c980 RBX=ffffffff818e2900 RCX=ffffffff81855120 RDX=0000000000000000
RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=ffffffff81803ef0
R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
R12=ffffffff81800000 R13=0000000000000000 R14=00000000ffffffed R15=ffffffff81a27000
RIP=ffffffff81051c02 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00100
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0000 0000000000000000 ffffffff 00c00100
FS =0000 0000000000000000 ffffffff 00c00100
GS =0000 ffff88003fc00000 ffffffff 00c00100
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0040 ffff88003fc10340 00002087 00008b00 DPL=0 TSS64-busy
GDT= ffff88003fc09000 0000007f
IDT= ffffffffff574000 00000fff
CR0=8005003b CR2=00007f6bee823000 CR3=000000003d2c0000 CR4=000006f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=00 00 fb c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 fb f4 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00

What appears to have happened here is that the entire migration process worked correctly up to the point at which the QEMU process needed to resume the guest… but when it actually tried to do so, the guest failed to start properly. QEMU dumps out the guest’s CPU registers when this occurs. “Hardware error 0x80000021” is unfortunately a rather generic error code — it simply means “invalid guest state”. But what could be wrong with the guest state? It was just running a moment ago on the other hypervisor; how did the migration make it invalid, if live migration is supposed to copy every part of the guest state intact?
Given that all of our other migration tests were passing, what I needed to do was compare this “bad” migration with one of the “good” ones. In particular, I wanted to get the very same register dump out of a “good” migration, so that I could compare it with this “bad” migration’s register dump.
QEMU itself does not seem to have the ability to do this (after all, if a migration is successful, why would you need a register dump?), which meant I would have to change the way QEMU works. Rather than patching the QEMU software then and there, I found it easiest to modify its behaviour through GDB. By attaching a debugger to the QEMU process, I could have it stop at just the right moment, dump out the guest’s CPU registers, then continue on as if nothing had occurred:
# gdb -p 8332

(gdb) break kvm_cpu_exec
Breakpoint 1 at 0x7f25ec576050: file /usr/src/debug/qemu-2.4.0/kvm-all.c, line 1788.
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>call cpu_dump_state(cpu, stderr, fprintf, CPU_DUMP_CODE)
>disable 1
>continue
>end
(gdb) continue
Continuing.
[Thread 0x7f2596fff700 (LWP 8339) exited]
[New Thread 0x7f25941ff700 (LWP 8357)]
[New Thread 0x7f2596fff700 (LWP 8410)]
[New Thread 0x7f25939fe700 (LWP 8411)]
[Thread 0x7f25939fe700 (LWP 8411) exited]
[Thread 0x7f2596fff700 (LWP 8410) exited]
[Switching to Thread 0x7f25d8533700 (LWP 8336)]

Breakpoint 1, kvm_cpu_exec (cpu@entry=0x7f25ee8cc000) at /usr/src/debug/qemu-2.4.0/kvm-all.c:1788
1788 {
[Switching to Thread 0x7f25d7d32700 (LWP 8337)]

Success! This produced a new register dump:
RAX=ffffffff8101c980 RBX=ffffffff818e2900 RCX=ffffffff81855120 RDX=0000000000000000
RSI=0000000000000000 RDI=0000000000000000 RBP=0000000000000000 RSP=ffffffff81803ef0
R8 =0000000000000000 R9 =0000000000000000 R10=0000000000000000 R11=0000000000000000
R12=ffffffff81800000 R13=0000000000000000 R14=00000000ffffffed R15=ffffffff81a27000
RIP=ffffffff81051c02 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=1
ES =0000 0000000000000000 ffffffff 00000000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS [-WA]
DS =0000 0000000000000000 ffffffff 00000000
FS =0000 0000000000000000 ffffffff 00000000
GS =0000 ffff88003fc00000 ffffffff 00000000
LDT=0000 0000000000000000 000fffff 00000000
TR =0040 ffff88003fc10340 00002087 00008b00 DPL=0 TSS64-busy
GDT= ffff88003fc09000 0000007f
IDT= ffffffffff574000 00000fff
CR0=8005003b CR2=00007f0817db3000 CR3=000000003a45d000 CR4=000006f0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=00 00 fb c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 fb f4 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 f4 c3 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00

So now I was able to compare this “good” register dump with the previous “bad” one. The most important differences seemed to be related to the “segment registers”:
bad: ES =0000 0000000000000000 ffffffff 00c00100
good: ES =0000 0000000000000000 ffffffff 00000000

bad: DS =0000 0000000000000000 ffffffff 00c00100
good: DS =0000 0000000000000000 ffffffff 00000000

bad: FS =0000 0000000000000000 ffffffff 00c00100
good: FS =0000 0000000000000000 ffffffff 00000000

bad: GS =0000 ffff88003fc00000 ffffffff 00c00100
good: GS =0000 ffff88003fc00000 ffffffff 00000000

bad: LDT=0000 0000000000000000 ffffffff 00c00000
good: LDT=0000 0000000000000000 000fffff 00000000

Those fields at the end contained different values in the “bad” and “good” migrations. Could they be the cause of the “invalid guest state”?
Memory segmentation
To understand what’s going on here, we need to know a bit about how x86 memory segmentation works. Once upon a time, this was really simple: a 16-bit CS (code segment), DS (data segment) or SS (stack segment) register was simply shifted left by 4 bits and added to a 16-bit offset in order to form a 20-bit absolute address. For example, CS = 0x1234 and IP = 0x0010 together address 0x1234 * 16 + 0x0010 = 0x12350.
But “protected mode” (introduced in the Intel 80286) complicated things greatly. Instead of a 16-bit segment number, each segment register held:

a 16-bit “segment selector”;
a “base address” for the segment;
the segment’s “size”;
a set of “flags” to keep track of things like whether the segment can be written to, whether the segment is actually present in physical RAM, and so on.

These are the four fields you can see in the segment registers shown above.
But hang on… this guest wasn’t running in “protected mode”. It was a 64-bit guest running a 64-bit operating system; it was running in what’s called “long mode”, and for the most part long mode doesn’t use segmentation (the FS and GS base addresses, used for things like thread-local and per-CPU data, are the main exception). The particular values in the segment registers listed above are mostly irrelevant, because the CPU isn’t actively using those registers.
So at this point I knew that the segment registers had different flags in the “bad” migration than they did in the “good” migration. But if the registers weren’t being used, why would the flags matter?
“Unusable” memory segments
It took a fair bit of trawling through QEMU and kernel source code and Intel’s copious documentation before I found the answer. It turns out that there is a hidden flag, not visible in these register dumps, indicating whether a particular segment is “usable” or not. The usable flags are not part of the register dumps because they’re not really part of a guest’s CPU state; instead, they’re used by a hypervisor to tell the host CPU which of a guest’s segment registers should be loaded when a guest is started — and most importantly, this includes the times a guest is resumed immediately following a migration.
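For reference, the kernel’s side of the interface does carry this flag explicitly. The segment register structure that KVM exchanges with userspace, struct kvm_segment from the Linux UAPI header linux/kvm.h, looks like this:

struct kvm_segment {
    __u64 base;
    __u32 limit;
    __u16 selector;
    __u8  type;
    __u8  present, dpl, db, s, l, g, avl;
    __u8  unusable;
    __u8  padding;
};

Note the dedicated “unusable” byte near the end.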
Next up, I needed to see how KVM and QEMU dealt with these “unusable” segments. So long as each register’s “unusable” flag is included in the migration, the complete guest state should be recoverable on the other side.
Interestingly, it seems that QEMU does not track the “unusable” flag for each segment. The two functions responsible for translating between KVM’s and QEMU’s representations of these segment registers (get_seg and set_seg) would throw away the “unusable” flag when retrieving a register from the kernel, and always clear it when loading the register back into the kernel. How could this ever have worked correctly?
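Indeed, QEMU’s internal representation of a segment register, the SegmentCache structure from target-i386/cpu.h, has nowhere to record it: everything beyond the selector, base and limit is packed into a single 32-bit flags word in descriptor format, and it is this flags word that is carried across in a migration.

typedef struct SegmentCache {
    uint32_t selector;
    target_ulong base;
    uint32_t limit;
    uint32_t flags;
} SegmentCache;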
The question was finally answered when I looked at the kernel versions involved:

On the RHEL 6 kernel, when retrieving a guest’s segment registers the kernel would automatically clear the flags for a segment if the segment was marked “unusable”. When loading the guest’s segment registers again, it would treat a segment with a cleared set of flags as if it were “unusable”, even if QEMU had not said so.
On the RHEL 7 kernel, however, the kernel would not touch the flags at all when they were retrieved. On loading the segment registers again, it would treat a segment as “unusable” only if QEMU said so, or if one specific flag — the “segment is present” flag — were not set.

Although these kernels have different behaviour, each is self-consistent: everything works as long as both hypervisors involved in a migration run the same kernel. But if you try to migrate a guest from a RHEL 7 hypervisor to a RHEL 6 hypervisor, the flags of an “unusable” segment are never cleared, and the receiving kernel has no way of knowing that the register should be marked unusable. You can see this in the “bad” dump above: flag values such as 00c00100 have the “segment is present” bit (bit 15) clear, yet nothing marks those segments as unusable. The result is that the guest resumes with an invalid segment register loaded, so the hardware throws an “invalid guest state” error. Bingo — that’s exactly what we’d seen!
The fix
The fix turned out to be quite simple: have QEMU clear the flags of any segment register that is marked unusable, and have it ensure that segment registers whose “present” flag is clear are marked unusable when loading them into the kernel:
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 80d1a7e..588df76 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -997,7 +997,7 @@ static void set_seg(struct kvm_segment *lhs, const SegmentCache *rhs)
     lhs->l = (flags >> DESC_L_SHIFT) & 1;
     lhs->g = (flags & DESC_G_MASK) != 0;
     lhs->avl = (flags & DESC_AVL_MASK) != 0;
-    lhs->unusable = 0;
+    lhs->unusable = !lhs->present;
     lhs->padding = 0;
 }
 
@@ -1006,14 +1006,18 @@ static void get_seg(SegmentCache *lhs, const struct kvm_segment *rhs)
     lhs->selector = rhs->selector;
     lhs->base = rhs->base;
     lhs->limit = rhs->limit;
-    lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
-                 (rhs->present * DESC_P_MASK) |
-                 (rhs->dpl << DESC_DPL_SHIFT) |
-                 (rhs->db << DESC_B_SHIFT) |
-                 (rhs->s * DESC_S_MASK) |
-                 (rhs->l << DESC_L_SHIFT) |
-                 (rhs->g * DESC_G_MASK) |
-                 (rhs->avl * DESC_AVL_MASK);
+    if (rhs->unusable) {
+        lhs->flags = 0;
+    } else {
+        lhs->flags = (rhs->type << DESC_TYPE_SHIFT) |
+                     (rhs->present * DESC_P_MASK) |
+                     (rhs->dpl << DESC_DPL_SHIFT) |
+                     (rhs->db << DESC_B_SHIFT) |
+                     (rhs->s * DESC_S_MASK) |
+                     (rhs->l << DESC_L_SHIFT) |
+                     (rhs->g * DESC_G_MASK) |
+                     (rhs->avl * DESC_AVL_MASK);
+    }
 }
 
 static void kvm_getput_reg(__u64 *kvm_reg, target_ulong *qemu_reg, int set)

With both of these changes in place, a migration would work even if we were migrating to or from an “old” version of QEMU without the fix. Moreover, it would mean we could get the fix rolled out without having to change the kernels involved.
At present we are still testing these changes; however, we look forward to working with the upstream QEMU developers to have them added to the mainline version of QEMU.
In writing this blog post I’ve skipped over many of the dead ends I hit while solving this problem. While the fix ended up reasonably straightforward (well, as straightforward as can be expected when you’re dealing with kernels and hypervisors), it was a fun and educational journey getting there.
Got a question or comment? We’d love to hear from you!
The post Bugfixing KVM live migration appeared first on Anchor Cloud Hosting.