Tag Archives: Data Center

Cloudflare Network expands to more than 100 Countries

Post Syndicated from Kevin Dompig original https://blog.cloudflare.com/cloudflare-network-expands-to-more-than-100-countries/

2020 has been a historic year that will forever be associated with the COVID-19 pandemic. Over the past six months, we have seen societies, businesses, and entire industries unsettled. The situation at Cloudflare has been no different. And while this pandemic has affected each and every one of us, we here at Cloudflare have not forgotten what our mission is: to help build a better Internet.

We have expanded our global network to 206 cities across more than 100 countries. This is in addition to completing 40+ datacenter expansion projects and adding over 1Tbps in dedicated “backbone” (transport) capacity connecting our major data centers so far this year.

Pandemic times mean new processes

There was zero chance that 2020 would mean business as usual within the Infrastructure department. We were thrown a curveball as the pandemic began affecting our supply chains and operations. By April, the vast majority of the world’s passenger flights were grounded. The majority of bulk air freight ships within the lower deck (“belly”) of these flights, so the sudden 74% decrease in passenger belly cargo capacity relative to the same period last year left supply and demand badly imbalanced.

We were fortunate to have existing logistics partners who were involved in medical equipment distribution and subsequently offered to help us maintain our important global infrastructure cargo flows. In one example, our logistics partner Expeditors, already operating with limited staff to reduce the risk of exposure, went above and beyond to secure us space on “freight only” converted passenger aircraft flights from Taipei.

Six new cities

New cities represent more than a new dot on the map for Cloudflare. These cities represent unique partnerships with ISPs all over the world which allow Cloudflare to bring the Internet closer to an ISP’s end-users and increase our edge compute capabilities in the region.

When we enter into a partnership with an ISP it is also a commitment to that ISP, and to our customers, to increase the security, performance, and reliability of the more than 27 million Internet properties on our network. These accomplishments would not be possible if not for our partners and the individuals who make up Cloudflare’s Infrastructure and Network Operations teams.

Of the six new cities we’ve turned up this year, five represent new countries for Cloudflare, including one very close to my heart.

  • Vientiane, Laos*. Our first data center in Laos, a country where accessible high-speed Internet has only been available for a few decades. The number of Internet users in Laos has grown exponentially, from just under 6,000 residents in 2000 to over 1,000,000 residents by the time 4G LTE service launched in 2015.
  • Tegucigalpa, Honduras*. The nation’s capital, Tegucigalpa, is the most populous city in Honduras, and this marks our first data center in-country. Today only 41% of the population are active Internet users, the lowest share of any country in Central America, and we are excited to help bring faster and more reliable Internet to the people of Honduras.
  • Johor Bahru, Malaysia. Our second data center in Malaysia is located in the second-largest city in the country. Johor Bahru serves as a gateway between Singapore and Malaysia and we are proud to be a participant on the Johor Bahru Internet Exchange (JBIX), the second IX to launch in Malaysia. Turning up this site was a challenge for our deployment partners, who went above and beyond to install and provision during the beginning stages of nationwide COVID-19 lockdowns.
  • Monrovia, Liberia*. Our 15th data center on the continent of Africa shares a history with the United States. Liberia as we know it today first began as a settlement for freed slaves from the United States almost 200 years ago. Monrovia, the nation’s capital, was the largest settlement and today makes up roughly one-third of the population of Liberia.
  • Bandar Seri Begawan, Brunei*. Although Brunei is the second-smallest country in Southeast Asia by land area (only Singapore is smaller), and the smallest by population, it is currently ranked 4th in the world for the highest share of social media users, at 94% of the population.
  • Paramaribo, Suriname*. While Suriname is one of the smallest countries in South America it is one of the most diverse countries in all of the Americas. 80% of the country is covered by tropical rainforest and 70% of the nation’s population resides in just two urban districts which only account for 0.5% of the country’s total land. This also happens to be my home country where my entire family is from, and where I was born.

* = New Country for Cloudflare’s global network

Cloudflare network expansion hits home for our team

50th-percentile performance improvement versus other edge networks when Surinamese traffic started being served in-country. // Source: Cedexis

When I shared the news with my Surinamese family that Cloudflare had turned up a data center in Suriname they were extremely proud. More importantly, they are excited at the opportunity to see increased performance and reliability on the Internet. When my family immigrated to the United States more than 30 years ago, one of the hardships of establishing this new life was communicating with loved ones back home.

Even with the advent of global communication via social media, challenges still persist. Having access to reliable and secure Internet is still a luxury in many parts of the world, and Suriname is no different. For many years my family relied on wireless carriers for reliable Internet connectivity. That has changed significantly in the last decade with improvements to Suriname’s Internet infrastructure. One big challenge remains the physical distance between Internet users in Suriname and where the content (servers) is located.

When we partner with a local ISP in-country, it allows us to shorten that distance and bring the Internet closer. I am proud to work for a company that is helping deliver better access to Internet users like my family in Suriname and millions of others around the world.

Cloudflare is still growing, even when everyone is working from home

During these times communicating over the Internet has become essential for everyone. As more businesses and people are shifting to Internet-based operations our work is more critical than ever. We’re only halfway through the year and have a lot more work to do. This means we are continuing to grow our company by hiring and offering fully remote onboarding. Check out our careers page for more information on full-time positions and internship roles across the globe.

Introducing Regional Services

Post Syndicated from Achiel van der Mandele original https://blog.cloudflare.com/introducing-regional-services/

In a world where workloads increasingly shift to the cloud, it is often unclear how data travels across the Internet and in which countries it is processed. Today, Cloudflare is pleased to announce that we’re giving our customers control: with Regional Services, customers have full control over exactly where their traffic is handled.

We operate a global network spanning more than 200 cities. Each data center runs servers with the exact same software stack. This has enabled Cloudflare to quickly and efficiently add capacity where needed. It also allows our engineers to ship features with ease: deploy once and it’s available globally.

The same benefit applies to our customers: configure once and that change is applied everywhere in seconds, regardless of whether they’re changing security features, adding a DNS record or deploying a Cloudflare Worker containing code.

Having a homogenous network is great from a routing point of view: whenever a user performs an HTTP request, the closest datacenter is found due to Cloudflare’s Anycast network. BGP looks at the hops that would need to be traversed to find the closest data center. This means that someone near the Canadian border (let’s say North Dakota) could easily find themselves routed to Winnipeg (inside Canada) instead of a data center in the United States. This is generally what our customers want and expect: find the fastest way to serve traffic, regardless of geographic location.

Some organizations, however, have expressed preferences for maintaining regional control over their data for a variety of reasons. For example, they may be bound by agreements with their own customers that include geographic restrictions on data flows or data processing. As a result, some customers have requested control over where their web traffic is serviced.

Regional Services gives our customers the ability to accommodate regional restrictions while still using Cloudflare’s global edge network. As of today, Enterprise customers can add Regional Services to their contracts. With Regional Services, customers can choose which subset of data centers are able to service traffic on the HTTP level. But we’re not reducing network capacity to do this: that would not be the Cloudflare Way. Instead, we’re allowing customers to use our entire network for DDoS protection but limiting the data centers that apply higher-level layer 7 security and performance features such as WAF, Workers, and Bot Management.

Traffic is ingested on our global Anycast network at the location closest to the client, as usual, and then passed to data centers inside the geographic region of the customer’s choice. TLS keys are only stored and used to actually handle traffic inside that region. This gives our customers the benefit of our huge, low-latency, high-throughput network, capable of withstanding even the largest DDoS attacks, while also giving them local control: only data centers inside a customer’s preferred geographic region will have the access necessary to apply security policies.

The diagram below shows how this process works. When users connect to Cloudflare, they hit the closest data center to them, by nature of our Anycast network. That data center detects and mitigates DDoS attacks. Legitimate traffic is passed through to a data center within the geographic region of the customer’s choosing. Inside that data center, traffic is inspected at OSI layer 7 and HTTP products can work their magic (a simplified sketch of this flow follows the list below):

  • Content can be returned from and stored in cache
  • The WAF looks inside the HTTP payloads
  • Bot Management detects and blocks suspicious activity
  • Workers scripts run
  • Access policies are applied
  • Load Balancers look for the best origin to service traffic
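
To make that flow concrete, below is a minimal, conceptual sketch in Go of the two-tier handling described above. It is not Cloudflare’s implementation: the region lookup, the DDoS check, the hostnames, and the regional endpoint are all hypothetical stand-ins.

// regional_sketch.go: a conceptual sketch of the two-tier flow above, not
// Cloudflare's implementation. Every name below (the zone-to-region map, the
// DDoS check, the regional endpoint) is a hypothetical placeholder.
package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

const localRegion = "us" // region this (hypothetical) data center belongs to

// zoneRegion returns the region a customer restricted their zone to, or ""
// if the zone is unrestricted. Hard-coded here purely for illustration.
func zoneRegion(host string) string {
    restricted := map[string]string{"eu-only.example.com": "eu"}
    return restricted[host]
}

// isAttackTraffic stands in for the DDoS mitigation that runs in every data
// center, regardless of any regional restriction.
func isAttackTraffic(r *http.Request) bool { return false }

// applyLayer7 stands in for WAF, Workers, caching, Bot Management, and the
// other HTTP-level features that only run inside the chosen region.
func applyLayer7(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("handled in-region\n"))
}

func edgeHandler(w http.ResponseWriter, r *http.Request) {
    if isAttackTraffic(r) {
        http.Error(w, "blocked", http.StatusForbidden)
        return
    }
    if region := zoneRegion(r.Host); region != "" && region != localRegion {
        // This data center is outside the customer's chosen region, so hand
        // the request to one inside that region instead of applying L7 here.
        target, _ := url.Parse("https://" + region + ".edge.example.net")
        httputil.NewSingleHostReverseProxy(target).ServeHTTP(w, r)
        return
    }
    applyLayer7(w, r)
}

func main() {
    log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(edgeHandler)))
}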

Today’s launch includes preconfigured geographic regions, and we’ll look to add more depending on customer demand. US and EU regions are available immediately, meaning layer 7 (HTTP) products can be configured to be applied only within those regions and not outside of them.

The US and EU maps are depicted below. Purple dots represent data centers that apply DDoS protection and network acceleration. Orange dots represent data centers that process traffic.

Maps of the US and EU regions

We’re very excited to provide new tools to our customers, allowing them to dictate which of our data centers employ HTTP features and which do not. If you’re interested in learning more, contact [email protected]

Securing Memory at EPYC Scale

Post Syndicated from Derek Chamorro original https://blog.cloudflare.com/securing-memory-at-epyc-scale/

Security is a serious business, one that we do not take lightly at Cloudflare. We have invested a lot of effort into ensuring that our services, both external and internal, are protected by meeting or exceeding industry best practices. Encryption is a huge part of our strategy, as it is embedded in nearly every process we have. At Cloudflare, we encrypt data both in transit (on the network) and at rest (on the disk). Both practices address some of the most common vectors used to exfiltrate information, and these measures serve to protect sensitive data from attackers. But what about data currently in use?

Can encryption or any technology eliminate all threats? No, but as Infrastructure Security, it’s our job to consider worst-case scenarios. For example, what if someone were to steal a server from one of our data centers? How can we leverage the most reliable, cutting edge, innovative technology to secure all data on that host if it were in the wrong hands? Would it be protected? And, in particular, what about the server’s RAM?

Data in random access memory (RAM) is usually stored in the clear. This can leave data vulnerable to software or hardware probing by an attacker on the system. Extracting data from memory isn’t an easy task but, with the rise of persistent memory technologies, additional attack vectors are possible:

  • Dynamic random-access memory (DRAM) interface snooping
  • Installation of hardware devices that access host memory
  • Freezing and stealing dual in-line memory modules (DIMMs)
  • Stealing non-volatile dual in-line memory modules (NVDIMMs)

So, what about enclaves? Hardware manufacturers have introduced Trusted Execution Environments (also known as enclaves) to help create security boundaries by isolating software execution at runtime, so that sensitive data can be processed in a trusted environment, such as a secure area inside an existing processor or a Trusted Platform Module.

While this allows developers to shield applications in untrusted environments, it doesn’t effectively address all of the physical system attacks mentioned previously. Enclaves were also meant to run small pieces of code. You could run an entire OS in an enclave, but there are limitations and performance issues in doing so.

This isn’t meant to bash enclave usage; we just wanted a better solution for encrypting all memory at scale. We expected performance to be compromised, and conducted tests to see just how much.

Time to get EPYC

Since we are using AMD for our tenth generation “Gen X servers”, we found an interesting security feature within the System on a Chip architecture of the AMD EPYC line. Secure Memory Encryption (SME) is an x86 instruction set extension introduced by AMD and available in the EPYC processor line. SME provides the ability to mark individual pages of memory as encrypted using standard x86 page tables. A page that is marked encrypted will be automatically decrypted when read from DRAM and encrypted when written to DRAM. SME can therefore be used to protect the contents of DRAM from physical attacks on the system.

Sounds complicated, right? Here’s the secret: It isn’t 😀

Components

SME consists of two components:

  • AES-128 encryption engine: Embedded in the memory controller. It is responsible for encrypting and decrypting data in main memory when an appropriate key is provided via the Secure Processor.
  • AMD Secure Processor (AMD-SP): An on-die 32-bit ARM Cortex A5 CPU that provides cryptographic functionality for secure key generation and key management. Think of this like a mini hardware security module that uses a hardware random number generator to generate the 128-bit key(s) used by the encryption engine.

How It Works

We had two options available to us when it came to enabling SME. The first option, regular SME, requires enabling the SMEE bit in the model-specific register MSR 0xC001_0010. This enables the ability to set a page table entry encryption bit:

  • 0 = memory encryption features are disabled
  • 1 = memory encryption features are enabled

After memory encryption is enabled, a physical address bit (the C-bit) is used to mark whether a memory page is protected. The operating system sets this bit to 1 in the page table entry (PTE) to indicate the page should be encrypted, which causes any data assigned to that memory space to be automatically encrypted and decrypted by the AES engine in the memory controller.
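
Before touching the BIOS or kernel, it can be useful to confirm that the CPU even advertises SME (CPUID Fn8000_001F[EAX] bit 0). The small Go sketch below assumes the Linux kernel surfaces that capability as an "sme" entry in the /proc/cpuinfo flags line, which is worth double-checking on your particular kernel version.

// smecheck.go: report whether the processor advertises Secure Memory
// Encryption. Assumes the kernel exposes the CPUID capability
// (Fn8000_001F[EAX] bit 0) as an "sme" flag in /proc/cpuinfo.
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func smeSupported() (bool, error) {
    f, err := os.Open("/proc/cpuinfo")
    if err != nil {
        return false, err
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    for scanner.Scan() {
        line := scanner.Text()
        if !strings.HasPrefix(line, "flags") {
            continue
        }
        for _, flag := range strings.Fields(line) {
            if flag == "sme" {
                return true, nil
            }
        }
    }
    return false, scanner.Err()
}

func main() {
    ok, err := smeSupported()
    if err != nil {
        fmt.Fprintln(os.Stderr, "error:", err)
        os.Exit(1)
    }
    fmt.Println("SME capable:", ok)
}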


Becoming More Transparent

While arbitrarily flagging which page table entries we want encrypted is nice, our objective is to ensure that we are incorporating the full physical protection of SME. This is where the second mode of SME came in, Transparent SME (TSME). In TSME, all memory is encrypted regardless of the value of the encrypt bit on any particular page. This includes both instruction and data pages, as well as the pages corresponding to the page tables themselves.

Enabling TSME is as simple as:

  1. Setting a BIOS flag.

  2. Enabling kernel support with the following flag:

CONFIG_AMD_MEM_ENCRYPT=y

After a reboot you should see the following in dmesg:

$ sudo dmesg | grep SME
[    2.537160] AMD Secure Memory Encryption (SME) active

Performance Testing

To weigh the pros and cons of implementation against the potential risk of a stolen server, we had to test the performance of enabling TSME. We took a test server that mirrored a production edge metal with the following specs:

  • Memory: 8 x 32GB 2933MHz
  • CPU: AMD 2nd Gen EPYC 7642 with SMT enabled and running NPS4 mode
  • OS: Debian 9
  • Kernel: 5.4.12

The performance tools we used were:

Stream

We used a custom STREAM binary with 24 threads, using all available cores, to measure the sustainable memory bandwidth (in MB/s). Four synthetic computational kernels are run, with the output of each kernel being used as an input to the next. The best rates observed are reported for each choice of thread count.
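
For a rough sense of what such a measurement looks like, here is a much simpler Go sketch of the triad kernel (a[i] = b[i] + scalar*c[i]). It is not the custom 24-thread STREAM binary behind the charts; the array sizes (about 1.5 GiB across the three arrays), the thread count, and the single timed pass are all illustrative choices.

// triad.go: a rough, single-kernel approximation of the STREAM "triad" for
// estimating sustainable memory bandwidth. Array size, thread count, and the
// single timed pass are illustrative, not the settings used for the charts.
package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
)

func main() {
    const n = 1 << 26 // 64M float64s per array (~512 MiB each), far larger than any cache
    a := make([]float64, n)
    b := make([]float64, n)
    c := make([]float64, n)
    for i := range b {
        b[i] = 1.0
        c[i] = 2.0
    }

    threads := runtime.NumCPU()
    chunk := n / threads
    const scalar = 3.0

    start := time.Now()
    var wg sync.WaitGroup
    for t := 0; t < threads; t++ {
        lo, hi := t*chunk, (t+1)*chunk
        if t == threads-1 {
            hi = n // pick up any remainder in the last chunk
        }
        wg.Add(1)
        go func(lo, hi int) {
            defer wg.Done()
            for i := lo; i < hi; i++ {
                a[i] = b[i] + scalar*c[i]
            }
        }(lo, hi)
    }
    wg.Wait()
    elapsed := time.Since(start).Seconds()

    // Triad touches three 8-byte elements per index: two reads and one write.
    bytes := float64(3 * 8 * n)
    fmt.Printf("triad bandwidth: %.0f MB/s\n", bytes/elapsed/1e6)
}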

STREAM results: sustained memory bandwidth with TSME enabled vs. disabled

The figures above show a 2.6% to 4.2% performance variation, with a mean of 3.7%. These were the highest numbers we measured, and they still fell below the 5% performance impact we had expected.

Cryptsetup

While cryptsetup is normally used for encrypting disk partitions, it has a benchmarking utility that will report on a host’s cryptographic performance by iterating key derivation functions using memory only:

Example:

$ sudo cryptsetup benchmark
# Tests are approximate using memory only (no storage IO).
PBKDF2-sha1      1162501 iterations per second for 256-bit key
PBKDF2-sha256    1403716 iterations per second for 256-bit key
PBKDF2-sha512    1161213 iterations per second for 256-bit key
PBKDF2-ripemd160  856679 iterations per second for 256-bit key
PBKDF2-whirlpool  661979 iterations per second for 256-bit key

Cryptsetup benchmark results with TSME enabled vs. disabled

Benchmarky

Benchmarky is a homegrown tool provided by our Performance team to run synthetic workloads against a specific target to evaluate performance of different components. It uses Cloudflare Workers to send requests and read stats on responses. In addition to that, it also reports versions of all important stack components and their CPU usage. Each test runs 256 concurrent clients, grabbing a cached 10kB PNG image from a performance testing endpoint, and calculating the requests per second (RPS).
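
Benchmarky itself is an internal tool, but the client side of the measurement is easy to picture. The Go sketch below runs 256 concurrent clients against a placeholder URL for a fixed window and reports requests per second; the endpoint, the duration, and the lack of Workers-based stat collection are all simplifications.

// loadgen.go: a simplified, client-side analogue of the Benchmarky
// measurement. The URL and duration are placeholders.
package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
    "sync/atomic"
    "time"
)

func main() {
    const (
        concurrency = 256
        duration    = 30 * time.Second
        url         = "https://example.com/perf/10kb.png" // hypothetical cached test asset
    )

    var completed int64
    deadline := time.Now().Add(duration)
    client := &http.Client{Timeout: 10 * time.Second}

    var wg sync.WaitGroup
    for i := 0; i < concurrency; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for time.Now().Before(deadline) {
                resp, err := client.Get(url)
                if err != nil {
                    continue
                }
                io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
                resp.Body.Close()
                atomic.AddInt64(&completed, 1)
            }
        }()
    }
    wg.Wait()

    fmt.Printf("%.0f requests/sec\n", float64(completed)/duration.Seconds())
}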

Benchmarky results (requests per second) with TSME enabled vs. disabled

Conclusion

In the majority of test results, performance decreased by a nominal amount, actually less than we expected. AMD’s official white paper on SME even states that encryption and decryption of memory through the AES engine incurs a small amount of additional latency for DRAM memory accesses, dependent on the workload. Across all 11 data points, the average performance drop was only 0.699%. Even at scale, enabling this feature has reduced the worry that any data could be exfiltrated from a stolen server.

While we wait for other hardware manufacturers to add support for total memory encryption, we are happy that AMD has set the bar high and is protecting our next generation edge hardware.

Impact of Cache Locality

Post Syndicated from Sung Park original https://blog.cloudflare.com/impact-of-cache-locality/

In the past, we didn’t have the opportunity to evaluate as many CPUs as we do today. The hardware ecosystem was simple – Intel had consistently delivered industry leading processors. Other vendors could not compete with them on both performance and cost. Recently it all changed: AMD has been challenging the status quo with their 2nd Gen EPYC processors.

This is not the first time that Intel has been challenged; previously there was Qualcomm, and we also worked with AMD to evaluate their 1st Gen EPYC processors based on the original Zen architecture, but ultimately Intel prevailed. AMD did not give up and unveiled their 2nd Gen EPYC processors, codenamed Rome, based on the latest Zen 2 architecture.


Zen 2 made many improvements over its predecessors, including a die shrink from 14nm to 7nm, a doubling of the top-end core count from 32 to 64, and a larger L3 cache. Let’s emphasize the size of that L3 cache again: 32 MiB per Core Complex Die (CCD).

This time around, we have taken steps to understand our workloads at the hardware level through the use of hardware performance counters and profiling tools. Using these specialized CPU registers and profilers, we collected data on the AMD 2nd Gen EPYC and Intel Skylake-based Xeon processors in a lab environment, then validated our observations in production against other generations of servers from the past.

Simulated Environment

CPU Specifications


We evaluated several Intel Cascade Lake and AMD 2nd Gen EPYC processors, trading off various factors between power and performance; the AMD EPYC 7642 CPU came out on top. The majority of Cascade Lake processors have 1.375 MiB of L3 cache per core shared across all cores, a common theme that started with Skylake. On the other hand, the 2nd Gen EPYC processors start at 4 MiB per core. The AMD EPYC 7642 is a unique SKU since it has 256 MiB of L3 cache shared across its 48 cores. Having a cache this large (approximately 5.33 MiB per core) means that a program will spend fewer cycles fetching data from RAM, since more of its data can be readily available in the L3 cache.
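
That effect is easy to feel with a toy experiment: the Go sketch below pointer-chases a working set that fits in a large L3 cache and one that spills to DRAM, and reports the average time per access. The sizes are illustrative (the larger case allocates roughly 1.5 GiB while building its permutation), and real numbers will vary with prefetchers, NUMA placement, and whatever else shares the cache.

// cachewalk.go: a toy illustration of cache locality. Pointer-chasing a
// working set that fits in L3 versus one that spills to DRAM shows up
// directly as nanoseconds per dependent access.
package main

import (
    "fmt"
    "math/rand"
    "time"
)

var sink int32 // keeps the loads from being optimized away

// chase builds a random cyclic permutation (so the hardware prefetcher can't
// predict the next load) and returns the average time per access.
func chase(elems int) time.Duration {
    next := make([]int32, elems)
    perm := rand.Perm(elems)
    for i := 0; i < elems; i++ {
        next[perm[i]] = int32(perm[(i+1)%elems])
    }

    const steps = 1 << 24
    idx := int32(0)
    start := time.Now()
    for i := 0; i < steps; i++ {
        idx = next[idx]
    }
    sink = idx
    return time.Since(start) / steps
}

func main() {
    fmt.Println("16 MiB working set (cache-resident):", chase(4<<20))
    fmt.Println("512 MiB working set (DRAM-resident):", chase(128<<20))
}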

L3 cache layout: Before (Intel) vs. After (AMD)

Traditional cache layout has also changed with the introduction of 2nd Gen EPYC, a byproduct of AMD using a multi-chip module (MCM) design. The 256 MiB L3 cache is formed by 8 individual dies, or Core Complex Dies (CCDs); each CCD is made up of 2 Core Complexes (CCXs), with each CCX containing 16 MiB of L3 cache.

Core Complex (CCX) – Up to four cores
Core Complex Die (CCD) – Created by combining two CCXs
AMD 2nd Gen EPYC 7642 – Created with 8x CCDs plus an I/O die in the center

Methodology

Our production traffic shares many characteristics of a sustained workload, which typically does not induce large variation in operating frequency nor enter periods of idle time. We picked a simulated traffic pattern that closely resembled our production traffic behavior: a cached 10 KiB PNG served via HTTPS. We were interested in assessing the CPU’s maximum throughput, or requests per second (RPS), one of our key metrics. That said, we did not disable Intel Turbo Boost or AMD Precision Boost, nor did we match the frequencies clock-for-clock, while measuring requests per second, instructions retired per second (IPS), L3 cache miss rate, and sustained operating frequency.

Results

The 1P AMD 2nd Gen EPYC 7642 powered server took the lead and processed 50% more requests per second compared to our Gen 9’s 2P Intel Xeon Platinum 6162 server.


We are running a sustained workload, so we should end up with a sustained operating frequency that is higher than the base clock. The AMD EPYC 7642’s operating frequency, or the number of cycles the processor had at its disposal, was approximately 20% greater than the Intel Xeon Platinum 6162’s, so frequency alone was not enough to explain the 50% gain in requests per second.


Taking a closer look, the number of instructions retired over time was far greater on the AMD 2nd Gen EPYC 7642 server, thanks to its low L3 cache miss rate.


Production Environment

CPU Specifications


Methodology

Our predominant bottleneck appears to be cache memory, and we saw significant improvement in requests per second, as well as in time to process a request, due to the low L3 cache miss rate. The data we present in this section was collected at a point of presence that spanned Gen 7 to Gen 9 servers. We also collected data from a secondary region to gain additional confidence that the data we present here was not unique to one particular environment. Gen 9 is the baseline, just as in the previous section.

We put the 2nd Gen EPYC-based Gen X into production with hopes that the results would closely mirror what we had seen in the lab. We found that the requests per second did not quite align with the results we had hoped for, but the AMD EPYC server still outperformed all previous generations, beating the Intel-based Gen 9 server by 36%.


Sustained operating frequency was nearly identical to what we had seen in the lab.


Due to the lower than expected requests per second, we also saw lower instructions retired over time and higher L3 cache miss rate but maintained a lead over Gen 9, with 29% better performance.


Conclusion

The single AMD EPYC 7642 performed very well during our lab testing, beating our Gen 9 server, with its dual Intel Xeon Platinum 6162 processors and the same total number of cores. Key factors we noticed were its large L3 cache, which led to a low L3 cache miss rate, as well as a higher sustained operating frequency. The AMD 2nd Gen EPYC 7642 did not have as big an advantage in production, but it nevertheless still outperformed all previous generations. The observations we made in production were based on a single PoP and could have been influenced by a number of other factors, such as ambient temperature, timing, and new products that will shape our traffic patterns in the future, like WebAssembly on Cloudflare Workers. The AMD EPYC 7642 opens up the possibility for our upcoming Gen X server to maintain the same core count while processing more requests per second than its predecessor.

Got a passion for hardware? I think we should get in touch. We are always looking for talented and curious individuals to join our team. The data presented here would not have been possible if it was not for the teamwork between many different individuals within Cloudflare. As a team, we strive to work together to create highly performant, reliable, and secure systems that will form the pillars of our rapidly growing network that spans 200 cities in more than 90 countries and we are just getting started.

An EPYC trip to Rome: AMD is Cloudflare’s 10th-generation Edge server CPU

Post Syndicated from Rob Dinh original https://blog.cloudflare.com/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/

More than 1 billion unique IP addresses pass through the Cloudflare Network each day, serving on average 11 million HTTP requests per second and operating within 100ms of 95% of the Internet-connected population globally. Our network spans 200 cities in more than 90 countries, and our engineering teams have built an extremely fast and reliable infrastructure.

We’re extremely proud of our work and are determined to help make the Internet a better and more secure place. Cloudflare engineers who are involved with hardware get down to servers and their components to understand and select the best hardware to maximize the performance of our stack.

Our software stack is compute-intensive and very much CPU bound, driving our engineers to work continuously at optimizing Cloudflare’s performance and reliability at all layers of our stack. Within a server, a straightforward solution for increasing computing power is to have more CPU cores: the more cores we can include in a server, the more output we can expect. This is important for us since the diversity of our products and customers has grown over time, with increasing demand that requires our servers to do more. To help drive compute performance, we needed to increase core density, and that’s what we did. Below is the processor detail for servers we’ve deployed since 2015, including the core counts:

                 | Gen 6                 | Gen 7                 | Gen 8                  | Gen 9
Start of service | 2015                  | 2016                  | 2017                   | 2018
CPU              | Intel Xeon E5-2630 v3 | Intel Xeon E5-2630 v4 | Intel Xeon Silver 4116 | Intel Xeon Platinum 6162
Physical Cores   | 2 x 8                 | 2 x 10                | 2 x 12                 | 2 x 24
TDP              | 2 x 85W               | 2 x 85W               | 2 x 85W                | 2 x 150W
TDP per Core     | 10.65W                | 8.50W                 | 7.08W                  | 6.25W

In 2018, we made a big jump in the total number of cores per server with Gen 9. Our physical footprint was reduced by 33% compared to Gen 8, giving us increased capacity and computing power per rack. Thermal Design Power (TDP, i.e. typical power usage) is listed above to highlight that we’ve also become more power efficient over time. Power efficiency is important to us: first, because we’d like to be as carbon friendly as we can; and second, so we can better utilize the provisioned power supplied by the data centers. But we know we can do better.

Our main defining metric is Requests per Watt. We can increase our Requests per Second number with more cores, but we have to stay within our power budget envelope. We are constrained by the data centers’ power infrastructure which, along with our selected power distribution units, imposes a power cap for each server rack. Adding servers to a rack obviously increases power consumption at the rack level, and our operational costs rise significantly if we go over a rack’s power cap and have to provision another rack. What we need is more compute power inside the same power envelope, which will drive a higher (better) Requests per Watt number – our key metric.

As you might imagine, we look at power consumption carefully in the design stage. From the table above, you can see that it’s not worth deploying more power-hungry CPUs if their TDP per core is higher than our current generation’s, since that would hurt our Requests per Watt metric. As we started looking at production-ready systems to power our Gen X solution, we took a long look at what is available to us in the market today, and we’ve made our decision: we’re moving on from Gen 9’s 48-core setup of dual-socket Intel® Xeon® Platinum 6162s to a 48-core single-socket AMD EPYC™ 7642.

Gen X server setup with single socket 48-core AMD EPYC 7642

                  | Intel                        | AMD
CPU               | Xeon Platinum 6162           | EPYC 7642
Microarchitecture | “Skylake”                    | “Zen 2”
Codename          | “Skylake SP”                 | “Rome”
Process           | 14nm                         | 7nm
Physical Cores    | 2 x 24                       | 48
Frequency         | 1.9 GHz                      | 2.4 GHz
L3 Cache / socket | 24 x 1.375MiB                | 16 x 16MiB
Memory / socket   | 6 channels, up to DDR4-2400  | 8 channels, up to DDR4-3200
TDP               | 2 x 150W                     | 225W
PCIe / socket     | 48 lanes                     | 128 lanes
ISA               | x86-64                       | x86-64

From the specs, we see that with the AMD chip we keep the same number of cores at a lower TDP. Gen 9’s TDP per core was 6.25W; Gen X’s will be 4.69W. That’s a 25% decrease. With a higher frequency, and perhaps a simpler single-socket setup, we can speculate that the AMD chip will perform better. We walk through a series of tests, simulations, and live production results in the rest of this blog to see how much better AMD performs.

As a side note before we go further, TDP is a simplified metric from the manufacturers’ datasheets that we use in the early stages of our server design and CPU selection process. A quick Google search shows that AMD and Intel define TDP differently, which makes the spec hard to rely on. Actual CPU power draw, and more importantly server system power draw, are what we really factor into our final decisions.


Ecosystem Readiness

At the beginning of our journey to choose our next CPU, we got a variety of processors from different vendors that could fit well with our software stack and services, which are written in C, LuaJIT, and Go. More details about benchmarking for our stack were explained when we benchmarked Qualcomm’s ARM® chip in the past. We’re going to go through the same suite of tests as Vlad’s blog this time around since it is a quick and easy “sniff test”. This allows us to test a bunch of CPUs within a manageable time period before we commit to spend more engineering effort and need to apply our software stack.

We tried a variety of CPUs with different numbers of cores, sockets, and frequencies. Since we’re explaining how we chose the AMD EPYC 7642, all of the graphs in this blog focus on how AMD compares with our Gen 9 Intel Xeon Platinum 6162 CPU as a baseline.

Our results correspond to a full server node for both CPUs tested, meaning the numbers pertain to 2x 24-core processors for Intel and 1x 48-core processor for AMD – a two-socket Intel-based server and a one-socket AMD EPYC powered server. Before we started our testing, we changed the Cloudflare lab test servers’ BIOS settings to match our production server settings. This yielded average CPU frequencies of 3.03 GHz for AMD and 2.50 GHz for Intel, with very little variation. With gross simplification, we expect that with the same number of cores AMD would perform about 21% better than Intel. Let’s start with our crypto tests.

Cryptography


Looking promising for AMD. In public key cryptography, it does 18% better. Meanwhile for symmetric key, AMD loses on AES-128-GCM but it’s comparable overall.

Compression

We do a lot of compression at the edge to save bandwidth and help deliver content faster. We go through both the zlib and brotli libraries, written in C. All tests are done on a blog.cloudflare.com HTML file held in memory.
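
For a flavor of what these compression runs look like, here is a small Go sketch that compresses a single HTML document in memory at each gzip level and reports throughput. The actual benchmarks used the zlib and brotli C libraries; Go’s compress/gzip stands in here, and the input file path is a placeholder.

// gzipbench.go: an illustrative in-memory compression measurement. The
// original tests used zlib/brotli in C; compress/gzip stands in here, and
// "blog.html" is a placeholder for the saved blog.cloudflare.com page.
package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "os"
    "time"
)

func main() {
    html, err := os.ReadFile("blog.html")
    if err != nil {
        panic(err)
    }

    for level := gzip.BestSpeed; level <= gzip.BestCompression; level++ {
        const iterations = 200
        start := time.Now()
        var compressedLen int
        for i := 0; i < iterations; i++ {
            var buf bytes.Buffer
            w, _ := gzip.NewWriterLevel(&buf, level)
            w.Write(html)
            w.Close()
            compressedLen = buf.Len()
        }
        elapsed := time.Since(start).Seconds()
        mbps := float64(len(html)*iterations) / elapsed / 1e6
        fmt.Printf("gzip -%d: %6.1f MB/s in, %d -> %d bytes\n",
            level, mbps, len(html), compressedLen)
    }
}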


AMD wins by an average of 29% using gzip across all qualities. It does even better with brotli at qualities lower than 7, which we use for dynamic compression. There’s a throughput cliff starting at brotli-9; Vlad’s explanation is that Brotli consumes lots of memory and thrashes the cache. Nevertheless, AMD wins by a healthy margin.

A lot of our services are written in Go. In the following graphs we’re redoing the crypto and compression tests in Go along with RegExp on 32KB strings and the strings library.
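
The Go tests lean on the standard benchmarking harness (run with go test -bench .). The snippet below shows roughly the shape of the RegExp case; the pattern and the generated 32 KiB input are illustrative stand-ins rather than the exact cases behind the charts.

// regexp_bench_test.go: an illustrative Go benchmark that scans a ~32 KiB
// string with a regular expression. Pattern and input are stand-ins.
package bench

import (
    "regexp"
    "strings"
    "testing"
)

var (
    re    = regexp.MustCompile(`[a-z]+ing`)                              // illustrative pattern
    input = strings.Repeat("benchmarking cloudflare edges ", 32*1024/30) // ~32 KiB of text
)

func BenchmarkRegexpFindAll32K(b *testing.B) {
    b.SetBytes(int64(len(input)))
    for i := 0; i < b.N; i++ {
        if matches := re.FindAllStringIndex(input, -1); len(matches) == 0 {
            b.Fatal("expected matches")
        }
    }
}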

Go Cryptography


Go Compression


Go Regexp


Go Strings


AMD performs better in all of our Go benchmarks except ECDSA P256 Sign, where it loses by 38%. That is peculiar, since in the C test it does 24% better; it’s worth investigating what’s going on there. Other than that, AMD doesn’t win by as large a margin, but it still proves to be better.

LuaJIT

We rely a lot on LuaJIT in our stack. As Vlad said, it’s the glue that holds Cloudflare together. We’re glad to show that AMD wins here as well.


Overall, our tests show a single EPYC 7642 holding its own against two Xeon Platinum 6162s. While there are a couple of tests where AMD loses out, such as OpenSSL AES-128-GCM and Go OpenSSL ECDSA-P256 Sign, AMD wins in all the others. Scanning quickly and treating all tests equally, AMD does on average 25% better than Intel.

Performance Simulations

After our ‘sniff’ tests, we put our servers through another series of emulations which apply synthetic workloads simulating our edge software stack. Here we simulate scenarios with the different types of requests we see in production. Request types vary by asset size, whether they go through HTTP or HTTPS, WAF, Workers, or one of many additional variables. Below is the throughput comparison between the two CPUs for the types of requests we see most typically.


The results above are ratios using Gen 9’s Intel CPUs as the baseline, normalized at 1.0 on the X-axis. For example, looking at simple requests for 10KiB assets over HTTPS, we see that AMD does 1.50x better than Intel in Requests per Second. On average for the tests shown above, AMD performs 34% better than Intel. Considering that the TDP for the single AMD EPYC 7642 is 225W, compared to 300W for the two Intel CPUs, we’re looking at AMD delivering up to 2.0x better Requests per Watt vs. the Intel CPUs (in the best case above, 1.50 x 300W / 225W = 2.0x)!

By this time, we were already leaning heavily toward a single socket setup with AMD EPYC 7642 as our CPU for Gen X. We were excited to see exactly how well AMD EPYC servers would do in production, so we immediately shipped a number of the servers out to some of our data centers.

Live Production

Step one, of course, was to get all our test servers set up for a production environment. All of our machines in the fleet are loaded with the same processes and services, which makes for a great apples-to-apples comparison. Like data centers everywhere, we have multiple generations of servers deployed, and we deploy our servers in clusters such that each cluster is pretty homogeneous by server generation. In some environments this can lead to varying utilization curves between clusters. This is not the case for us: our engineers have optimized CPU utilization across all server generations, so no matter whether a machine’s CPU has 8 cores or 24 cores, CPU usage is generally the same.

CPU utilization: Gen X (AMD) vs. Gen 9 (Intel)

As you can see above, and to illustrate our ‘similar CPU utilization’ comment, there is no significant difference in CPU usage between the Gen X AMD powered servers and the Gen 9 Intel based servers. This means both test and baseline servers are equally loaded. Good. This is exactly what we want to see with our setup, to have a fair comparison. The two graphs below show the comparative number of requests processed at the single-core and all-core (server) level.

Requests processed per single core and per server (all cores)

We see that AMD does on average about 23% more requests. That’s really good! We talked a lot about bringing more muscle in the Gen 9 blog. We have the same number of cores, yet AMD does more work, and does it with less power. Just by looking at the specs for number of cores and TDP in the beginning, it’s really nice to see that AMD also delivers significantly more performance with better power efficiency.

But as we mentioned earlier, TDP isn’t a standardized spec across manufacturers so let’s look at real power usage below. Measuring server power consumption along with requests per second (RPS) yields the graph below:

Requests per second versus server power consumption

Observing our servers’ request rate over their power consumption, the AMD Gen X server performs 28% better. While we could have expected more out of AMD since its TDP is 25% lower, keep in mind that TDP is very ambiguous. In fact, we saw that AMD’s actual power draw ran nearly at spec TDP, with its frequency sustained well above base; Intel’s was far from it. That is another reason why TDP is becoming a less reliable estimate of power draw. Moreover, the CPU is just one component contributing to the overall power of the system. Recall that the Intel CPUs are integrated into a multi-node system, as described in the Gen 9 blog, while AMD is in a regular 1U form-factor machine. That actually doesn’t favor AMD, since multi-node systems are designed for high-density deployments at lower power per node, yet it still outperformed the Intel system on a power per node basis anyway.

Through the majority of comparisons from the datasheets, test simulations, and live production performance, the 1P AMD EPYC 7642 configuration performed significantly better than the 2P Intel Xeon 6162. We’ve seen in some environments that AMD can do up to 36% better in live production and we believe we can achieve that consistently with some optimization on both our hardware and software.

So that’s it. AMD wins.

The additional graphs below show the median and p99 latencies of NGINX processing time (mostly on-CPU) for the two CPUs throughout a 24-hour period. On average, AMD processes requests about 25% faster. At p99, it does about 20-50% better depending on the time of day.

Median and p99 NGINX processing time: Gen X (AMD) vs. Gen 9 (Intel)

Conclusion

Hardware and Performance engineers at Cloudflare do significant research and testing to figure out the best server configuration for our customers. Solving big problems like this is why we love working here, and we’re also helping solve yours with services like serverless edge compute and an array of security solutions such as Magic Transit, Argo Tunnel, and DDoS protection. All of the servers on the Cloudflare Network are designed to make our products work reliably, and we strive to make each new generation of our server design better than its predecessor. We believe the AMD EPYC 7642 is the answer to our Gen X processor question.

With Cloudflare Workers, developers have enjoyed deploying their applications to our Network, which is ever expanding across the globe. We’ve been proud to empower our customers by letting them focus on writing their code while we are managing the security and reliability in the cloud. We are now even more excited to say that their work will be deployed on our Gen X servers powered by 2nd Gen AMD EPYC processors.

Expanding Rome to a data center near you

Thanks to AMD, using the EPYC 7642 allows us to increase our capacity and expand into more cities more easily. Rome wasn’t built in one day, but it will soon be very close to many of you.

In the last couple of years, we’ve been experimenting with many Intel and AMD x86 chips along with ARM CPUs. We look forward to having these CPU manufacturers partner with us for future generations so that together we can help build a better Internet.

Cloudflare’s Gen X: Servers for an Accelerated Future

Post Syndicated from Nitin Rao original https://blog.cloudflare.com/cloudflares-gen-x-servers-for-an-accelerated-future/

“Every server can run every service.”

We designed and built Cloudflare’s network to be able to grow capacity quickly and inexpensively; to allow every server, in every city, to run every service; and to allow us to shift customers and traffic across our network efficiently. We deploy standard, commodity hardware, and our product developers and customers do not need to worry about the underlying servers. Our software automatically manages the deployment and execution of our developers’ code and our customers’ code across our network. Since we manage the execution and prioritization of code running across our network, we are both able to optimize the performance of our highest tier customers and effectively leverage idle capacity across our network.

An alternative approach might have been to run several fragmented networks with specialized servers designed to run specific features, such as the Firewall, DDoS protection or Workers. However, we believe that approach would have resulted in wasted idle resources and given us less flexibility to build new software or adopt the newest available hardware. And a single optimization target means we can provide security and performance at the same time.

We use Anycast to route a web request to the nearest Cloudflare data center (from among 200 cities), improving performance and maximizing the surface area to fight attacks.

Once a datacenter is selected, we use Unimog, Cloudflare’s custom load balancing system, to dynamically balance requests across diverse generations of servers. We load balance at different layers: between cities, between physical deployments located across a city, between external Internet ports, between internal cables, between servers, and even between logical CPU threads within a server.

As demand grows, we can scale out by simply adding new servers, points of presence (PoPs), or cities to the global pool of available resources. If any server component has a hardware failure, it is gracefully de-prioritized or removed from the pool, to be batch repaired by our operations team. This architecture has enabled us to have no dedicated Cloudflare staff at any of the 200 cities, instead relying on help for infrequent physical tasks from the ISPs (or data centers) hosting our equipment.

Gen X: Intel Not Inside

We recently turned up our tenth generation of servers, “Gen X”, already deployed across major US cities, and in the process of being shipped worldwide. Compared with our prior server (Gen 9), it processes as much as 36% more requests while costing substantially less. Additionally, it enables a ~50% decrease in L3 cache miss rate and up to 50% decrease in NGINX p99 latency, powered by a CPU rated at 25% lower TDP (thermal design power) per core.

Notably, for the first time, Intel is not inside. We are not using their hardware for any major server components such as the CPU, board, memory, storage, network interface card (or any type of accelerator). Given how critical Intel is to our industry, this would until recently have been unimaginable, and is in contrast with prior generations which made extensive use of their hardware.

Intel-based Gen 9 server

This time, AMD is inside.

We were particularly impressed by the 2nd Gen AMD EPYC processors because they proved to be far more efficient for our customers’ workloads. Since the pendulum of technology leadership swings back and forth between providers, we wouldn’t be surprised if that changes over time. However, we were happy to adapt quickly to the components that made the most sense for us.

Compute

CPU efficiency is very important to our server design. Since we have a compute-heavy workload, our servers are typically limited by the CPU before other components. Cloudflare’s software stack scales quite well with additional cores. So, we care more about core-count and power-efficiency than dimensions such as clock speed.

We selected the AMD EPYC 7642 processor in a single-socket configuration for Gen X. This CPU has 48-cores (96 threads), a base clock speed of 2.4 GHz, and an L3 cache of 256 MB. While the rated power (225W) may seem high, it is lower than the combined TDP in our Gen 9 servers and we preferred the performance of this CPU over lower power variants. Despite AMD offering a higher core count option with 64-cores, the performance gains for our software stack and usage weren’t compelling enough.

We have deployed the AMD EPYC 7642 in half a dozen Cloudflare data centers; it is considerably more powerful than a dual-socket pair of high-core count Intel processors (Skylake as well as Cascade Lake) we used in the last generation.

Readers of our blog might remember our excitement around ARM processors. We even ported the entirety of our software stack to run on ARM, just as it does on x86, and have been maintaining that port ever since, even though it calls for slightly more work from our software engineering teams. We did this leading up to the launch of Qualcomm’s Centriq server CPU, which was eventually shuttered. While none of the off-the-shelf ARM CPUs available at this moment are interesting to us, we remain optimistic about high core count offerings launching in 2020 and beyond, and look forward to a day when our servers are a mix of x86 (Intel and AMD) and ARM.

We aim to replace servers when the efficiency gains enabled by new equipment outweigh their cost.

The performance we’ve seen from the AMD EPYC 7642 processor has encouraged us to accelerate replacement of multiple generations of Intel-based servers.

Compute is our largest investment in a server. Our heaviest workloads, from the Firewall to Workers (our serverless offering), often require more compute than other server resources. Also, the average size in kilobytes of a web request across our network tends to be small, influenced in part by the relative popularity of APIs and mobile applications. Our approach to server design is very different from that of traditional content delivery networks engineered to deliver large-object video libraries, for whom servers focused on storage might make more sense and for whom re-architecting to offer serverless is prohibitively capital intensive.

Our Gen X server is intentionally designed with an “empty” PCIe slot for a potential add on card, if it can perform some functions more efficiently than the primary CPU. Would that be a GPU, FPGA, SmartNIC, custom ASIC, TPU or something else? We’re intrigued to explore the possibilities.

In accompanying blog posts over the next few days, our hardware engineers will describe how AMD 7642 performed against the benchmarks we care about. We are thankful for their hard work.

Memory, Storage & Network

Since we are typically limited by CPU, Gen X represented an opportunity to grow components such as RAM and SSD more slowly than compute.

For memory, we continued to use 256GB of RAM, as in our prior generation, but rated higher at 2933MHz. For storage, we continue to have ~3TB, but moved to a 3x1TB configuration using NVMe flash (instead of SATA) with more available IOPS and higher endurance, which enables full disk encryption using LUKS without penalty. For the network card, we continue to use a Mellanox 2x25G NIC.

We moved from our multi-node chassis back to a simple 1U form factor, designed to be lighter and less error prone during operational work at the data center. We also added multiple new ODM partners to diversify how we manufacture our equipment and to take advantage of additional global warehousing.

Network Expansion

Our newest generation of servers gives us the flexibility to continue building out our network even closer to every user on Earth. We’re proud of the hard work from across engineering teams on Gen X, and are grateful for the support of our partners. Be on the lookout for more blogs about these servers in the coming days.

Cloudflare Expanded to 200 Cities in 2019

Post Syndicated from Jon Rolfe original https://blog.cloudflare.com/cloudflare-expanded-to-200-cities-in-2019/

We have exciting news: Cloudflare closed out the decade by reaching our 200th city* across 90+ countries. Each new location increases the security, performance, and reliability of the 20-million-plus Internet properties on our network. Over the last quarter, we turned up seven data centers spanning from Chattogram, Bangladesh all the way to the Hawaiian Islands:

  • Chattogram & Dhaka, Bangladesh. These data centers are our first in Bangladesh, ensuring that its 161 million residents will have a better experience on our network.
  • Honolulu, Hawaii, USA. Honolulu is one of the most remote cities in the world; with our Honolulu data center up and running, Hawaiian visitors can be served 2,400 miles closer than ever before! Hawaii is a hub for many submarine cables in the Pacific, meaning that some Pacific Islands will also see significant improvements.
  • Adelaide, Australia. Our 7th Australasian data center can be found “down under” in the capital of South Australia. Despite being Australia’s fifth-largest city, Adelaide is often overlooked for Australian interconnection. We, for one, are happy to establish a presence in it and its unique UTC+9:30 time zone!
  • Thimphu, Bhutan. Bhutan is the sixth SAARC (South Asian Association for Regional Cooperation) country with a Cloudflare network presence. Thimphu is our first Bhutanese data center, continuing our mission of security and performance for all.
  • St George’s, Grenada. Our Grenadian data center is joining the Grenada Internet Exchange (GREX), the first non-profit Internet Exchange (IX) in the English-speaking Caribbean.

We’ve come a long way since our launch in 2010, moving from colocating in key Internet hubs to fanning out across the globe and partnering with local ISPs. This has allowed us to offer security, performance, and reliability to Internet users in all corners of the world. In addition to the 35 cities we added in 2019, we expanded our existing data centers behind-the-scenes. We believe there are a lot of opportunities to harness in 2020 as we look to bring our network and its edge-computing power closer and closer to everyone on the Internet.

*Includes cities where we have data centers with active Internet ports and those where we are configuring our servers to handle traffic for more customers (at the time of publishing).

Good Morning, Jakarta!

Post Syndicated from Chris Chua original https://blog.cloudflare.com/selamat-pagi-jakarta-customers/

Beneath the veneer of glass and concrete, this is a city of surprises and many faces. On 3rd October 2019, we brought together a group of leaders from across a number of industries to connect in Central Jakarta, Indonesia.

The habit of sharing stories at the lunch table, exchanging ideas, listening to the viewpoints of people at every level, paying first-hand attention to all input from customers, and hearing the dreams of some of life’s warriors may sound simple, but it is a source of inspiration and encouragement in helping the cyberspace community in this region.

Our new data center in Jakarta extends our Asia Pacific network to 64 cities, and our global network to 194 cities.

Selamat Pagi

Right on time, Kate Fleming extended a warm welcome to all our Indonesian guests. “We were especially appreciative of the investment of your time that you made coming to join us.”

Kate is the Head of Customer Success for APAC. Australian-born, Kate spent the past five years living in Malaysia and Singapore, and she leads a team of Customer Success Managers in Singapore. The Customer Success team is dispersed across multiple offices and time zones. We are the advocates for Cloudflare Enterprise customers. We help with your onboarding journey and various post-sales activities, from project and resource management planning to training, configuration recommendations, sharing best practices, acting as a point of escalation, and more.


“Today, the Indonesian Cloudflare team would like to share with you some insights and best practices around how Cloudflare is not only a critical part of any organization’s cyber security planning, but is working towards building a better internet in the process.” – Kate


Ayush Verma, our Solutions Engineer for ASEAN and India, was there to unveil the latest cyber security trends. He shared insights on how to stay ahead of the game in a fast-changing online environment.

Get answers to questions like:

  • How can I secure my site without sacrificing performance?
  • What are the latest trends in malicious attacks — and how should I prepare?

Good Morning, Jakarta!

Good Morning, Jakarta!

Superheroes Behind The Scenes

We were very honored to have two industry leaders speak to us.

Jullian Gafar, CTO of PT Viva Media Baru.
PT Viva Media Baru is an online media company based in Jakarta, Indonesia.

Firman Gautama, VP of Infrastructure & Security at PT. Global Tiket Network.
PT. Global Tiket Network offers hotel, flight, car rental, train, world-class event/concert, and attraction tickets.

It was a golden opportunity to hear from the leaders themselves about what’s keeping them busy lately, their own approaches to cyber security, best practices, and easy-to-implement and cost-efficient strategies.  

Fireside Chat Highlights: A shoutout from Pak Firman, who was very pleased with the support he received from Kartika. He said, “Most salespeople are hard to reach after completing a sale. Kartika always goes the extra mile; she stays engaged with me. The customer experience is just exceptional.”

Good Morning, Jakarta!

Our Mission Continues

Thank you for giving us your time to connect. It brings us back to our roots and core mission of helping to build a better Internet. Guided by the principle that “the result never betrays the effort,” we believe that the innovations we are creating today in our services and strategies to improve your business will, in time, produce the best results. For this reason, we offer our endless thanks for your support and loyalty in continuing to push forward with us. Always at your service!

Good Morning, Jakarta!

Cloudflare Event Crew in Indonesia #CloudflareJKT
Chris Chua (Organiser) | Kate Fleming | Bentara Frans | Ayush Verma | Welly Tandiono | Kartika Mulyo  | Riyan Baharudin

Making the Data Center Choice (the Work Has Just Begun)

Post Syndicated from Ahin Thomas original https://www.backblaze.com/blog/picking-right-eu-data-center-partner/

Globe with Europe

It’s Europe week at Backblaze! On Tuesday, we announced the opening of our first European data center. On Wednesday, we discussed the process of sourcing DCs. Yesterday, we focused on how we narrowed the list down to a small group of finalists. And, today, we’ll share how we ultimately decided on our new partner.

Imagine a globe spinning (or simply look at the top of this blog post). When you start out on a data center search, you could consider almost any corner of the globe. For Backblaze, we knew we wanted to find an anchor location in the European Union. For a variety of reasons, we quickly narrowed in on Amsterdam, Brussels and Dublin as the most likely locations. While we were able to generate a list of 40 qualified locations, narrowed it down to ten for physical visits, and then narrowed it yet again to three finalists, the question remained: How would we choose our ultimate partner? Data center searches have changed a lot since 2012 when we circulated our RFP for a previous expansion.

The good news is we knew our top line requirements would be met. Thinking back to the 2×2 that our Chief Cloud Officer, Tim Nufire, had drawn on the board at the early stages of our search, we felt good that we had weighed the tradeoffs appropriately.

EU data center cost risk quadrant
Cost vs risk

Much like hiring an employee, after the screening and the interviews, one runs reference checks. In the case of data centers, that means both validating certain assertions and going into the gory details on certain operational capabilities. For example, in our second post in the EU DC series, we mentioned environmental risks. If one is looking to reduce the probability of catastrophe, making sure that the DC is outside of a flood zone is generally advisable. Of course, the best environmental risk factor reports are much more nuanced and account for changes in the environment.

To help us investigate those sorts of issues, we partnered with PTS Consulting. By engaging with third party experts, we get dispassionate, unbiased, thorough reporting about the locations we are considering. Based on PTS’s reporting, we eliminated one of our finalists. To be clear, there was nothing inherently wrong with the finalist, but it was unlikely that particular location would sustainably meet our long term requirements without significant infrastructure upgrades on their end.

In our prior posts, we mentioned another partner, UpStack. Their platform helped us with the sourcing and narrowing down to a list of finalists. Importantly, their advisory services were crucial in this final stage of diligence. Specifically, UpStack brought in electrical engineering expertise to give us a deep, detailed assessment of the electrical and mechanical single-line diagrams. For those less versed in the details of DC power, that means UpStack was able to examine the reliability and durability of our DCs’ power sources at an incredible level of granularity.

Ultimately, it came down to two finalists:

  • DC 3: Interxion Amsterdam
  • DC 4: The pre-trip favorite

DC 4 had a lot of things going for it. The pricing was the most affordable and the facility had more modern features and functionality. The biggest downsides were open issues around sourcing and training what would become our remote hands team.

Which gets us back to our matrix of tradeoffs. While more expensive than DC 4, the Interxion facility graded out equally well during diligence. Ultimately, the people at Interxion and our confidence in the ability to build out a sturdy remote hands team made the choice of Interxion clear.

Cost vs risk and result
Cost vs risk and result

Looking back at Tim’s 2×2, DC 4 presented as financially more affordable, but operationally a little more risky (since we had questions about our ability to operate effectively on a day-to-day basis).

Interxion, while a little more financially expensive, reduced our operational risks. When thinking of our anchor location in Europe, that felt like the right tradeoff to be making.

Ready, Set, More Work!

The site selection only represented part of the journey. In parallel, our sourcing team has had to learn how to get pods and drives into Europe. Our Tech Ops & Engineering teams have worked through any number of issues around latency, performance, and functionality. Finance & Legal has worked through the implications of having a physical international footprint. And that’s just to name a few things.

Interxion - Backblaze data center floor plan
EU data center floor plan

If you’re in the EU, we’ll be at IBC 2019 in Amsterdam from September 13 to September 17. If you’re interested in making an appointment to chat further, use our form to reserve a time at IBC, or drop by stand 7.D67 at IBC (our friends from Cantemo are hosting us). Or, if you prefer, feel free to leave any questions in the comments below!

The post Making the Data Center Choice (the Work Has Just Begun) appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Cloudflare Global Network Expands to 193 Cities

Post Syndicated from Nitin Rao original https://blog.cloudflare.com/scaling-the-cloudflare-global/

Cloudflare Global Network Expands to 193 Cities

Cloudflare’s global network currently spans 193 cities across 90+ countries. With over 20 million Internet properties on our network, we increase the security, performance, and reliability of large portions of the Internet every time we add a location.

Cloudflare Global Network Expands to 193 Cities

Expanding Network to New Cities

So far in 2019, we’ve added 30 new locations: Amman, Antananarivo*, Arica*, Asunción, Bengaluru, Buffalo, Casablanca, Córdoba*, Cork, Curitiba, Dakar*, Dar es Salaam, Fortaleza, Göteborg, Guatemala City, Hyderabad, Kigali, Kolkata, Male*, Maputo, Nagpur, Neuquén*, Nicosia, Nouméa, Ottawa, Port-au-Prince, Porto Alegre, Querétaro, Ramallah, and Thessaloniki.

Our Humble Beginnings

When Cloudflare launched in 2010, we focused on putting servers at the Internet’s crossroads: large data centers with key connections, like the Amsterdam Internet Exchange and Equinix Ashburn. This not only provided the most value to the most people at once, but was also easier to manage by keeping our servers in the same buildings as all the local ISPs, server providers, and other parties they needed to talk to, which streamlined our services.

This is a great approach for bootstrapping a global network, but we’re obsessed with speed in general. There are over five hundred cities in the world with over one million inhabitants, but only a handful of them have the kinds of major Internet exchanges that we targeted. Our goal as a company is to help make a better Internet for all, not just those lucky enough to live in areas with affordable and easily accessible interconnection points. However, we ran up against two broad, nasty problems: a) we were running out of major Internet exchanges to join, and b) latency still wasn’t as low as we wanted. Clearly, we had to start scaling in new ways.

One of our first big steps was entering into partnerships around the world with local ISPs, who have many of the same problems we do: ISPs want to save money and provide fast Internet to their customers, but they often don’t have a major Internet exchange nearby to connect to. Adding Cloudflare equipment to their infrastructure effectively brought more of the Internet closer to them. We help them speed up millions of Internet properties while reducing costs by serving traffic locally. Additionally, since all of our servers are designed to support all our products, a relatively small physical footprint can also provide security, performance, reliability, and more.

Upgrading Capacity in Existing Cities

Though it may be obvious and easy to overlook, continuing to build out existing locations is also a key facet of building a global network. This year, we have significantly increased the computational capacity at the edge of our network. Additionally, by making it easier to interconnect with Cloudflare, we have increased the number of unique networks directly connected with us to over 8,000. This makes for a faster, more reliable Internet experience for the >1 billion IPs that we see daily.

To make these capacity upgrades possible for our customers, efficient infrastructure deployment has been one of our keys to success. We want our infrastructure deployment to be targeted and flexible.

Targeted Deployment

The next Cloudflare customer through our door could be a small restaurant owner on a Pro plan with thousands of monthly pageviews or a fast-growing global tech company like Discord. As a result, we need to always stay one step ahead and synthesize a lot of data all at once for our customers.

To accommodate this expansion, our Capacity Planning team is learning new ways to optimize our servers. One key strategy is targeting exactly where to send our servers. However, staying on top of everything isn’t easy – we are a global anycast network, which introduces unpredictability as to where incoming traffic goes. To make things even more difficult, each city can contain as many as five distinct deployments. Planning isn’t just a question of what city to send servers to; it’s one of which address.

To make sense of it all, we tackle the problem with simulations. Some, but not all, of the variables we model include historical traffic growth rates, foreseeable anomalous spikes (e.g., Cyber Day in Chile), and consumption estimates from our live deal pipeline, as well as product costs, user growth, and end-customer adoption. We also add in site reliability, potential for expansion, and expected regional expansion and partnerships, as well as strategic priorities and, of course, feedback from our fantastic Systems Reliability Engineers.
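For a rough sense of what such a simulation looks like in practice, here is a toy sketch. It is not Cloudflare’s actual model; the starting traffic, growth rate, and spike odds below are invented purely for illustration.

```python
# Toy capacity-planning sketch; not Cloudflare's actual model. The starting
# traffic, growth rate, and spike parameters are invented for illustration.
import random

def simulate_city_peak(months=12, start_gbps=100.0, monthly_growth=0.05,
                       spike_chance=0.1, spike_multiplier=2.0, trials=1000):
    """Return the 95th-percentile peak traffic across many simulated futures."""
    peaks = []
    for _ in range(trials):
        traffic = peak = start_gbps
        for _ in range(months):
            traffic *= 1 + random.gauss(monthly_growth, 0.02)  # organic growth
            if random.random() < spike_chance:                 # e.g. a Cyber Day spike
                peak = max(peak, traffic * spike_multiplier)
            peak = max(peak, traffic)
        peaks.append(peak)
    peaks.sort()
    return peaks[int(0.95 * len(peaks))]

print(f"Plan this city for roughly {simulate_city_peak():.0f} Gbps of capacity")
```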

Flexible Supply Chain

Knowing where to send a server is only the first of many challenges when it comes to a global network. Just like our user base, our supply chain must span the entire world while staying flexible enough to react quickly to time constraints, pricing changes (including taxes and tariffs), import/export restrictions, required certifications, local partnerships, and many more dynamic location-specific variables. All the more reason to stay quick on our feet: there will always be unforeseen roadblocks and detours, even in the most well-prepared plans. For example, a planned expansion in our Prague location might warrant an expanded presence in Vienna for failover.

Once servers arrive at our data centers, our Data Center Deployment and Technical Operations teams work with our vendors and on-site data center personnel (our “Remote Hands” and “Smart Hands”) to install the physical server, manage the cabling, and handle other early-stage provisioning processes.

Our architecture, which is designed so that every server can support every service, makes it easier to withstand hardware failures and efficiently load balance workloads between equipment and between locations.

Join Our Team

If working at a rapidly expanding, globally diverse company interests you, we’re hiring for scores of positions, including in the Infrastructure group. If you want to help increase hardware efficiency, deploy and maintain servers, work on our supply chain, or strengthen ISP partnerships, get in touch.

*Represents cities where we have data centers with active Internet ports and where we are configuring our servers to handle traffic for more customers (at the time of publishing)

Backblaze Vaults: Zettabyte-Scale Cloud Storage Architecture

Post Syndicated from Brian Beach original https://www.backblaze.com/blog/vault-cloud-storage-architecture/

A lot has changed in the four years since Brian Beach wrote a post announcing Backblaze Vaults, our software architecture for cloud data storage. Just looking at how the major statistics have changed, we now have over 100,000 hard drives in our data centers instead of the 41,000 mentioned in the original post’s video. We have three data centers (soon four) instead of one data center. We’re approaching one exabyte of data stored for our customers (almost seven times the 150 petabytes back then), and we’ve recovered over 41 billion files for our customers, up from the 10 billion in the 2015 post.

In the original post, we discussed having durability of seven nines. Shortly thereafter, it was upped to eight nines. In July of 2018, we took a deep dive into the calculation and found our durability closer to eleven nines (and went into detail on the calculations used to arrive at that number). And, as followers of our Hard Drive Stats reports will be interested in knowing, we’ve just started using our first 16 TB drives, which are twice the size of the biggest drives we used back at the time of this post — then a whopping eight TB.

We’ve updated the details here and there in the text from the original post that was published on our blog on March 11, 2015. We’ve left the original 135 comments intact, although some of them might be non sequiturs after the changes to the post. We trust that you will be able to sort out the old from the new and make sense of what’s changed. If not, please add a comment and we’ll be happy to address your questions.

— Editor

Storage Vaults form the core of Backblaze’s cloud services. Backblaze Vaults are not only incredibly durable, scalable, and performant, but they dramatically improve availability and operability, while still being incredibly cost-efficient at storing data. Back in 2009, we shared the design of the original Storage Pod hardware we developed; here we’ll share the architecture and approach of the cloud storage software that makes up a Backblaze Vault.

Backblaze Vault Architecture for Cloud Storage

The Vault design follows the overriding design principle that Backblaze has always followed: keep it simple. As with the Storage Pods themselves, the new Vault storage software relies on tried and true technologies used in a straightforward way to build a simple, reliable, and inexpensive system.

A Backblaze Vault is the combination of the Backblaze Vault cloud storage software and the Backblaze Storage Pod hardware.

Putting The Intelligence in the Software

Another design principle for Backblaze is to anticipate that all hardware will fail and build intelligence into our cloud storage management software so that customer data is protected from hardware failure. The original Storage Pod systems provided good protection for data and Vaults continue that tradition while adding another layer of protection. In addition to leveraging our low-cost Storage Pods, Vaults take advantage of the cost advantage of consumer-grade hard drives and cleanly handle their common failure modes.

Distributing Data Across 20 Storage Pods

A Backblaze Vault is comprised of 20 Storage Pods, with the data evenly spread across all 20 pods. Each Storage Pod in a given vault has the same number of drives, and the drives are all the same size.

Drives in the same drive position in each of the 20 Storage Pods are grouped together into a storage unit we call a tome. Each file is stored in one tome and is spread out across the tome for reliability and availability.

20 hard drives create 1 tome that share parts of a file.

Every file uploaded to a Vault is divided into pieces before being stored. Each of those pieces is called a shard. Parity shards are computed to add redundancy, so that a file can be fetched from a vault even if some of the pieces are not available.

Each file is stored as 20 shards: 17 data shards and three parity shards. Because those shards are distributed across 20 Storage Pods, the Vault is resilient to the failure of a Storage Pod.

Files can be written to the Vault when one pod is down and still have two parity shards to protect the data. Even in the extreme and unlikely case where three Storage Pods in a Vault lose power, the files in the vault are still available because they can be reconstructed from any of the 17 pods that are available.
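To make the shard arithmetic concrete, here is a minimal sketch of the read and write policy described above. The 17 + 3 layout comes from this post; the function names are ours, not the actual Vault code.

```python
# Minimal sketch of the 17-data + 3-parity availability policy; illustrative
# only, not Backblaze's actual Vault software.
DATA_SHARDS = 17
PARITY_SHARDS = 3
TOTAL_SHARDS = DATA_SHARDS + PARITY_SHARDS  # one shard per Storage Pod

def can_read(available_pods: int) -> bool:
    """A file can be reconstructed from any 17 of its 20 shards."""
    return available_pods >= DATA_SHARDS

def can_write(available_pods: int) -> bool:
    """Writes proceed only while at least two parity shards can be stored,
    i.e. at most one pod in the tome is down."""
    return available_pods >= DATA_SHARDS + 2

for down in range(4):
    up = TOTAL_SHARDS - down
    print(f"{down} pod(s) down: read={can_read(up)}, write={can_write(up)}")
```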

Storing Shards

Each of the drives in a Vault has a standard Linux file system, ext4, on it. This is where the shards are stored. There are fancier file systems out there, but we don’t need them for Vaults. All that is needed is a way to write files to disk and read them back. Ext4 is good at handling power failure on a single drive cleanly without losing any files. It’s also good at storing lots of files on a single drive and providing efficient access to them.

Compared to a conventional RAID, we have swapped the layers here by putting the file systems under the replication. Usually, RAID puts the file system on top of the replication, which means that a file system corruption can lose data. With the file system below the replication, a Vault can recover from a file system corruption because a single corrupt file system can lose at most one shard of each file.

Creating Flexible and Optimized Reed-Solomon Erasure Coding

Just like RAID implementations, the Vault software uses Reed-Solomon erasure coding to create the parity shards. But, unlike Linux software RAID, which offers just one or two parity blocks, our Vault software allows for an arbitrary mix of data and parity. We are currently using 17 data shards plus three parity shards, but this could be changed on new vaults in the future with a simple configuration update.
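As a rough illustration of what that kind of configuration knob might look like (the Vault software and its config format are not public, so everything here beyond the 17 + 3 numbers is hypothetical):

```python
# Hypothetical illustration of a configurable data/parity mix; only the
# 17 + 3 values come from the post, the rest is made up.
from dataclasses import dataclass

@dataclass(frozen=True)
class ErasureConfig:
    data_shards: int = 17
    parity_shards: int = 3

    @property
    def total_shards(self) -> int:
        return self.data_shards + self.parity_shards

current = ErasureConfig()                                # today's vaults
wider = ErasureConfig(data_shards=20, parity_shards=4)   # a possible future vault
print(current.total_shards, wider.total_shards)          # 20, 24
```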

Vault Row of Storage Pods

For Backblaze Vaults, we threw out the Linux RAID software we had been using and wrote a Reed-Solomon implementation from scratch, which we wrote about in Backblaze Open Sources Reed-Solomon Erasure Coding Source Code. It was exciting to be able to use our group theory and matrix algebra from college.

The beauty of Reed-Solomon is that we can then re-create the original file from any 17 of the shards. If one of the original data shards is unavailable, it can be re-computed from the other 16 original shards, plus one of the parity shards. Even if three of the original data shards are not available, they can be re-created from the other 17 data and parity shards. Matrix algebra is awesome!

Handling Drive Failures

The reason for distributing the data across multiple Storage Pods and using erasure coding to compute parity is to keep the data safe and available. How are different failures handled?

If a disk drive just up and dies, refusing to read or write any data, the Vault will continue to work. Data can be written to the other 19 drives in the tome, because the policy setting allows files to be written as long as there are two parity shards. All of the files that were on the dead drive are still available and can be read from the other 19 drives in the tome.

Building a Backblaze Vault Storage Pod

When a dead drive is replaced, the Vault software will automatically populate the new drive with the shards that should be there; they can be recomputed from the contents of the other 19 drives.

A Vault can lose up to three drives in the same tome at the same moment without losing any data, and the contents of the drives will be re-created when the drives are replaced.

Handling Data Corruption

Disk drives try hard to correctly return the data stored on them, but once in a while they return the wrong data, or are just unable to read a given sector.

Every shard stored in a Vault has a checksum, so that the software can tell if it has been corrupted. When that happens, the bad shard is recomputed from the other shards and then re-written to disk. Similarly, if a shard just can’t be read from a drive, it is recomputed and re-written.

Conventional RAID can reconstruct a drive that dies, but does not deal well with corrupted data because it doesn’t checksum the data.
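Here is a minimal sketch of that verify-and-heal idea, assuming a hypothetical reconstruct_shard() helper that performs the Reed-Solomon recovery from the other shards in the tome. The checksum algorithm shown is an assumption; this is not the actual Vault implementation.

```python
# Sketch of checksum-based shard healing; the hash choice and function names
# are assumptions, not Backblaze's actual code.
import hashlib
from pathlib import Path

def shard_is_healthy(path: Path, expected_digest: str) -> bool:
    try:
        return hashlib.sha1(path.read_bytes()).hexdigest() == expected_digest
    except OSError:
        return False  # unreadable sector or dead drive: treat as corrupt

def verify_and_heal(path: Path, expected_digest: str, reconstruct_shard) -> None:
    """If a stored shard fails its checksum (or cannot be read), recompute it
    from the remaining shards in the tome and rewrite it to disk."""
    if not shard_is_healthy(path, expected_digest):
        path.write_bytes(reconstruct_shard(path))
```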

Scaling Horizontally

Each vault is assigned a number. We carefully designed the numbering scheme to allow for a lot of vaults to be deployed, and designed the management software to handle scaling up to that level in the Backblaze data centers.

The overall design scales very well because file uploads (and downloads) go straight to a vault, without having to go through a central point that could become a bottleneck.

There is an authority server that assigns incoming files to specific Vaults. Once that assignment has been made, the client then uploads data directly to the Vault. As the data center scales out and adds more Vaults, the capacity to handle incoming traffic keeps going up. This is horizontal scaling at its best.

We could deploy a new data center with 10,000 Vaults holding 16TB drives and it could accept uploads fast enough to reach its full capacity of 160 exabytes in about two months!
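A quick back-of-the-envelope check of that claim, using only numbers from this post (20 Gbps per Vault, 10,000 Vaults, 160 exabytes of capacity):

```python
# Back-of-the-envelope check of the "about two months" figure.
vaults = 10_000
gbps_per_vault = 20                                       # per-Vault ingest (see below)
ingest_bytes_per_sec = vaults * gbps_per_vault * 1e9 / 8  # ~25 TB/s across the data center
capacity_bytes = 160e18                                   # 160 exabytes

seconds = capacity_bytes / ingest_bytes_per_sec
print(f"{seconds / 86_400:.0f} days to fill")             # ~74 days, i.e. about two months
```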

Backblaze Vault Benefits

The Backblaze Vault architecture has six benefits:

1. Extremely Durable

The Vault architecture is designed for 99.999999% (eight nines) annual durability (now 11 nines — Editor). At cloud-scale, you have to assume hard drives die on a regular basis, and we replace about 10 drives every day. We have published a variety of articles sharing our hard drive failure rates.

The beauty with Vaults is that not only does the software protect against hard drive failures, it also protects against the loss of entire Storage Pods or even entire racks. A single Vault can have three Storage Pods — a full 180 hard drives — die at the exact same moment without a single byte of data being lost or even becoming unavailable.

2. Infinitely Scalable

A Backblaze Vault is comprised of 20 Storage Pods, each with 60 disk drives, for a total of 1200 drives. Depending on the size of the hard drive, each vault will hold:

12TB hard drives => 12.1 petabytes/vault (Deploying today.)
14TB hard drives => 14.2 petabytes/vault (Deploying today.)
16TB hard drives => 16.2 petabytes/vault (Small-scale testing.)
18TB hard drives => 18.2 petabytes/vault (Announced by WD & Toshiba)
20TB hard drives => 20.2 petabytes/vault (Announced by Seagate)

Backblaze Data Center

At our current growth rate, Backblaze deploys one to three Vaults each month. As the growth rate increases, the deployment rate will also increase. We can incrementally add more storage by adding more and more Vaults. Without changing a line of code, the current implementation supports deploying 10,000 Vaults per location. That’s 160 exabytes of data in each location. The implementation also supports up to 1,000 locations, which enables storing a total of 160 zettabytes! (Also known as 160,000,000,000,000 GB.)

3. Always Available

Data backups have always been highly available: if a Storage Pod was in maintenance, the Backblaze online backup application would contact another Storage Pod to store data. Previously, however, if a Storage Pod was unavailable, some restores would pause. For large restores this was not an issue since the software would simply skip the Storage Pod that was unavailable, prepare the rest of the restore, and come back later. However, for individual file restores and remote access via the Backblaze iPhone and Android apps, it became increasingly important to have all data be highly available at all times.

The Backblaze Vault architecture enables both data backups and restores to be highly available.

With the Vault arrangement of 17 data shards plus three parity shards for each file, all of the data is available as long as 17 of the 20 Storage Pods in the Vault are available. This keeps the data available while allowing for normal maintenance and rare expected failures.

4. Highly Performant

The original Backblaze Storage Pods could individually accept 950 Mbps (megabits per second) of data for storage.

The new Vault pods have more overhead, because they must break each file into pieces, distribute the pieces across the local network to the other Storage Pods in the vault, and then write them to disk. In spite of this extra overhead, the Vault is able to achieve 1,000 Mbps of data arriving at each of the 20 pods.

Backblaze Vault Networking

This capacity required a new type of Storage Pod that could handle this volume. The net of this: a single Vault can accept a whopping 20 Gbps of data.

Because there is no central bottleneck, adding more Vaults linearly adds more bandwidth.

5. Operationally Easier

When Backblaze launched in 2008 with a single Storage Pod, many of the operational analyses (e.g. how to balance load) could be done on a simple spreadsheet and manual tasks (e.g. swapping a hard drive) could be done by a single person. As Backblaze grew to nearly 1,000 Storage Pods and over 40,000 hard drives, the systems we developed to streamline and operationalize the cloud storage became more and more advanced. However, because our system relied on Linux RAID, there were certain things we simply could not control.

With the new Vault software, we have direct access to all of the drives and can monitor their individual performance and any indications of upcoming failure. And, when those indications say that maintenance is needed, we can shut down one of the pods in the Vault without interrupting any service.

6. Astoundingly Cost Efficient

Even with all of these wonderful benefits that Backblaze Vaults provide, if they raised costs significantly, it would be nearly impossible for us to deploy them since we are committed to keeping our online backup service affordable for completely unlimited data. However, the Vault architecture is nearly cost neutral while providing all these benefits.

Backblaze Vault Cloud Storage

When we were running on Linux RAID, we used RAID6 over 15 drives: 13 data drives plus two parity. That’s 15.4% storage overhead for parity.

With Backblaze Vaults, we wanted to be able to do maintenance on one pod in a vault and still have it be fully available, both for reading and writing. And, for safety, we weren’t willing to have fewer than two parity shards for every file uploaded. Using 17 data plus three parity drives raises the storage overhead just a little bit, to 17.6%, but still gives us two parity drives even in the infrequent times when one of the pods is in maintenance. In the normal case when all 20 pods in the Vault are running, we have three parity drives, which adds even more reliability.
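Those overhead figures fall straight out of the shard counts:

```python
# Parity overhead arithmetic from the paragraph above.
def parity_overhead(data: int, parity: int) -> float:
    return parity / data

print(f"RAID6 (13 data + 2 parity): {parity_overhead(13, 2):.1%}")  # ~15.4%
print(f"Vault (17 data + 3 parity): {parity_overhead(17, 3):.1%}")  # ~17.6%
```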

Summary

Backblaze’s cloud storage Vaults deliver 99.999999% (eight nines) annual durability (now 11 nines — Editor), horizontal scalability, and 20 Gbps of per-Vault performance, while being operationally efficient and extremely cost effective. Driven from the same mindset that we brought to the storage market with Backblaze Storage Pods, Backblaze Vaults continue our singular focus of building the most cost-efficient cloud storage available anywhere.

•  •  •

Note: This post was updated from the original version posted on March 11, 2015.

The post Backblaze Vaults: Zettabyte-Scale Cloud Storage Architecture appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

These Aren’t Your Ordinary Data Centers

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/these-arent-your-ordinary-data-centers/

Barcelona Supercomputing Center

Many of us would concede that buildings housing data centers are generally pretty ordinary places. They’re often drab and bunker-like with few or no windows, and located in office parks or in rural areas. You usually don’t see signs out front announcing what they are, and, if you’re not in information technology, you might be hard pressed to guess what goes on inside.

If you’re observant, you might notice cooling towers for air conditioning and signs of heavy electrical usage as clues to their purpose. For most people, though, data centers go by unnoticed and out of mind. Data center managers like it that way, because the data stored in and passing through these data centers is the life’s blood of business, research, finance, and our modern, digital-based lives.

That’s why the exceptions to low-key and meh data centers are noteworthy. These unusual centers stand out for their design, their location, what the building was previously used for, or perhaps how they approach energy usage or cooling.

Let’s take a look at a handful of data centers that certainly are outside of the norm.

The Underwater Data Center

Microsoft’s rationale for putting a data center underwater makes sense. Most people live near water, they say, and their submersible data center is quick to deploy, and can take advantage of hydrokinetic energy for power and natural cooling.

Project Natick has produced an experimental, shipping-container-size prototype designed to process data workloads on the seafloor near Scotland’s Orkney Islands. It’s part of a years-long research effort to investigate manufacturing and operating environmentally sustainable, prepackaged datacenter units that can be ordered to size, rapidly deployed, and left to operate independently on the seafloor for years.

Microsoft's Project Natick
Microsoft’s Project Natick at the launch site in the city of Stromness on Orkney Island, Scotland on Sunday May 27, 2018. (Photography by Scott Eklund/Red Box Pictures)
Natick Brest
Microsoft’s Project Natick in Brest, France

The Supercomputing Center in a Former Catholic Church

One might be forgiven for mistaking Torre Girona for any normal church, but this deconsecrated 20th-century church currently houses the Barcelona Supercomputing Center, home of the MareNostrum supercomputer. Part of the Polytechnic University of Catalonia, this supercomputer (whose name is Latin for “our sea,” the Roman name for the Mediterranean) is used for a range of research projects, from climate change to cancer research, biomedicine, weather forecasting, and fusion energy simulations.

Torre Girona, a former Catholic church in Barcelona
Torre Girona, a former Catholic church in Barcelona
The Barcelona Supercomputing Center, home of the MareNostrum supercomputer
The Barcelona Supercomputing Center, home of the MareNostrum supercomputer

The Barcelona Supercomputing Center, home of the MareNostrum supercomputer

The Barcelona Supercomputing Center, home of the MareNostrum supercomputer

The Under-a-Mountain Bond Supervillain Data Center

Most data centers don’t have the extreme protection or history of the Bahnhof Data Center, which is located inside the ultra-secure former nuclear bunker Pionen, in Stockholm, Sweden. It is buried 100 feet below ground inside the White Mountains and secured behind 15.7 in. thick metal doors. It prides itself on its self-described Bond villain ambiance.

We previously wrote about this extraordinary data center in our post, The Challenges of Opening a Data Center — Part 1.

The Bahnhof Data Center under White Mountain in Stockholm, Sweden
The Bahnhof Data Center under White Mountain in Stockholm, Sweden

The Data Center That Can Survive a Category 5 Hurricane

Sometimes the location of the center comes first and the facility is hardened to withstand anticipated threats. Equinix’s NAP of the Americas data center in Miami, one of the largest single-building data centers on the planet (six stories and 750,000 square feet), is built 32 feet above sea level and designed to withstand Category 5 hurricane winds.

The MI1 facility provides access for the Caribbean, South and Central America “to more than 148 countries worldwide,” and is the primary network exchange between Latin America and the U.S., according to Equinix. Any outage in this data center could potentially cripple businesses passing information between these locations.

The center was put to the test in 2017 when Hurricane Irma, a Category 5 hurricane in the Caribbean, made landfall in Florida as a Category 4 storm. The storm caused extensive damage in Miami-Dade County, but the Equinix center survived.

Equinix NAP of the Americas Data Center in Miami
Equinix NAP of the Americas Data Center in Miami

The Data Center Cooled by Glacier Water

Located on Norway’s west coast, the Lefdal Mine Datacenter is built 150 meters into a mountain in what was formerly an underground mine for excavating olivine, also known as the gemstone peridot, a green, high-density mineral used in steel production. The data center is powered exclusively by renewable energy produced locally, while being cooled by water from the second largest fjord in Norway, which is 565 meters deep and fed by the water from four glaciers. As it’s in a mine, the data center is located below sea level, eliminating the need for expensive high-capacity pumps to lift the fjord’s water to the cooling system’s heat exchangers, contributing to the center’s power efficiency.

The Lefdal Mine Data Center in Norway
The Lefdal Mine Datacenter in Norway

The World’s Largest Data Center

The Tahoe Reno 1 data center in The Citadel Campus in Northern Nevada, with 7.2 million square feet of data center space, is the world’s largest data center. It’s not only big, it’s powered by 100% renewable energy with up to 650 megawatts of power.

The Switch Core Campus in Nevada
The Switch Core Campus in Nevada
Tahoe Reno Switch Data Center
Tahoe Reno Switch Data Center

An Out of This World Data Center

If the cloud isn’t far enough above us to satisfy your data needs, Cloud Constellation Corporation plans to put your data into orbit. A constellation of eight low earth orbit (LEO) satellites, called SpaceBelt, will offer up to five petabytes of space-based secure data storage and services and will use laser communication links between the satellites to transmit data between different locations on Earth.

CCC isn’t the only player talking about space-based data centers, but it is the only one so far with $100 million in funding to make their plan a reality.

Cloud Constellation's SpaceBelt
Cloud Constellation’s SpaceBelt

A Cloud Storage Company’s Modest Beginnings

OK, so our current data centers are not that unusual (with the possible exception of our now iconic Storage Pod design), but Backblaze wasn’t always the profitable and growing cloud services company that it is today. There was a time when Backblaze was just getting started, before we had almost an exabyte of customer data storage, when we were figuring out how to make data storage work while keeping costs as low as possible for our customers.

The photo below is not exactly a data center, but it is the first data storage structure used by Backblaze to develop its storage infrastructure before going live with customer data. It was on the patio behind the Palo Alto apartment that Backblaze used for its first office.

Shed used for very early (pre-customer) data storage testing
Shed used for very early (pre-customer) data storage testing

The photos below (front and back) are of the very first data center cabinet that Backblaze filled with customer data. This was in 2009 in San Francisco, and just before we moved to a data center in Oakland where there was room to grow. Note the storage pod at the top of the cabinet. Yes, it’s made out of wood. (You have to start somewhere.)

Backblaze's first data storage cabinet to hold customer data (2009) (front)
Backblaze’s first data storage cabinet to hold customer data (2009) (front)
Backblaze's first data storage cabinet to hold customer data (2009) (back)
Backblaze’s first data storage cabinet to hold customer data (2009) (back)

Do You Know of Other Unusual Data Centers?

Do you know of another data center that should be on this list? Please tell us in the comments.

The post These Aren’t Your Ordinary Data Centers appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

AWS Resources Addressing Argentina’s Personal Data Protection Law and Disposition No. 11/2006

Post Syndicated from Leandro Bennaton original https://aws.amazon.com/blogs/security/aws-and-resources-addressing-argentinas-personal-data-protection-law-and-disposition-no-112006/

We have two new resources to help customers address their data protection requirements in Argentina. These resources specifically address the needs outlined under the Personal Data Protection Law No. 25.326, as supplemented by Regulatory Decree No. 1558/2001 (“PDPL”), including Disposition No. 11/2006. For context, the PDPL is an Argentine federal law that applies to the protection of personal data, including during transfer and processing.

A new webpage focused on data privacy in Argentina features FAQs, helpful links, and whitepapers that provide an overview of PDPL considerations, as well as our security assurance frameworks and international certifications, including ISO 27001, ISO 27017, and ISO 27018. You’ll also find details about our Information Request Report and the high bar of security at AWS data centers.

Additionally, we’ve released a new workbook that offers a detailed mapping as to how customers can operate securely under the Shared Responsibility Model while also aligning with Disposition No. 11/2006. The AWS Disposition 11/2006 Workbook can be downloaded from the Argentina Data Privacy page or directly from this link. Both resources are also available in Spanish from the Privacidad de los datos en Argentina page.

Want more AWS Security news? Follow us on Twitter.

 

Welcome Jack — Data Center Tech

Post Syndicated from Yev original https://www.backblaze.com/blog/welcome-jack-data-center-tech/

As we shoot way past 500 petabytes of data stored, we need a lot of helping hands in the data center to keep those hard drives spinning! We’ve been hiring quite a lot, and our latest addition is Jack. Let’s learn a bit more about him, shall we?

What is your Backblaze Title?
Data Center Tech

Where are you originally from?
Walnut Creek, CA until 7th grade when the family moved to Durango, Colorado.

What attracted you to Backblaze?
I had heard about how cool the Backblaze community is and have always been fascinated by technology.

What do you expect to learn while being at Backblaze?
I expect to learn a lot about how our data centers run and all of the hardware behind it.

Where else have you worked?
Garrhs HVAC as an HVAC Installer and then Durango Electrical as a Low Volt Technician.

Where did you go to school?
Durango High School and then Montana State University.

What’s your dream job?
I would love to be a driver for the Audi Sport. Race cars are so much fun!

Favorite place you’ve traveled?
Iceland has definitely been my favorite so far.

Favorite hobby?
Video games.

Of what achievement are you most proud?
Getting my Eagle Scout badge was a tough, but rewarding experience that I will always cherish.

Star Trek or Star Wars?
Star Wars.

Coke or Pepsi?
Coke…I know, it’s bad.

Favorite food?
Thai food.

Why do you like certain things?
I tend to warm up to things the more time I spend around them, although I never really know until it happens.

Anything else you’d like to tell us?
I’m a friendly car guy who will always be in love with my European cars and I really enjoy the Backblaze community!

We’re happy you joined us Out West! Welcome aboard Jack!

The post Welcome Jack — Data Center Tech appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

C is too low level

Post Syndicated from Robert Graham original https://blog.erratasec.com/2018/05/c-is-too-low-level.html

I’m in danger of contradicting myself, after previously pointing out that x86 machine code is a high-level language, but this article claiming C is not a low-level language is bunk. C certainly has some problems, but it’s still the closest language to assembly. This is evident from the fact that it’s still the fastest compiled language. What we see is a typical academic out of touch with the real world.

The author makes the (wrong) observation that we’ve been stuck emulating the PDP-11 for the past 40 years. C was written for the PDP-11, and since then CPUs have been designed to make C run faster. The author imagines a different world, such as one where CPU designers instead target something like LISP or Erlang as their preferred language. This misunderstands the state of the market. CPUs do indeed support lots of different abstractions, and C has evolved to accommodate this.


The author criticizes things like “out-of-order” execution, which has led to the Spectre side-channel vulnerabilities. Out-of-order execution is necessary to make C run faster. The author claims instead that those resources should be spent on having more, slower CPUs with more threads. This sacrifices single-threaded performance in exchange for a lot more threads executing in parallel. The author cites Sparc Tx CPUs as his ideal processor.

But here’s the thing: the Sparc Tx was a failure. To be fair, it’s mostly a failure because most of the time, people wanted to run old C code instead of new Erlang code. But it was still a failure at running Erlang.

Time after time, engineers keep finding that “out-of-order”, single-threaded performance is still the winner. A good example is ARM processors for both mobile phones and servers. All the theory points to in-order CPUs as being better, but all the products are out-of-order, because this theory is wrong. The custom ARM cores from Apple and Qualcomm used in most high-end phones are so deeply out-of-order they give Intel CPUs competition. The same is true on the server front with the latest Qualcomm Centriq and Cavium ThunderX2 processors, deeply out of order supporting more than 100 instructions in flight.

The Cavium is especially telling. Its ThunderX CPU had 48 simple cores, which it replaced in the ThunderX2 with 32 complex, deeply out-of-order cores. The performance increase was massive, even on multithread-friendly workloads. Every competitor to Intel’s dominance in the server space has learned the lesson from Sparc Tx: many wimpy cores is a failure; you need fewer, beefier cores. Yes, they don’t need to be as beefy as Intel’s processors, but they need to be close.

Even Intel’s “Xeon Phi” custom chip learned this lesson. This is their GPU-like chip, running 60 cores with 512-bit wide “vector” (sic) instructions, designed for supercomputer applications. Its first version was purely in-order; its current version is slightly out-of-order. It supports four threads per core and focuses on basic number crunching, so in-order cores seemed to be the right approach, but Intel found in this case that out-of-order processing still provided a benefit. Practice is different than theory.

As an academic, the author of the above article focuses on abstractions. The criticism of C is that it has the wrong abstractions which are hard to optimize, and that if we instead expressed things in the right abstractions, it would be easier to optimize.

This is an intellectually compelling argument, but so far bunk.

The reason is that while the theoretical base language has issues, everyone programs using extensions to the language, like “intrinsics” (C ‘functions’ that map to assembly instructions). Programmers write libraries using these intrinsics, which then the rest of the normal programmers use. In other words, if your criticism is that C is not itself low level enough, it still provides the best access to low level capabilities.

Given that C can access new functionality in CPUs, CPU designers add new paradigms, from SIMD to transaction processing. In other words, while in the 1980s CPUs were designed to optimize C (stacks, scaled pointers), these days CPUs are designed to optimize tasks regardless of language.

The author of that article criticizes the memory/cache hierarchy, claiming it has problems. Yes, it has problems, but only compared to how well it normally works. The author praises the many simple cores/threads idea as hiding memory latency with little caching, but misses the point that caches also dramatically increase memory bandwidth. Intel processors are optimized to read a whopping 256 bits every clock cycle from L1 cache. Main memory bandwidth is orders of magnitude slower.

The author goes on to criticize cache coherency as a problem. C uses it, but other languages like Erlang don’t need it. But that’s largely due to the problems each language solves. Erlang solves the problem where a large number of threads work on largely independent tasks, needing to send only small messages to each other across threads. The problem C solves is when you need many threads working on a huge, common set of data.

For example, consider the “intrusion prevention system”. Any thread can process any incoming packet that corresponds to any region of memory. There’s no practical way of solving this problem without a huge coherent cache. It doesn’t matter which language or abstractions you use, it’s the fundamental constraint of the problem being solved. RDMA is an important concept that’s moved from supercomputer applications to the data center, such as with memcached. Again, we have the problem of huge quantities (terabytes worth) shared among threads rather than small quantities (kilobytes).

The fundamental issue the author of the paper is ignoring is decreasing marginal returns. Moore’s Law has gifted us more transistors than we can usefully use. We can’t apply those additional transistors to just one thing, because the useful returns we get diminish.

For example, Intel CPUs have two hardware threads per core. That’s because there are good returns by adding a single additional thread. However, the usefulness of adding a third or fourth thread decreases. That’s why many CPUs have only two threads, or sometimes four threads, but no CPU has 16 threads per core.

You can apply the same discussion to any aspect of the CPU, from register count, to SIMD width, to cache size, to out-of-order depth, and so on. Rather than focusing on one of these things and increasing it to the extreme, CPU designers make each a bit larger every process tick that adds more transistors to the chip.

The same applies to cores. It’s why the “more simpler cores” strategy fails, because more cores have their own decreasing marginal returns. Instead of adding cores tied to limited memory bandwidth, it’s better to add more cache. Such cache already increases the size of the cores, so at some point it’s more effective to add a few out-of-order features to each core rather than more cores. And so on.

The question isn’t whether we can change this paradigm and radically redesign CPUs to match some academic’s view of the perfect abstraction. Instead, the goal is to find new uses for those additional transistors. For example, “message passing” is a useful abstraction in languages like Go and Erlang that’s often more useful than sharing memory. It’s implemented with shared memory and atomic instructions, but I can’t help but think it could be done better with direct hardware support.

Of course, as soon as they do that, it’ll become an intrinsic in C, then added to languages like Go and Erlang.

Summary

Academics live in an ideal world of abstractions; the rest of us live in practical reality. The reality is that the vast majority of programmers work with the C family of languages (JavaScript, Go, etc.), whereas academics love the epiphanies they learned using other languages, especially functional languages. CPUs are only superficially designed to run C and maintain “PDP-11 compatibility.” Instead, they keep adding features to support other abstractions, abstractions that remain available to C. They are driven by decreasing marginal returns: designers would love to add new abstractions to the hardware because it’s a cheap way to make use of additional transistors. Academics are wrong to believe that the entire system needs to be redesigned from scratch. Instead, they just need to come up with new abstractions CPU designers can add.

Welcome Josh — Data Center Technician

Post Syndicated from Yev original https://www.backblaze.com/blog/welcome-josh-datacenter-technician/

The Backblaze production team is growing and that means the data center is increasingly gaining some new faces. One of the newest to join the team is Josh! Let’s learn a bit more about Josh, shall we?

What is your Backblaze Title?
I’m a Data Center Technician in the Sacramento area.

Where are you originally from?
I lived all over the California central valley growing up.

What attracted you to Backblaze?
Backblaze is the best of a few worlds — cool startup meets professional DIYers meets transparent tech company (a rare thing).

What do you expect to learn while being at Backblaze?
I expect to learn about Data Center operations, and continue to develop the Linux skills that landed me here.

Favorite hobby?
Building and playing with new and useful toys.

Star Trek or Star Wars?
Darmok and Jalad at Tanagra.

Coke or Pepsi?
Good Beer.

Favorite food?
Tacos. No, burgers. No, it’s sushi. No, gyros. I can’t choose.

Why do you like certain things?
I like things that I can take apart and rebuild and turn every knob and adjust every piece. It means there’s a lot to learn, and I definitely like that.

Darmok and Jalad on the ocean! Welcome aboard Josh 😀

The post Welcome Josh — Data Center Technician appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Hard Drive Stats for Q1 2018

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/hard-drive-stats-for-q1-2018/

Backblaze Drive Stats Q1 2018

As of March 31, 2018 we had 100,110 spinning hard drives. Of that number, there were 1,922 boot drives and 98,188 data drives. This review looks at the quarterly and lifetime statistics for the data drive models in operation in our data centers. We’ll also take a look at why we are collecting and reporting 10 new SMART attributes and take a sneak peek at some 8 TB Toshiba drives. Along the way, we’ll share observations and insights on the data presented and we look forward to you doing the same in the comments.

Background

Since April 2013, Backblaze has recorded and saved daily hard drive statistics from the drives in our data centers. Each entry consists of the date, manufacturer, model, serial number, status (operational or failed), and all of the SMART attributes reported by that drive. Currently there are about 97 million entries totaling 26 GB of data. You can download this data from our website if you want to do your own research, but for starters here’s what we found.
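If you do download the data, here is a rough sketch of getting started with pandas, assuming the one-row-per-drive-per-day CSV layout described above; adjust the path and column names to match the files you actually download.

```python
# Rough sketch of loading the daily drive-stats snapshots; the path and
# column names are assumptions to be checked against the downloaded files.
import glob
import pandas as pd

frames = [pd.read_csv(f) for f in glob.glob("drive_stats_2018_Q1/*.csv")]
df = pd.concat(frames, ignore_index=True)

drive_days = df.groupby("model").size()           # one row = one drive day
failures = df.groupby("model")["failure"].sum()   # failure flag is 1 on the day a drive dies
print(pd.DataFrame({"drive_days": drive_days, "failures": failures}).head())
```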

Hard Drive Reliability Statistics for Q1 2018

At the end of Q1 2018 Backblaze was monitoring 98,188 hard drives used to store data. For our evaluation below we remove from consideration those drives which were used for testing purposes and those drive models for which we did not have at least 45 drives. This leaves us with 98,046 hard drives. The table below covers just Q1 2018.

Q1 2018 Hard Drive Failure Rates

Notes and Observations

If a drive model has a failure rate of 0%, it only means there were no drive failures of that model during Q1 2018.

The overall Annualized Failure Rate (AFR) for Q1 is just 1.2%, well below the Q4 2017 AFR of 1.65%. Remember that quarterly failure rates can be volatile, especially for models that have a small number of drives and/or a small number of Drive Days.
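For reference, the annualized failure rate follows directly from failures and drive days. Here is that calculation sketched in Python (our description of the published formula, not production code):

```python
# Annualized failure rate from failures and drive days.
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    drive_years = drive_days / 365.0
    return failures / drive_years * 100.0   # expressed as a percentage

# Hypothetical example: 60 failures over ~1.8 million drive days is about 1.2% AFR.
print(f"{annualized_failure_rate(60, 1_825_000):.2f}%")
```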

There were 142 drives (98,188 minus 98,046) that were not included in the list above because we did not have at least 45 of a given drive model. We use 45 drives of the same model as the minimum number when we report quarterly, yearly, and lifetime drive statistics.

Welcome Toshiba 8TB drives, almost…

We mentioned Toshiba 8 TB drives in the first paragraph, but they don’t show up in the Q1 Stats chart. What gives? We only had 20 of the Toshiba 8 TB drives in operation in Q1, so they were excluded from the chart. Why do we have only 20 drives? When we test out a new drive model we start with the “tome test” and it takes 20 drives to fill one tome. A tome is the same drive model in the same logical position in each of the 20 Storage Pods that make up a Backblaze Vault. There are 60 tomes in each vault.

In this test, we created a Backblaze Vault of 8 TB drives, with 59 of the tomes being Seagate 8 TB drives and 1 tome being the Toshiba drives. Then we monitored the performance of the vault and its member tomes to see if, in this case, the Toshiba drives performed as expected.

Q1 2018 Hard Drive Failure Rate — Toshiba 8TB

So far the Toshiba drive is performing fine, but they have been in place for only 20 days. Next up is the “pod test” where we fill a Storage Pod with Toshiba drives and integrate it into a Backblaze Vault comprised of like-sized drives. We hope to have a better look at the Toshiba 8 TB drives in our Q2 report — stay tuned.

Lifetime Hard Drive Reliability Statistics

While the quarterly chart presented earlier gets a lot of interest, the real test of any drive model is over time. Below is the lifetime failure rate chart for all the hard drive models which have 45 or more drives in operation as of March 31st, 2018. For each model, we compute their reliability starting from when they were first installed.

Lifetime Hard Drive Failure Rates

Notes and Observations

The failure rates of all of the larger drives (8-, 10- and 12 TB) are very good, 1.2% AFR (Annualized Failure Rate) or less. Many of these drives were deployed in the last year, so there is some volatility in the data, but you can use the Confidence Interval to get a sense of the failure percentage range.

The overall failure rate of 1.84% is the lowest we have ever achieved, besting the previous low of 2.00% from the end of 2017.

Our regular readers and drive stats wonks may have noticed a sizable jump in the number of HGST 8 TB drives (model: HUH728080ALE600), from 45 last quarter to 1,045 this quarter. As the 10 TB and 12 TB drives become more available, the price per terabyte of the 8 TB drives has gone down. This presented an opportunity to purchase the HGST drives at a price in line with our budget.

We purchased and placed into service the 45 original HGST 8 TB drives in Q2 of 2015. They were our first Helium-filled drives and our only ones until the 10 TB and 12 TB Seagate drives arrived in Q3 2017. We’ll take a first look into whether or not Helium makes a difference in drive failure rates in an upcoming blog post.

New SMART Attributes

If you have previously worked with the hard drive stats data or plan to, you’ll notice that we added 10 more columns of data starting in 2018. There are 5 new SMART attributes we are tracking, each with a raw and normalized value:

  • 177 – Wear Range Delta
  • 179 – Used Reserved Block Count Total
  • 181 – Program Fail Count Total or Non-4K Aligned Access Count
  • 182 – Erase Fail Count
  • 235 – Good Block Count AND System(Free) Block Count

All five of these attributes relate to SSDs.

Yes, SSDs. Before you jump to any conclusions, though: we used 10 Samsung 850 EVO SSDs as boot drives for a period of time in Q1. This was an experiment to see if we could reduce boot-up time for the Storage Pods. In our case, the improved boot-up speed wasn’t worth the SSD cost, but it did add 10 new columns to the hard drive stats data.
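If you’re exploring the data yourself, those new columns make it easy to spot which rows belong to the boot SSDs, since the attributes stay empty for spinning drives. A small sketch, assuming the data set’s usual smart_<id>_raw column naming (the daily file name below is hypothetical):

```python
import pandas as pd

# Raw columns for the five new attributes; naming follows the data set's
# smart_<id>_raw / smart_<id>_normalized convention.
new_cols = [f"smart_{n}_raw" for n in (177, 179, 181, 182, 235)]

day = pd.read_csv("2018-03-31.csv")   # hypothetical daily snapshot file

# Rows where any of the new attributes is populated are the SSD boot drives.
is_ssd = day[new_cols].notna().any(axis=1)
print(day.loc[is_ssd, "model"].value_counts())
```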

Speaking of hard drive stats data, the complete data set used to create the information in this review is available on our Hard Drive Test Data page. You can download and use this data for free for your own purposes; all we ask is three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone. It is free.

If you just want the summarized data used to create the tables and charts in this blog post, you can download the ZIP file containing the MS Excel spreadsheet.

Good luck and let us know if you find anything interesting.

[Ed: 5/1/2018 – Updated Lifetime chart to fix error in confidence interval for HGST 4TB drive, model: HDS5C4040ALE630]

The post Hard Drive Stats for Q1 2018 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

A Day in the Life of Michele, Human Resources Coordinator at Backblaze

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/day-in-life-human-resources-coordinator/

Michele, HR Coordinator at Backblaze

Most of the time this blog is dedicated to cloud storage and computer backup topics, but we also want our readers to understand the culture and people at Backblaze who all contribute to keeping our company running and making it an enjoyable place to work. We invited our HR Coordinator, Michele, to talk about how she spends her day searching for great candidates to fill employment positions at Backblaze.

What’s a Typical Day for Michele at Backblaze?

After I’ve had a yummy cup of coffee (maybe with honey and a splash of half-and-half), I’ll generally start my day reviewing resumes and contacting potential candidates to set up an initial phone screen.

When I start the process of filling a position, I’ll spend a lot of time on the phone speaking with potential candidates. During a phone screen we’ll chat about their experience, their background, and what they’re ideally looking for in their next position. I also ask what they like to do outside of work and, most importantly, how they feel about office dogs. A candidate may not always look great on paper, but they could turn out to be a great cultural fit once we talk about their previous experience and what they’re passionate about.

Next, I move strong candidates on to the subsequent steps with the hiring managers, which range from setting up a second phone screen, to arranging a Google Hangout for completing coding tasks, to scheduling in-person interviews with the team.

At the end of the day after an in-person interview, I’ll check in with all the interviewers to debrief and decide how to proceed with the candidate. Everyone who interviewed the candidate gets together to give feedback. Is there a good cultural fit? Are they someone we’d like to work with? Keeping in contact with candidates throughout the process, and making sure everything stays organized and everyone stays informed, is a big part of my job. No one likes to wait around and wonder where they are in the process.

In between all the madness, I’ll put together offer letters, send out onboarding paperwork and links, and get all the necessary signatures to move forward.

On the candidate’s first day, I’ll go over benefits and the handbook and make sure everything is going smoothly in their overall orientation as they transition into their new role here at Backblaze!

What Makes Your Job Exciting?

  • I get to speak with many different types of people and see what makes them tick and if they’d be a good fit at Backblaze
  • The fast pace of the job
  • Being constantly kept busy with different tasks including supporting the FUN committee by researching venues and ideas for family day and the holiday party
  • I work on enjoyable projects like creating a people wall for new hires so we are able to put a face to the name
  • Getting to take a mini road trip up to Sacramento each month to check in with the data center employees
  • Constantly learning more and more about the job, the people, and the company

We’re growing rapidly and always looking for great people to join our team at Backblaze. Our team places a premium on open communications, being cleverly unconventional, and helping each other out.

Oh! We also offer competitive salaries, stock options, and amazing benefits.

Which Job Openings are You Currently Trying to Fill?

We are currently looking for the following positions. If you’re interested, please review the job description on our jobs page and then contact me at jobscontact@backblaze.com.

  • Engineering Director
  • Senior Java Engineer
  • Senior Software Engineer
  • Desktop and Laptop Windows Client Programmer
  • Senior Systems Administrator
  • Sales Development Representative

Thanks Michele!

The post A Day in the Life of Michele, Human Resources Coordinator at Backblaze appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Ransomware Update: Viruses Targeting Business IT Servers

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/ransomware-update-viruses-targeting-business-it-servers/

Ransomware warning message on computer

As ransomware attacks have grown in number in recent months, the tactics and attack vectors also have evolved. While the primary method of attack used to be to target individual computer users within organizations with phishing emails and infected attachments, we’re increasingly seeing attacks that target weaknesses in businesses’ IT infrastructure.

How Ransomware Attacks Typically Work

In our previous posts on ransomware, we described the common vehicles hackers use to infect organizations with ransomware. Most often, trojan downloaders are distributed through malicious downloads and spam emails carrying a variety of file attachments which, if opened, download and run one of the many ransomware variants. Once a user’s computer is infected with a malicious downloader, it retrieves additional malware, which frequently includes crypto-ransomware. After the files have been encrypted, the victim is asked to pay a ransom to decrypt them.

What’s Changed With the Latest Ransomware Attacks?

In 2016, a customized ransomware strain called SamSam began attacking servers, primarily at health care institutions. SamSam, unlike more conventional ransomware, is not delivered through downloads or phishing emails. Instead, the attackers behind SamSam use tools to identify unpatched servers running Red Hat’s JBoss enterprise products. Once the attackers have gained entry into one of these servers by exploiting vulnerabilities in JBoss, they use other freely available tools and scripts to collect credentials and gather information on networked computers. Then they deploy their ransomware to encrypt files on these systems before demanding a ransom. Gaining entry to an organization through its IT infrastructure rather than its endpoints makes this approach scalable and especially unsettling.

SamSam’s methodology is to scour the Internet for accessible and vulnerable JBoss application servers, especially ones used by hospitals. It’s not unlike a burglar rattling doorknobs in a neighborhood to find unlocked homes. When SamSam finds an unlocked home (an unpatched server), the software infiltrates the system. It is then free to spread across the company’s network by stealing passwords. As it traverses the network and its systems, it encrypts files, preventing access until the victims pay the hackers a ransom, typically between $10,000 and $15,000. The low ransom amount has encouraged some victimized organizations to pay rather than incur the downtime required to wipe and reinitialize their IT systems.

The success of SamSam is due to its effectiveness rather than its sophistication. SamSam can enter and traverse a network without human intervention. Some organizations are learning too late that securing the internet-facing services in their data center from attack is just as important as securing endpoints.

The typical steps in a SamSam ransomware attack are:

  1. Attackers gain access to a vulnerable server. They exploit vulnerable software or weak/stolen credentials.
  2. The attack spreads via remote access tools. Attackers harvest credentials, create SOCKS proxies to tunnel traffic, and abuse RDP to install SamSam on more computers in the network.
  3. The ransomware payload is deployed. Attackers run batch scripts to execute the ransomware on compromised machines.
  4. A ransom demand is delivered, requiring payment to decrypt the files. Demand amounts vary from victim to victim; relatively low amounts appear to be designed to encourage quick payment decisions.

What all the organizations successfully exploited by SamSam have in common is that they were running unpatched servers that left them vulnerable. Some had their endpoints and servers backed up, while others did not. Some of those without backups they could use to recover their systems chose to pay the ransom.

Timeline of SamSam History and Exploits

Since its appearance in 2016, SamSam has been in the news with many successful incursions into healthcare, business, and government institutions.

March 2016: SamSam appears

• SamSam campaign targets vulnerable JBoss servers. Attackers home in on healthcare organizations specifically, as they’re more likely to have unpatched JBoss machines.

April 2016: SamSam finds new targets

• SamSam begins targeting schools and government. After initial success against healthcare, attackers branch out to other sectors.

April 2017: New tactics include RDP

• Attackers shift to targeting organizations with exposed RDP connections, while maintaining their focus on healthcare.
• An attack on Erie County Medical Center costs the hospital $10 million over three months of recovery.

January 2018: Municipalities attacked

• Attack on the Municipality of Farmington, NM.
• Attack on Hancock Health.
• Attack on Adams Memorial Hospital.
• Attack on Allscripts (Electronic Health Records), which serves 180,000 physicians, 2,500 hospitals, and 7.2 million patients’ health records.

February 2018: Attack volume increases

• Attack on Davidson County, NC.
• Attack on the Colorado Department of Transportation.

March 2018: SamSam shuts down Atlanta

• Second attack on the Colorado Department of Transportation.
• The City of Atlanta suffers a devastating attack by SamSam, with far-reaching impacts: crippling the court system, keeping residents from paying their water bills, limiting vital communications like sewer infrastructure requests, and pushing the Atlanta Police Department to file paper reports.
• A SamSam campaign nets $325,000 in four weeks as infections spike with new campaigns. Healthcare and government organizations are once again the primary targets.

How to Defend Against SamSam and Other Ransomware Attacks

The best way to respond to a ransomware attack is to avoid having one in the first place. Beyond that, making sure your valuable data is backed up and unreachable by a ransomware infection will keep your downtime and data loss to a minimum, or avoid them entirely, if you are ever attacked.

In our previous post, How to Recover From Ransomware, we listed the ten ways to protect your organization from ransomware.

  1. Use anti-virus and anti-malware software or other security policies to block known payloads from launching.
  2. Make frequent, comprehensive backups of all important files and isolate them from local and open networks. In a recent survey, 74% of cybersecurity professionals viewed data backup and recovery as by far the most effective way to respond to a successful ransomware attack.
  3. Keep offline backups of data stored in locations inaccessible from any potentially infected computer, such as disconnected external storage drives or the cloud, which prevents them from being reached by the ransomware (see the backup sketch after this list).
  4. Install the latest security updates issued by software vendors of your OS and applications. Remember to patch early and patch often to close known vulnerabilities in operating systems, server software, browsers, and web plugins.
  5. Consider deploying security software to protect endpoints, email servers, and network systems from infection.
  6. Exercise cyber hygiene, such as using caution when opening email attachments and links.
  7. Segment your networks to keep critical computers isolated and to prevent the spread of malware in case of attack. Turn off unneeded network shares.
  8. Turn off admin rights for users who don’t require them. Give users the lowest system permissions they need to do their work.
  9. Restrict write permissions on file servers as much as possible.
  10. Educate yourself, your employees, and your family in best practices to keep malware out of your systems. Update everyone on the latest email phishing scams and social engineering techniques aimed at turning victims into abettors.
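As a concrete illustration of point 3, here is a minimal sketch of a versioned backup: each run copies the source into a new timestamped folder and verifies the copies by checksum, so an encrypted file synced today cannot silently overwrite yesterday’s good version. It is an illustration only, not a replacement for a proper backup product, and the paths are hypothetical; the key point is that the destination should be disconnected (or in cloud storage with versioning) between runs.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256(path: Path) -> str:
    """Checksum a file so the copy can be verified independently of timestamps."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def snapshot(source: Path, destination: Path) -> Path:
    """Copy `source` into a new timestamped folder under `destination` and verify it.

    A fresh folder per run preserves older versions; the destination should live on
    media that is disconnected (or in versioned cloud storage) between runs.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = destination / stamp
    shutil.copytree(source, target)

    # Verify every copied file against the original before trusting the snapshot.
    for original in source.rglob("*"):
        if original.is_file():
            copy = target / original.relative_to(source)
            if sha256(original) != sha256(copy):
                raise RuntimeError(f"Verification failed for {copy}")
    return target


# Example (hypothetical paths): snapshot(Path("/srv/finance"), Path("/mnt/usb_backup"))
```

Rotating between two or more destination drives, and keeping at least one offline at all times, gets you most of the way to the “inaccessible from any potentially infected computer” requirement above.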

Please Tell Us About Your Experiences with Ransomware

Have you endured a ransomware attack or have a strategy to avoid becoming a victim? Please tell us of your experiences in the comments.

The post Ransomware Update: Viruses Targeting Business IT Servers appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.