This report is six months old, and I don’t know anything about the organization that produced it, but it has some alarming data about router security.
Conclusion: Our analysis showed that Linux is the most used OS running on more than 90% of the devices. However, many routers are powered by very old versions of Linux. Most devices are still powered with a 2.6 Linux kernel, which is no longer maintained for many years. This leads to a high number of critical and high severity CVEs affecting these devices.
Since Linux is the most used OS, exploit mitigation techniques could be enabled very easily. Anyhow, they are used quite rarely by most vendors except the NX feature.
A published private key provides no security at all. Nonetheless, all but one vendor spread several private keys in almost all firmware images.
Mirai used hard-coded login credentials to infect thousands of embedded devices in the last years. However, hard-coded credentials can be found in many of the devices and some of them are well known or at least easy crackable.
However, we can tell for sure that the vendors prioritize security differently. AVM does better job than the other vendors regarding most aspects. ASUS and Netgear do a better job in some aspects than D-Link, Linksys, TP-Link and Zyxel.
Additionally, our evaluation showed that large scale automated security analysis of embedded devices is possible today utilizing just open source software. To sum it up, our analysis shows that there is no router without flaws and there is no vendor who does a perfect job regarding all security aspects. Much more effort is needed to make home routers as secure as current desktop of server systems.
One-third ship with Linux kernel version 2.6.36 was released in October 2010. You can walk into a store today and buy a brand new router powered by software that’s almost 10 years out of date! This outdated version of the Linux kernel has 233 known security vulnerabilities registered in the Common Vulnerability and Exposures (CVE) database. The average router contains 26 critically-rated security vulnerabilities, according to the study.
We know the reasons for this. Most routers are designed offshore, by third parties, and then private labeled and sold by the vendors you’ve heard of. Engineering teams come together, design and build the router, and then disperse. There’s often no one around to write patches, and most of the time router firmware isn’t even patchable. The way to update your home router is to throw it away and buy a new one.
And this paper demonstrates that even the new ones aren’t likely to be secure.
Apple did not document the changes but Groß said he fiddled around with the newest iOS 14 and found that Apple shipped a “significant refactoring of iMessage processing” that severely cripples the usual ways exploits are chained together for zero-click attacks.
Groß notes that memory corruption based zero-click exploits typically require exploitation of multiple vulnerabilities to create exploit chains. In most observed attacks, these could include a memory corruption vulnerability, reachable without user interaction and ideally without triggering any user notifications; a way to break ASLR remotely; a way to turn the vulnerability into remote code execution;; and a way to break out of any sandbox, typically by exploiting a separate vulnerability in another operating system component (e.g. a userspace service or the kernel).
Evil Clippy is a tool for creating malicious Microsoft Office macros:
At BlackHat Asia we released Evil Clippy, a tool which assists red teamers and security testers in creating malicious MS Office documents. Amongst others, Evil Clippy can hide VBA macros, stomp VBA code (via p-code) and confuse popular macro analysis tools. It runs on Linux, OSX and Windows.
The VBA stomping is the most powerful feature, because it gets around antivirus programs:
VBA stomping abuses a feature which is not officially documented: the undocumented PerformanceCache part of each module stream contains compiled pseudo-code (p-code) for the VBA engine. If the MS Office version specified in the _VBA_PROJECT stream matches the MS Office version of the host program (Word or Excel) then the VBA source code in the module stream is ignored and the p-code is executed instead.
In summary: if we know the version of MS Office of a target system (e.g. Office 2016, 32 bit), we can replace our malicious VBA source code with fake code, while the malicious code will still get executed via p-code. In the meantime, any tool analyzing the VBA source code (such as antivirus) is completely fooled.
To better understand influence attacks, we proposed an approach that models democracy itself as an information system and explains how democracies are vulnerable to certain forms of information attacks that autocracies naturally resist. Our model combines ideas from both international security and computer security, avoiding the limitations of both in explaining how influence attacks may damage democracy as a whole.
Our initial account is necessarily limited. Building a truly comprehensive understanding of democracy as an information system will be a Herculean labor, involving the collective endeavors of political scientists and theorists, computer scientists, scholars of complexity, and others.
In this short paper, we undertake a more modest task: providing policy advice to improve the resilience of democracy against these attacks. Specifically, we can show how policy makers not only need to think about how to strengthen systems against attacks, but also need to consider how these efforts intersect with public beliefs — or common political knowledge — about these systems, since public beliefs may themselves be an important vector for attacks.
In democracies, many important political decisions are taken by ordinary citizens (typically, in electoral democracies, by voting for political representatives). This means that citizens need to have some shared understandings about their political system, and that the society needs some means of generating shared information regarding who their citizens are and what they want. We call this common political knowledge, and it is largely generated through mechanisms of social aggregation (and the institutions that implement them), such as voting, censuses, and the like. These are imperfect mechanisms, but essential to the proper functioning of democracy. They are often compromised or non-existent in autocratic regimes, since they are potentially threatening to the rulers.
In modern democracies, the most important such mechanism is voting, which aggregates citizens’ choices over competing parties and politicians to determine who is to control executive power for a limited period. Another important mechanism is the census process, which play an important role in the US and in other democracies, in providing broad information about the population, in shaping the electoral system (through the allocation of seats in the House of Representatives), and in policy making (through the allocation of government spending and resources). Of lesser import are public commenting processes, through which individuals and interest groups can comment on significant public policy and regulatory decisions.
All of these systems are vulnerable to attack. Elections are vulnerable to a variety of illegal manipulations, including vote rigging. However, many kinds of manipulation are currently legal in the US, including many forms of gerrymandering, gimmicking voting time, allocating polling booths and resources so as to advantage or disadvantage particular populations, imposing onerous registration and identity requirements, and so on.
Censuses may be manipulated through the provision of bogus information or, more plausibly, through the skewing of policy or resources so that some populations are undercounted. Many of the political battles over the census over the past few decades have been waged over whether the census should undertake statistical measures to counter undersampling bias for populations who are statistically less likely to return census forms, such as minorities and undocumented immigrants. Current efforts to include a question about immigration status may make it less likely that undocumented or recent immigrants will return completed forms.
Finally, public commenting systems too are vulnerable to attacks intended to misrepresent the support for or opposition to specific proposals, including the formation of astroturf (artificial grassroots) groups and the misuse of fake or stolen identities in large-scale mail, fax, email or online commenting systems.
All these attacks are relatively well understood, even if policy choices might be improved by a better understanding of their relationship to shared political knowledge. For example, some voting ID requirements are rationalized through appeals to security concerns about voter fraud. While political scientists have suggested that these concerns are largely unwarranted, we currently lack a framework for evaluating the trade-offs, if any. Computer security concepts such as confidentiality, integrity, and availability could be combined with findings from political science and political theory to provide such a framework.
Even so, the relationship between social aggregation institutions and public beliefs is far less well understood by policy makers. Even when social aggregation mechanisms and institutions are robust against direct attacks, they may be vulnerable to more indirect attacks aimed at destabilizing public beliefs about them.
Democratic societies are vulnerable to (at least) two kinds of knowledge attacks that autocratic societies are not. First are flooding attacks that create confusion among citizens about what other citizens believe, making it far more difficult for them to organize among themselves. Second are confidence attacks. These attempt to undermine public confidence in the institutions of social aggregation, so that their results are no longer broadly accepted as legitimate representations of the citizenry.
Most obviously, democracies will function poorly when citizens do not believe that voting is fair. This makes democracies vulnerable to attacks aimed at destabilizing public confidence in voting institutions. For example, some of Russia’s hacking efforts against the 2016 presidential election were designed to undermine citizens’ confidence in the result. Russian hacking attacks against Ukraine, which targeted the systems through which election results were reported out, were intended to create confusion among voters about what the outcome actually was. Similarly, the “Guccifer 2.0” hacking identity, which has been attributed to Russian military intelligence, sought to suggest that the US electoral system had been compromised by the Democrats in the days immediately before the presidential vote. If, as expected, Donald Trump had lost the election, these claims could have been combined with the actual evidence of hacking to create the appearance that the election was fundamentally compromised.
Similar attacks against the perception of fairness are likely to be employed against the 2020 US census. Should efforts to include a citizenship question fail, some political actors who are disadvantaged by demographic changes such as increases in foreign-born residents and population shift from rural to urban and suburban areas will mount an effort to delegitimize the census results. Again, the genuine problems with the census, which include not only the citizenship question controversy but also serious underfunding, may help to bolster these efforts.
Mechanisms that allow interested actors and ordinary members of the public to comment on proposed policies are similarly vulnerable. For example, the Federal Communication Commission (FCC) announced in 2017 that it was proposing to repeal its net neutrality ruling. Interest groups backing the FCC rollback correctly anticipated a widespread backlash from a politically active coalition of net neutrality supporters. The result was warfare through public commenting. More than 22 million comments were filed, most of which appeared to be either automatically generated or form letters. Millions of these comments were apparently fake, and attached unsuspecting people’s names and email addresses to comments supporting the FCC’s repeal efforts. The vast majority of comments that were not either form letters or automatically generated opposed the FCC’s proposed ruling. The furor around the commenting process was magnified by claims from inside the FCC (later discredited) that the commenting process had also been subjected to a cyberattack.
We do not yet know the identity and motives of the actors behind the flood of fake comments, although the New York State Attorney-General’s office has issued subpoenas for records from a variety of lobbying and advocacy organizations. However, by demonstrating that the commenting process was readily manipulated, the attack made it less likely that the apparently genuine comments of those opposing the FCC’s proposed ruling would be treated as useful evidence of what the public believed. The furor over purported cyberattacks, and the FCC’s unwillingness itself to investigate the attack, have further undermined confidence in an online commenting system that was intended to make the FCC more open to the US public.
We do not know nearly enough about how democracies function as information systems. Generating a better understanding is itself a major policy challenge, which will require substantial resources and, even more importantly, common understandings and shared efforts across a variety of fields of knowledge that currently don’t really engage with each other.
However, even this basic sketch of democracy’s informational aspects can provide policy makers with some key lessons. The most important is that it may be as important to bolster shared public beliefs about key institutions such as voting, public commenting, and census taking against attack, as to bolster the mechanisms and related institutions themselves.
Specifically, many efforts to mitigate attacks against democratic systems begin with spreading public awareness and alarm about their vulnerabilities. This has the benefit of increasing awareness about real problems, but it may especially if exaggerated for effect damage public confidence in the very social aggregation institutions it means to protect. This may mean, for example, that public awareness efforts about Russian hacking that are based on flawed analytic techniques may themselves damage democracy by exaggerating the consequences of attacks.
More generally, this poses important challenges for policy efforts to secure social aggregation institutions against attacks. How can one best secure the systems themselves without damaging public confidence in them? At a minimum, successful policy measures will not simply identify problems in existing systems, but provide practicable, publicly visible, and readily understandable solutions to mitigate them.
We have focused on the problem of confidence attacks in this short essay, because they are both more poorly understood and more profound than flooding attacks. Given historical experience, democracy can probably survive some amount of disinformation about citizens’ beliefs better than it can survive attacks aimed at its core institutions of aggregation. Policy makers need a better understanding of the relationship between political institutions and social beliefs: specifically, the importance of the social aggregation institutions that allow democracies to understand themselves.
There are some low-hanging fruit. Very often, hardening these institutions against attacks on their confidence will go hand in hand with hardening them against attacks more generally. Thus, for example, reforms to voting that require permanent paper ballots and random auditing would not only better secure voting against manipulation, but would have moderately beneficial consequences for public beliefs too.
There are likely broadly similar solutions for public commenting systems. Here, the informational trade-offs are less profound than for voting, since there is no need to balance the requirement for anonymity (so that no-one can tell who voted for who ex post) against other requirements (to ensure that no-one votes twice or more, no votes are changed and so on). Instead, the balance to be struck is between general ease of access and security, making it easier, for example, to leverage secondary sources to validate identity.
Both the robustness of and public confidence in the US census and the other statistical systems that guide the allocation of resources could be improved by insulating them better from political control. For example, a similar system could be used to appoint the director of the census to that for the US Comptroller-General, requiring bipartisan agreement for appointment, and making it hard to exert post-appointment pressure on the official.
Our arguments also illustrate how some well-intentioned efforts to combat social influence operations may have perverse consequences for general social beliefs. The perception of security is at least as important as the reality of security, and any defenses against information attacks need to address both.
However, we need far better developed intellectual tools if we are to properly understand the trade-offs, instead of proposing clearly beneficial policies, and avoiding straightforward mistakes. Forging such tools will require computer security specialists to start thinking systematically about public beliefs as an integral part of the systems that they seek to defend. It will mean that more military oriented cybersecurity specialists need to think deeply about the functioning of democracy and the capacity of internal as well as external actors to disrupt it, rather than reaching for their standard toolkit of state-level deterrence tools. Finally, specialists in the workings of democracy have to learn how to think about democracy and its trade-offs in specifically informational terms.
Linus has released the 4.17 kernel, which will indeed be called “4.17”. “No, I didn’t call it 5.0, even though all the git object count numerology was in place for that. It will happen in the not _too_ distant future, and I’m told all the release scripts on kernel.org are ready for it, but I didn’t feel there was any real reason for it.”
Headline features in this release include improved load estimation in the CPU scheduler, raw BPF tracepoints, lazytime support in the XFS filesystem, full in-kernel TLS protocol support, histogram triggers for tracing, mitigations for the latest Spectre variants, and, of course, the removal of support for eight unloved processor architectures.
A new PGP vulnerability was announced today. Basically, the vulnerability makes use of the fact that modern e-mail programs allow for embedded HTML objects. Essentially, if an attacker can intercept and modify a message in transit, he can insert code that sends the plaintext in a URL to a remote website. Very clever.
The EFAIL attacks exploit vulnerabilities in the OpenPGP and S/MIME standards to reveal the plaintext of encrypted emails. In a nutshell, EFAIL abuses active content of HTML emails, for example externally loaded images or styles, to exfiltrate plaintext through requested URLs. To create these exfiltration channels, the attacker first needs access to the encrypted emails, for example, by eavesdropping on network traffic, compromising email accounts, email servers, backup systems or client computers. The emails could even have been collected years ago.
The attacker changes an encrypted email in a particular way and sends this changed encrypted email to the victim. The victim’s email client decrypts the email and loads any external content, thus exfiltrating the plaintext to the attacker.
A few initial comments:
1. Being able to intercept and modify e-mails in transit is the sort of thing the NSA can do, but is hard for the average hacker. That being said, there are circumstances where someone can modify e-mails. I don’t mean to minimize the seriousness of this attack, but that is a consideration.
2. The vulnerability isn’t with PGP or S/MIME itself, but in the way they interact with modern e-mail programs. You can see this in the two suggested short-term mitigations: “No decryption in the e-mail client,” and “disable HTML rendering.”
3. I’ve been getting some weird press calls from reporters wanting to know if this demonstrates that e-mail encryption is impossible. No, this just demonstrates that programmers are human and vulnerabilities are inevitable. PGP almost certainly has fewer bugs than your average piece of software, but it’s not bug free.
3. Why is anyone using encrypted e-mail anymore, anyway? Reliably and easily encrypting e-mail is an insurmountably hard problem for reasons having nothing to do with today’s announcement. If you need to communicate securely, use Signal. If having Signal on your phone will arouse suspicion, use WhatsApp.
I’ll post other commentaries and analyses as I find them.
Ever since kernel page-table isolation (PTI) was introduced as a mitigation for the Meltdown CPU vulnerability, users have worried about how it affects the performance of their systems. Most of that concern has been directed toward its impact on computing performance, but I/O performance also matters. At the 2018 Linux Storage, Filesystem, and Memory-Management Summit, Ming Lei presented some preliminary work he has done to try to quantify how severely PTI affects block I/O operations.
Last year I wrote about the Digital Security Exchange. The project is live:
The DSX works to strengthen the digital resilience of U.S. civil society groups by improving their understanding and mitigation of online threats.
We do this by pairing civil society and social sector organizations with credible and trustworthy digital security experts and trainers who can help them keep their data and networks safe from exposure, exploitation, and attack. We are committed to working with community-based organizations, legal and journalistic organizations, civil rights advocates, local and national organizers, and public and high-profile figures who are working to advance social, racial, political, and economic justice in our communities and our world.
If you are either an organization who needs help, or an expert who can provide help, visit their website.
The OpenBSD 6.3 release is out. “The release was scheduled for April 15, but since all the components are ready ahead of schedule it is being released now.” This release includes mitigation for the Meltdown vulnerability but not for Spectre on x86.
Linus has released the 4.16 kernel, as expected. “We had a number of fixes and cleanups elsewhere, but none of it made me go ‘uhhuh, better let this soak for another week’“. Some of the headline changes in this release include initial support for the Jailhouse hypervisor, the usercopy whitelisting hardening patches, some improvements to the deadline scheduler and, of course, a lot of Meltdown and Spectre mitigation work.
Version 6.0.0 of the LLVM compiler suite is out. “This release is the result of the community’s work over the past six months, including: retpoline Spectre variant 2 mitigation, significantly improved CodeView debug info for Windows, GlobalISel by default for AArch64 at -O0, improved scheduling on several x86 micro-architectures, Clang defaults to -std=gnu++14 instead of -std=gnu++98, support for some upcoming C++2a features, improved optimizations, new compiler warnings, many bug fixes, and more.”
DDoS vandals have long intensified their attacks by sending a small number of specially designed data packets to publicly available services. The services then unwittingly respond by sending a much larger number of unwanted packets to a target. The best known vectors for these DDoS amplification attacks are poorly secured domain name system resolution servers, which magnify volumes by as much as 50 fold, and network time protocol, which increases volumes by about 58 times.
On Tuesday, researchers reported attackers are abusing a previously obscure method that delivers attacks 51,000 times their original size, making it by far the biggest amplification method ever used in the wild. The vector this time is memcached, a database caching system for speeding up websites and networks. Over the past week, attackers have started abusing it to deliver DDoSes with volumes of 500 gigabits per second and bigger, DDoS mitigation service Arbor Networks reported in a blog post.
Product security is an interesting animal: it is a uniquely cross-disciplinary endeavor that spans policy, consulting, process automation, in-depth software engineering, and cutting-edge vulnerability research. And in contrast to many other specializations in our field of expertise – say, incident response or network security – we have virtually no time-tested and coherent frameworks for setting it up within a company of any size.
In my previous post, I shared some thoughts on nurturing technical organizations and cultivating the right kind of leadership within. Today, I figured it would be fitting to follow up with several notes on what I learned about structuring product security work – and about actually making the effort count.
The “comfort zone” trap
For security engineers, knowing your limits is a sought-after quality: there is nothing more dangerous than a security expert who goes off script and starts dispensing authoritatively-sounding but bogus advice on a topic they know very little about. But that same quality can be destructive when it prevents us from growing beyond our most familiar role: that of a critic who pokes holes in other people’s designs.
The role of a resident security critic lends itself all too easily to a sense of supremacy: the mistaken belief that our cognitive skills exceed the capabilities of the engineers and product managers who come to us for help – and that the cool bugs we file are the ultimate proof of our special gift. We start taking pride in the mere act of breaking somebody else’s software – and then write scathing but ineffectual critiques addressed to executives, demanding that they either put a stop to a project or sign off on a risk. And hey, in the latter case, they better brace for our triumphant “I told you so” at some later date.
Of course, escalations of this type have their place, but they need to be a very rare sight; when practiced routinely, they are a telltale sign of a dysfunctional team. We might be failing to think up viable alternatives that are in tune with business or engineering needs; we might be very unpersuasive, failing to communicate with other rational people in a language they understand; or it might be that our tolerance for risk is badly out of whack with the rest of the company. Whatever the cause, I’ve seen high-level escalations where the security team spoke of valiant efforts to resist inexplicably awful design decisions or data sharing setups; and where product leads in turn talked about pressing business needs randomly blocked by obstinate security folks. Sometimes, simply having them compare their notes would be enough to arrive at a technical solution – such as sharing a less sensitive subset of the data at hand.
To be effective, any product security program must be rooted in a partnership with the rest of the company, focused on helping them get stuff done while eliminating or reducing security risks. To combat the toxic us-versus-them mentality, I found it helpful to have some team members with software engineering backgrounds, even if it’s the ownership of a small open-source project or so. This can broaden our horizons, helping us see that we all make the same mistakes – and that not every solution that sounds good on paper is usable once we code it up.
Getting off the treadmill
All security programs involve a good chunk of operational work. For product security, this can be a combination of product launch reviews, design consulting requests, incoming bug reports, or compliance-driven assessments of some sort. And curiously, such reactive work also has the property of gradually expanding to consume all the available resources on a team: next year is bound to bring even more review requests, even more regulatory hurdles, and even more incoming bugs to triage and fix.
Being more tractable, such routine tasks are also more readily enshrined in SDLs, SLAs, and all kinds of other official documents that are often mistaken for a mission statement that justifies the existence of our teams. Soon, instead of explaining to a developer why they should fix a particular problem right away, we end up pointing them to page 17 in our severity classification guideline, which defines that “severity 2” vulnerabilities need to be resolved within a month. Meanwhile, another policy may be telling them that they need to run a fuzzer or a web application scanner for a particular number of CPU-hours – no matter whether it makes sense or whether the job is set up right.
To run a product security program that scales sublinearly, stays abreast of future threats, and doesn’t erect bureaucratic speed bumps just for the sake of it, we need to recognize this inherent tendency for operational work to take over – and we need to reign it in. No matter what the last year’s policy says, we usually don’t need to be doing security reviews with a particular cadence or to a particular depth; if we need to scale them back 10% to staff a two-quarter project that fixes an important API and squashes an entire class of bugs, it’s a short-term risk we should feel empowered to take.
As noted in my earlier post, I find contingency planning to be a valuable tool in this regard: why not ask ourselves how the team would cope if the workload went up another 30%, but bad financial results precluded any team growth? It’s actually fun to think about such hypotheticals ahead of the time – and hey, if the ideas sound good, why not try them out today?
Living for a cause
It can be difficult to understand if our security efforts are structured and prioritized right; when faced with such uncertainty, it is natural to stick to the safe fundamentals – investing most of our resources into the very same things that everybody else in our industry appears to be focusing on today.
I think it’s important to combat this mindset – and if so, we might as well tackle it head on. Rather than focusing on tactical objectives and policy documents, try to write down a concise mission statement explaining why you are a team in the first place, what specific business outcomes you are aiming for, how do you prioritize it, and how you want it all to change in a year or two. It should be a fluid narrative that reads right and that everybody on your team can take pride in; my favorite way of starting the conversation is telling folks that we could always have a new VP tomorrow – and that the VP’s first order of business could be asking, “why do you have so many people here and how do I know they are doing the right thing?”. It’s a playful but realistic framing device that motivates people to get it done.
In general, a comprehensive product security program should probably start with the assumption that no matter how many resources we have at our disposal, we will never be able to stay in the loop on everything that’s happening across the company – and even if we did, we’re not going to be able to catch every single bug. It follows that one of our top priorities for the team should be making sure that bugs don’t happen very often; a scalable way of getting there is equipping engineers with intuitive and usable tools that make it easy to perform common tasks without having to worry about security at all. Examples include standardized, managed containers for production jobs; safe-by-default APIs, such as strict contextual autoescaping for XSS or type safety for SQL; security-conscious style guidelines; or plug-and-play libraries that take care of common crypto or ACL enforcement tasks.
Of course, not all problems can be addressed on framework level, and not every engineer will always reach for the right tools. Because of this, the next principle that I found to be worth focusing on is containment and mitigation: making sure that bugs are difficult to exploit when they happen, or that the damage is kept in check. The solutions in this space can range from low-level enhancements (say, hardened allocators or seccomp-bpf sandboxes) to client-facing features such as browser origin isolation or Content Security Policy.
The usual consulting, review, and outreach tasks are an important facet of a product security program, but probably shouldn’t be the sole focus of your team. It’s also best to avoid undue emphasis on vulnerability showmanship: while valuable in some contexts, it creates a hypercompetitive environment that may be hostile to less experienced team members – not to mention, squashing individual bugs offers very limited value if the same issue is likely to be reintroduced into the codebase the next day. I like to think of security reviews as a teaching opportunity instead: it’s a way to raise awareness, form partnerships with engineers, and help them develop lasting habits that reduce the incidence of bugs. Metrics to understand the impact of your work are important, too; if your engagements are seen mostly as a yet another layer of red tape, product teams will stop reaching out to you for advice.
The other tenet of a healthy product security effort requires us to recognize at a scale and given enough time, every defense mechanism is bound to fail – and so, we need ways to prevent bugs from turning into incidents. The efforts in this space may range from developing product-specific signals for the incident response and monitoring teams; to offering meaningful vulnerability reward programs and nourishing a healthy and respectful relationship with the research community; to organizing regular offensive exercises in hopes of spotting bugs before anybody else does.
Oh, one final note: an important feature of a healthy security program is the existence of multiple feedback loops that help you spot problems without the need to micromanage the organization and without being deathly afraid of taking chances. For example, the data coming from bug bounty programs, if analyzed correctly, offers a wonderful way to alert you to systemic problems in your codebase – and later on, to measure the impact of any remediation and hardening work.
This is part one of a series. The second part will be posted later this week. Use the Join button above to receive notification of future posts in this series.
Though most of us have never set foot inside of a data center, as citizens of a data-driven world we nonetheless depend on the services that data centers provide almost as much as we depend on a reliable water supply, the electrical grid, and the highway system. Every time we send a tweet, post to Facebook, check our bank balance or credit score, watch a YouTube video, or back up a computer to the cloud we are interacting with a data center.
In this series, The Challenges of Opening a Data Center, we’ll talk in general terms about the factors that an organization needs to consider when opening a data center and the challenges that must be met in the process. Many of the factors to consider will be similar for opening a private data center or seeking space in a public data center, but we’ll assume for the sake of this discussion that our needs are more modest than requiring a data center dedicated solely to our own use (i.e. we’re not Google, Facebook, or China Telecom).
Data center technology and management are changing rapidly, with new approaches to design and operation appearing every year. This means we won’t be able to cover everything happening in the world of data centers in our series, however, we hope our brief overview proves useful.
What is a Data Center?
A data center is the structure that houses a large group of networked computer servers typically used by businesses, governments, and organizations for the remote storage, processing, or distribution of large amounts of data.
While many organizations will have computing services in the same location as their offices that support their day-to-day operations, a data center is a structure dedicated to 24/7 large-scale data processing and handling.
Depending on how you define the term, there are anywhere from a half million data centers in the world to many millions. While it’s possible to say that an organization’s on-site servers and data storage can be called a data center, in this discussion we are using the term data center to refer to facilities that are expressly dedicated to housing computer systems and associated components, such as telecommunications and storage systems. The facility might be a private center, which is owned or leased by one tenant only, or a shared data center that offers what are called “colocation services,” and rents space, services, and equipment to multiple tenants in the center.
A large, modern data center operates around the clock, placing a priority on providing secure and uninterrrupted service, and generally includes redundant or backup power systems or supplies, redundant data communication connections, environmental controls, fire suppression systems, and numerous security devices. Such a center is an industrial-scale operation often using as much electricity as a small town.
Types of Data Centers
There are a number of ways to classify data centers according to how they will be used, whether they are owned or used by one or multiple organizations, whether and how they fit into a topology of other data centers; which technologies and management approaches they use for computing, storage, cooling, power, and operations; and increasingly visible these days: how green they are.
Data centers can be loosely classified into three types according to who owns them and who uses them.
Exclusive Data Centers are facilities wholly built, maintained, operated and managed by the business for the optimal operation of its IT equipment. Some of these centers are well-known companies such as Facebook, Google, or Microsoft, while others are less public-facing big telecoms, insurance companies, or other service providers.
Managed Hosting Providers are data centers managed by a third party on behalf of a business. The business does not own data center or space within it. Rather, the business rents IT equipment and infrastructure it needs instead of investing in the outright purchase of what it needs.
Colocation Data Centers are usually large facilities built to accommodate multiple businesses within the center. The business rents its own space within the data center and subsequently fills the space with its IT equipment, or possibly uses equipment provided by the data center operator.
Backblaze, for example, doesn’t own its own data centers but colocates in data centers owned by others. As Backblaze’s storage needs grow, Backblaze increases the space it uses within a given data center and/or expands to other data centers in the same or different geographic areas.
Availability is Key
When designing or selecting a data center, an organization needs to decide what level of availability is required for its services. The type of business or service it provides likely will dictate this. Any organization that provides real-time and/or critical data services will need the highest level of availability and redundancy, as well as the ability to rapidly failover (transfer operation to another center) when and if required. Some organizations require multiple data centers not just to handle the computer or storage capacity they use, but to provide alternate locations for operation if something should happen temporarily or permanently to one or more of their centers.
Organizations operating data centers that can’t afford any downtime at all will typically operate data centers that have a mirrored site that can take over if something happens to the first site, or they operate a second site in parallel to the first one. These data center topologies are called Active/Passive, and Active/Active, respectively. Should disaster or an outage occur, disaster mode would dictate immediately moving all of the primary data center’s processing to the second data center.
While some data center topologies are spread throughout a single country or continent, others extend around the world. Practically, data transmission speeds put a cap on centers that can be operated in parallel with the appearance of simultaneous operation. Linking two data centers located apart from each other — say no more than 60 miles to limit data latency issues — together with dark fiber (leased fiber optic cable) could enable both data centers to be operated as if they were in the same location, reducing staffing requirements yet providing immediate failover to the secondary data center if needed.
This redundancy of facilities and ensured availability is of paramount importance to those needing uninterrupted data center services.
Leadership in Energy and Environmental Design (LEED) is a rating system devised by the United States Green Building Council (USGBC) for the design, construction, and operation of green buildings. Facilities can achieve ratings of certified, silver, gold, or platinum based on criteria within six categories: sustainable sites, water efficiency, energy and atmosphere, materials and resources, indoor environmental quality, and innovation and design.
Green certification has become increasingly important in data center design and operation as data centers require great amounts of electricity and often cooling water to operate. Green technologies can reduce costs for data center operation, as well as make the arrival of data centers more amenable to environmentally-conscious communities.
The ACT, Inc. data center in Iowa City, Iowa was the first data center in the U.S. to receive LEED-Platinum certification, the highest level available.
ACT Data Center exterior
ACT Data Center interior
Factors to Consider When Selecting a Data Center
There are numerous factors to consider when deciding to build or to occupy space in a data center. Aspects such as proximity to available power grids, telecommunications infrastructure, networking services, transportation lines, and emergency services can affect costs, risk, security and other factors that need to be taken into consideration.
The size of the data center will be dictated by the business requirements of the owner or tenant. A data center can occupy one room of a building, one or more floors, or an entire building. Most of the equipment is often in the form of servers mounted in 19 inch rack cabinets, which are usually placed in single rows forming corridors (so-called aisles) between them. This allows staff access to the front and rear of each cabinet. Servers differ greatly in size from 1U servers (i.e. one “U” or “RU” rack unit measuring 44.50 millimeters or 1.75 inches), to Backblaze’s Storage Pod design that fits a 4U chassis, to large freestanding storage silos that occupy many square feet of floor space.
Location will be one of the biggest factors to consider when selecting a data center and encompasses many other factors that should be taken into account, such as geological risks, neighboring uses, and even local flight paths. Access to suitable available power at a suitable price point is often the most critical factor and the longest lead time item, followed by broadband service availability.
With more and more data centers available providing varied levels of service and cost, the choices increase each year. Data center brokers can be employed to find a data center, just as one might use a broker for home or other commercial real estate.
Websites listing available colocation space, such as upstack.io, or entire data centers for sale or lease, are widely used. A common practice is for a customer to publish its data center requirements, and the vendors compete to provide the most attractive bid in a reverse auction.
Business and Customer Proximity
The center’s closeness to a business or organization may or may not be a factor in the site selection. The organization might wish to be close enough to manage the center or supervise the on-site staff from a nearby business location. The location of customers might be a factor, especially if data transmission speeds and latency are important, or the business or customers have regulatory, political, tax, or other considerations that dictate areas suitable or not suitable for the storage and processing of data.
Local climate is a major factor in data center design because the climatic conditions dictate what cooling technologies should be deployed. In turn this impacts uptime and the costs associated with cooling, which can total as much as 50% or more of a center’s power costs. The topology and the cost of managing a data center in a warm, humid climate will vary greatly from managing one in a cool, dry climate. Nevertheless, data centers are located in both extremely cold regions and extremely hot ones, with innovative approaches used in both extremes to maintain desired temperatures within the center.
Geographic Stability and Extreme Weather Events
A major obvious factor in locating a data center is the stability of the actual site as regards weather, seismic activity, and the likelihood of weather events such as hurricanes, as well as fire or flooding.
Backblaze’s Sacramento data center describes its location as one of the most stable geographic locations in California, outside fault zones and floodplains.
Sometimes the location of the center comes first and the facility is hardened to withstand anticipated threats, such as Equinix’s NAP of the Americas data center in Miami, one of the largest single-building data centers on the planet (six stories and 750,000 square feet), which is built 32 feet above sea level and designed to withstand category 5 hurricane winds.
Equinix “NAP of the Americas” Data Center in Miami
Most data centers don’t have the extreme protection or history of the Bahnhof data center, which is located inside the ultra-secure former nuclear bunker Pionen, in Stockholm, Sweden. It is buried 100 feet below ground inside the White Mountains and secured behind 15.7 in. thick metal doors. It prides itself on its self-described “Bond villain” ambiance.
Bahnhof Data Center under White Mountain in Stockholm
Usually, the data center owner or tenant will want to take into account the balance between cost and risk in the selection of a location. The Ideal quadrant below is obviously favored when making this compromise.
Risk mitigation also plays a strong role in pricing. The extent to which providers must implement special building techniques and operating technologies to protect the facility will affect price. When selecting a data center, organizations must make note of the data center’s certification level on the basis of regulatory requirements in the industry. These certifications can ensure that an organization is meeting necessary compliance requirements.
Electrical power usually represents the largest cost in a data center. The cost a service provider pays for power will be affected by the source of the power, the regulatory environment, the facility size and the rate concessions, if any, offered by the utility. At higher level tiers, battery, generator, and redundant power grids are a required part of the picture.
Fault tolerance and power redundancy are absolutely necessary to maintain uninterrupted data center operation. Parallel redundancy is a safeguard to ensure that an uninterruptible power supply (UPS) system is in place to provide electrical power if necessary. The UPS system can be based on batteries, saved kinetic energy, or some type of generator using diesel or another fuel. The center will operate on the UPS system with another UPS system acting as a backup power generator. If a power outage occurs, the additional UPS system power generator is available.
Many data centers require the use of independent power grids, with service provided by different utility companies or services, to prevent against loss of electrical service no matter what the cause. Some data centers have intentionally located themselves near national borders so that they can obtain redundant power from not just separate grids, but from separate geopolitical sources.
Higher redundancy levels required by a company will of invariably lead to higher prices. If one requires high availability backed by a service-level agreement (SLA), one can expect to pay more than another company with less demanding redundancy requirements.
Stay Tuned for Part 2 of The Challenges of Opening a Data Center
That’s it for part 1 of this post. In subsequent posts, we’ll take a look at some other factors to consider when moving into a data center such as network bandwidth, cooling, and security. We’ll take a look at what is involved in moving into a new data center (including stories from Backblaze’s experiences). We’ll also investigate what it takes to keep a data center running, and some of the new technologies and trends affecting data center design and use. You can discover all posts on our blog tagged with “Data Center” by following the link https://www.backblaze.com/blog/tag/data-center/.
The second part of this series on The Challenges of Opening a Data Center will be posted later this week. Use the Join button above to receive notification of future posts in this series.
Researchers have discovered new variants of Spectre and Meltdown. The software mitigations for Spectre and Meltdown seem to block these variants, although the eventual CPU fixes will have to be expanded to account for these new attacks.
The initial panic over the Meltdown and Spectre processor vulnerabilities has faded, and work on mitigations in the kernel has slowed since our mid-January report. That work has not stopped, though. Fully equipping the kernel to protect systems from these vulnerabilities is a task that may well require years. Read on for an update on the current status of that work.
Linus has released the 4.15 kernel. “After a release cycle that was unusual in so many (bad) ways, this last week was really pleasant. Quiet and small, and no last-minute panics, just small fixes for various issues. I never got a feeling that I’d need to extend things by yet another week, and 4.15 looks fine to me.”
Some of the more significant features in this release include: the long-awaited CPU controller for the version-2 control-group interface, significant live-patching improvements, initial support for the RISC-V architecture, support for AMD’s secure encrypted virtualization feature, and the MAP_SYNC mechanism for working with nonvolatile memory. This release also, of course, includes mitigations for the Meltdown and Spectre variant-2 vulnerabilities though, as Linus points out in the announcement, the work of dealing with these issues is not yet done.
By now, almost everybody has probably seen the press coverage of Linus Torvalds’s remarks about one of the patches addressing Spectre variant 2. Less noted, but much more informative, is David Woodhouse’s response on why those patches are the way they are. “That’s why my initial idea, as implemented in this RFC patchset, was to stick with IBRS on Skylake, and use retpoline everywhere else. I’ll give you ‘garbage patches’, but they weren’t being ‘just mindlessly sent around’. If we’re going to drop IBRS support and accept the caveats, then let’s do it as a conscious decision having seen what it would look like, not just drop it quietly because poor Davey is too scared that Linus might shout at him again.”
While some aspects of the kernel’s defenses against the Meltdown and Spectre vulnerabilities were more-or-less in place when the problems were disclosed on January 3, others were less fully formed. Additionally, many of the mitigations (especially for the two Spectre variants) had not been seen in public prior to the disclosure, meaning that there was a lot of scope for discussion once they came out. Many of those discussions are slowing down, and the kernel’s initial response has mostly come into focus. The 4.15 kernel will include a broad set of mitigations, while some others will have to wait for later; read on for details on where things stand.
The collective thoughts of the interwebz
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.