Tag Archives: Plex

Ethereum, Proof-of-Stake… and the consequences

Post Syndicated from Григор original http://www.gatchev.info/blog/?p=2070

For those who aren't cryptocurrency-savvy: Ethereum is a cryptocurrency project based around the coin Ether. It has the support of many big banks, big hedge funds and some states (Russia, China, etc.). Among cryptocurrencies it is second only to Bitcoin – and might even overtake it in time, especially if Bitcoin doesn't finally move to fix some of its problems.

Ethereum offers some abilities that few other cryptocurrencies do. The most important one is the support for "smart contracts" – a kind of electronic contract that can easily be executed and enforced with little to no human participation. This post, however, is dedicated to another of its traits – Proof of Stake.

To work and exist, every cryptocurrency depends on some kind of proof. Most of them use a Proof-of-Work scheme: one has to put some work – e.g. calculating checksums – behind one's participation in the network and its decisions, and receives newly generated coins for it. This, however, results in a huge amount of work done only to prove that, well, you can do it and deserve to be in and receive some of the newly squeezed juice.

As of August 2017, Ethereum uses this scheme too. However, they plan to switch to a Proof-of-Stake algorithm named Casper. In it, you prove yourself not by doing work, but by proving that you own Ether. As this requires practically no work, it is technically much more efficient than Proof-of-Work schemes.
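To make the idea concrete, here is a minimal sketch of generic stake-weighted selection in Python – an illustration of the principle only, not Casper's actual validator-selection rules:

    import random

    # Illustrative only: pick a participant with probability proportional to
    # the Ether it has staked. Casper's real rules are more involved.
    def pick_validator(stakes):
        """stakes: dict mapping address -> amount of staked Ether."""
        total = sum(stakes.values())
        point = random.uniform(0, total)
        running = 0.0
        for address, stake in stakes.items():
            running += stake
            if point <= running:
                return address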

Technically, Casper is an amazing design. I congratulate the Ethereum team for it. However, economically its usage appears to have an important weakness. It is described below.

----

A polarized system

With Casper, the Ether generated by the Ethereum network and the decision power in it are distributed to those who already own Ether. As a consequence, most of both go to those who own the most Ether. (There might be attempts to limit that, but these are easily defeated. For example, limiting the amount distributed to an address can be circumvented by a Sybil attack.)

Over time, such a distribution will create a financial ecosystem where most of the money and voting power are held by a small minority of the participants. The large majority will have little to none of either – in sum it will hold less money and fewer votes than the minority of "haves". Given the speed with which cryptocurrency systems evolve, it is realistic to expect this development within ten, maybe even five or fewer years after introducing Casper.

The “middle class”

Economists love to repeat how important it is to have a strong middle class. Why, and how does that translate to a cryptocurrency-based financial system?

In systemic terms, the "middle class" of a financial system is the set of entities that each control a noticeable but not very big amount of resources.

Game theory shows that in a financial system, entities with different clout usually have different interests. These interests usually reflect the amount of resources they control. Entities with little to no resources tend to have interests opposed to those of the entities with the biggest resources – especially in systems where the total amount of resources changes slowly and the economics are close to a zero-sum game (for example, most cryptocurrency systems). The interests of the "middle class" entities are, in most respects, in the middle.

For an economy to work, there must be a balance of interests that creates an incentive for all of its members to participate. In financial systems, where the interests of the "haves" mostly oppose those of the "have-nots", creating such a balance depends on the presence and influence of a "middle class". Its interests are usually the closest to a compromise that satisfies everyone, and its influence is the key to achieving that compromise within the system.

If the state of the system is not acceptable to all entities, those who do not accept it eventually leave. (Usually their participation is required for the system's survival, so this brings the system down.) If these entities cannot leave the system, they ultimately reject its rules and try to change them by force. If that is impossible too, they usually resort to denying the system what makes them useful to it, thus decreasing its competitiveness against other systems.

The most reliable way to have an acceptable compromise enforced in a system is for it to contain a "middle class" that collectively controls more resources than any other segment of entities, preferably at least 51% of the system's resources. (This assumes that the "middle class" is able and willing to protect its interests. If some of these entities are manipulated into defending someone else's interests – e.g. botnets in computer networks, manipulated voters during elections, etc. – these numbers apply to the non-controlled among them.)

A system that doesn't have a non-controlled "middle class" in control of a decisive amount of resources usually does not have an influential set of interests that form an acceptable compromise between the poles of interest. For this reason, it can be called a polarized system.

The limitation on development

In a polarized system, the incentive for development is minimized. (Development is potentially disruptive, and those who hold the majority of the financial means and the decision power there have only to lose from a disruption. When factoring in the expected profits from development, the situation always becomes a zero-sum game.) The system becomes static (thus cementing the zero-sum situation in it) and is under threat of being overtaken by a competing financial system. When that happens, it is usually destroyed together with all stakes in it.

Also, almost any initiative in such a financial system is bound to turn into a cartel, oligopoly or monopoly, due to the small number of participants with the resources to start and support an initiative. That effectively destroys its markets, contributing to the weakness of the system and further limiting its ability to develop.

Another problem that stems from this is that, during an interaction, the incentive to violate the rules and push the counterparty into a loss is greater than the incentive to compete by making a better offer. This in turn removes the incentive to increase productivity, which is a key driver of development.

Yet another problem of the concentration of most resources into a few entities is the increased gain from attacking one of them and appropriating its resources – and thus the increased incentive to do so. Since good defensive capabilities are usually an excellent base for offense, this pulls the "haves" into an "arms race", redirecting more and more of their resources into defense. This also leaves development outside the arms race increasingly resource-strapped. (The "arms race" itself generates development, but the race situation prevents it from trickling into "non-military" applications.)

These are only some of the constraints on development in a polarized system. Listing all of them would make for a long read.

Trickle-up and trickle-down

In theory, every economic system involves two processes: trickle-down and trickle-up. So, any concentration of resources at the top should be offset by an automatically increased trickle-down. However, a better understanding of how these processes work shows that this logic is faulty.

Any financial exchange in a system consists of two parts. One of them covers the actual production cost of whatever resource is being exchanged against the finances. The other part is the profit of the entity that obtains the finances. From the viewpoint of that entity, the first part vs. the resource given is zero-sum – its incentive to participate in this exchange is the second part, the profit. That second part is effectively the trickle in the system, as it is the only resource really gained.

The direction and the size of the trickle ultimately depend on the balance of many factors, some of them random, others constant. In the long run, it is the constant factors that determine the size and the direction of the net trickle.

The most important constant factor is the benefit of scale (BOS). It dictates that the bigger entities are able to pull the balance to their side more strongly than the smaller ones. Some miss that chance, but others use it. It makes the trickle-up stronger than the trickle-down. In a system where the transaction outcome is close to a zero-sum game, this concentrates resources at the top, at a speed that depends on the volume of financial interactions per unit of time.

(Actually, the formula is a bit more complex. All dynamic entities – e.g. living organisms, active companies, etc. – have an "existence maintenance" expense which they cannot avoid. However, the amount of resources in a system above the total existence maintenance follows the simple rule above. And these are the only resources available for investing in anything, e.g. development.)

In real-life systems, the power of BOS is limited. There are many different random factors that compete with and influence one another, some of them outweighing BOS. Also, at any moment some factors lose importance and/or cease to exist, while others appear and/or gain importance. The complexity of this makes any attempt by an entity, or a pool of entities, to take control over the system hard and slow. This gives the other entities time and ways to react and try to block the takeover attempt. Real-life systems also have many built-in constraints against scale-based takeovers – anti-trust laws, separation of government powers, enforced financial trickle-down through taxes on the rich and benefits for the poor, etc. All of these together manage to prevent most takeover attempts, or to limit them to only a segment of the system.

How does a Proof-of-Stake based cryptocurrency fare against these?

A POS-based cryptocurrency financial system has no constraints against scale-based takeovers. It has only one kind of clout – the amount of resources controlled by an entity. That kind of clout is built into it, carries all the importance in it, and cannot lose that importance or disappear. The system has no other types of resources and no slowing due to complexity. It is not segmented – whoever has these resources has it all. There are no built-in constraints against scale-based takeovers, nor mechanisms to strengthen resource trickle-down. In short, it is the ideal ground for creating a polarized financial system.

So, it would be only logical to expect that a Proof-of-Stake based Ether financial system will suffer from the problems a polarized system presents. Despite all of its technical ingenuity, its longer-term financial usability is limited, and participation in it may be dangerous to any entity smaller than, e.g., a big bank, a big hedge fund or a big authoritarian state.

All the fixes for this problem that I have been able to think of so far would be easily beaten by simple attacks. I am not sure whether a reliable solution is possible at all.

Do smart contracts and secondary tokens change this?

Unfortunately, no. Smart contracts are based on having Ether and need Ether to exist and act. Thus, they are bound to the financial situation of the Ether financial system and are influenced by it. The bigger the scope of the smart contract, the bigger its dependence on the state of Ether.

Due to this, smart contracts of meaningful size will find themselves hampered and maybe even endangered by polarization in a financial system powered by POS-based Ethereum. It is technically possible to migrate these contracts to a competing underlying system, but it won't be easy – probably not even when the competing system is technically a clone of Ethereum, like Ethereum Classic. The migration cost might exceed the migration benefits at any given stage of the contract project's development, even if the total migration benefits are far larger than this cost.

Eventually this problem might become public knowledge and most projects in need of a smart contract might start avoiding Ethereum. This will lead to decreased interest in participation in the Ethereum ecosystem, to a loss of market cap, and eventually maybe even to the demise of this technically great project.

Other dangers

There is a danger that the “haves” minority in a polarized system might start actively investing resources in creating other systems that suffer from the same problem (as they benefit from it), or in modifying existing systems in this direction. This might decrease the potential for development globally. As some of the backers of Ethereum are entities with enormous clout worldwide, that negative influence on the global system might be significant.

Exponential Backoff And Jitter

Post Syndicated from Marc Brooker original https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/

Introducing OCC

Optimistic concurrency control (OCC) is a time-honored way for multiple writers to safely modify a single object without losing writes. OCC has three nice properties: it will always make progress as long as the underlying store is available, it’s easy to understand, and it’s easy to implement. DynamoDB’s conditional writes make OCC a natural fit for DynamoDB users, and it’s natively supported by the DynamoDBMapper client.
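As a rough sketch of what an OCC write loop looks like (against a hypothetical key-value store with a conditional put – the `store` methods below are placeholders, not the DynamoDBMapper API):

    # Hypothetical store client: get() returns (value, version), and
    # put_if_version() succeeds only if the stored version still matches.
    def occ_update(store, key, mutate):
        while True:
            value, version = store.get(key)        # read the current value and its version
            updated = mutate(value)                # apply the change locally
            if store.put_if_version(key, updated, expected_version=version):
                return updated                     # conditional write succeeded; nothing lost
            # another writer got there first: re-read and try again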

While OCC is guaranteed to make progress, it can still perform quite poorly under high contention. The simplest of these contention cases is when a whole lot of clients start at the same time, and try to update the same database row. With one client guaranteed to succeed every round, the time to complete all the updates grows linearly with contention.

For the graphs in this post, I used a small simulator to model the behavior of OCC on a network with delay (and variance in delay), against a remote database. In this simulation, the network introduces delay with a mean of 10ms and variance of 4ms. The first simulation shows how completion time grows linearly with contention. This linear growth is because one client succeeds every round, so it takes N rounds for all N clients to succeed.

Unfortunately, that’s not the whole picture. With N clients contending, the total amount of work done by the system increases with N².
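A quick way to see where the N² comes from: if exactly one client succeeds per round, round k still has N − k + 1 clients attempting, so the total number of attempts is

    N + (N-1) + \dots + 1 = \frac{N(N+1)}{2} = O(N^2)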

Adding Backoff

The problem here is that N clients compete in the first round, N-1 in the second round, and so on. Having every client compete in every round is wasteful. Slowing clients down may help, and the classic way to slow clients down is capped exponential backoff. Capped exponential backoff means that clients multiply their backoff by a constant after each attempt, up to some maximum value. In our case, after each unsuccessful attempt, clients sleep for:
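A sketch of that sleep function in Python (the base and cap values here are illustrative, not the ones used in the simulation):

    # Capped exponential backoff: the sleep doubles with each attempt, up to a cap.
    def expo_backoff(attempt, base=0.05, cap=20.0):
        return min(cap, base * 2 ** attempt)   # seconds to sleep before the next try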

Running the simulation again shows that backoff helps a small amount, but doesn’t solve the problem. Client work has only been reduced slightly.

The best way to see the problem is to look at the times these exponentially backed-off calls happen.

It’s obvious that the exponential backoff is working, in that the calls are happening less and less frequently. The problem also stands out: there are still clusters of calls. Instead of reducing the number of clients competing in every round, we’ve just introduced times when no client is competing. Contention hasn’t been reduced much, although the natural variance in network delay has introduced some spreading.

Adding Jitter

The solution isn’t to remove backoff. It’s to add jitter. Initially, jitter may appear to be a counter-intuitive idea: trying to improve the performance of a system by adding randomness. The time series above makes a great case for jitter – we want to spread out the spikes to an approximately constant rate. Adding jitter is a small change to the sleep function:
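A sketch of that change, with the same illustrative base and cap values – the sleep becomes a random duration between zero and the capped exponential value:

    import random

    # Jittered backoff: sleep a random time between 0 and the capped exponential value.
    def jittered_backoff(attempt, base=0.05, cap=20.0):
        return random.uniform(0, min(cap, base * 2 ** attempt))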

That time series looks a whole lot better. The gaps are gone, and beyond the initial spike, there’s an approximately constant rate of calls. It’s also had a great effect on the total number of calls.

In the case with 100 contending clients, we’ve reduced our call count by more than half. We’ve also significantly improved the time to completion, when compared to un-jittered exponential backoff.

There are a few ways to implement these timed backoff loops. Let’s call the algorithm above “Full Jitter”, and consider two alternatives. The first alternative is “Equal Jitter”, where we always keep some of the backoff and jitter by a smaller amount:
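A sketch of "Equal Jitter" under the same illustrative parameters – half of the capped backoff is kept, and only the other half is randomized:

    import random

    # Equal Jitter: keep half of the capped exponential backoff, jitter the other half.
    def equal_jitter(attempt, base=0.05, cap=20.0):
        temp = min(cap, base * 2 ** attempt)
        return temp / 2 + random.uniform(0, temp / 2)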

The intuition behind this one is that it prevents very short sleeps, always keeping some of the slow down from the backoff. A second alternative is “Decorrelated Jitter”, which is similar to “Full Jitter”, but we also increase the maximum jitter based on the last random value.
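A sketch of "Decorrelated Jitter", one common formulation of the idea described above, where each sleep is drawn between the base and three times the previous sleep (parameters again illustrative):

    import random

    # Decorrelated Jitter: the next sleep depends on the previous (random) sleep.
    # Start with previous_sleep = base on the first attempt.
    def decorrelated_jitter(previous_sleep, base=0.05, cap=20.0):
        return min(cap, random.uniform(base, previous_sleep * 3))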

Which approach do you think is best?

Looking at the amount of client work, the number of calls is approximately the same for “Full” and “Equal” jitter, and higher for “Decorrelated”. All of them cut down work substantially relative to both of the approaches without jitter.

The no-jitter exponential backoff approach is the clear loser. It not only takes more work, but also takes more time than the jittered approaches. In fact, it takes so much more time we have to leave it off the graph to get a good comparison of the other methods.

Of the jittered approaches, “Equal Jitter” is the loser. It does slightly more work than “Full Jitter”, and takes much longer. The decision between “Decorrelated Jitter” and “Full Jitter” is less clear. The “Full Jitter” approach uses less work, but slightly more time. Both approaches, though, present a substantial decrease in client work and server load.

It’s worth noting that none of these approaches fundamentally changes the N² nature of the work to be done, but they do substantially reduce work at reasonable levels of contention. The return on implementation complexity of using jittered backoff is huge, and it should be considered a standard approach for remote clients.

All of the graphs and numbers from this post were generated using a simple simulation of OCC behavior. You can get our simulator code on GitHub, in the aws-arch-backoff-simulator project.

– Marc Brooker

 

Internet Routing and Traffic Engineering

Post Syndicated from Tom Scholl original https://aws.amazon.com/blogs/architecture/internet-routing-and-traffic-engineering/

Internet Routing

Internet routing today is handled through the use of a routing protocol known as BGP (Border Gateway Protocol). Individual networks on the Internet are represented as an autonomous system (AS). An autonomous system has a globally unique autonomous system number (ASN), which is allocated by a Regional Internet Registry (RIR); the RIRs also handle allocation of IP addresses to networks. Each individual autonomous system establishes BGP peering sessions to other autonomous systems to exchange routing information. A BGP peering session is a TCP session established between two routers, each one in a particular autonomous system. This BGP peering session rides across a link, such as a 10Gigabit Ethernet interface between those routers. The routing information contains an IP address prefix and subnet mask, which maps IP addresses to the autonomous system that originates them (the AS origin). Routing information propagates across these autonomous systems based upon policies that individual networks define.

This is where things get a bit interesting because various factors influence how routing is handled on the Internet. There are two main types of relationships between autonomous systems today: Transit and Peering.

Transit is where an autonomous system pays an upstream network (known as a transit provider) for the ability to forward traffic towards it, and that provider forwards the traffic further. It also provides for the purchasing autonomous system (the customer in this relationship) to have its routing information propagated to the provider’s adjacencies. Transit involves obtaining direct connectivity from a customer network to an upstream transit provider network; these connections can be multiple 10Gigabit Ethernet links between each other’s routers. Transit pricing is based upon network utilization in the dominant traffic direction, with 95th percentile billing: a transit provider will look at a month’s worth of utilization and, in the traffic-dominant direction, bill on the 95th percentile of utilization. The unit used in billing is measured in bits per second (bps) and is communicated as a price per Mbps (for example, $2 per Mbps).
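As a concrete illustration of 95th percentile billing, here is a small sketch in Python (the sampling interval and price are made-up values):

    # Illustrative 95th-percentile bill: given one month of utilization samples
    # (bits per second, dominant direction), discard the top 5% and bill the rest's peak.
    def monthly_transit_bill(samples_bps, price_per_mbps=2.0):
        ordered = sorted(samples_bps)
        p95_index = int(len(ordered) * 0.95) - 1   # highest sample after discarding the top 5%
        p95_mbps = ordered[p95_index] / 1_000_000
        return p95_mbps * price_per_mbps

    # e.g. 5-minute samples over a 30-day month would give len(samples_bps) == 8640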

Peering is where an autonomous system connects to another autonomous system and agrees to exchange traffic (and routing information) of their own networks and of any customers (transit customers) they have. With peering, there are two methods by which connectivity is formed. The first is where direct connectivity is established between the individual networks’ routers with multiple 10Gigabit Ethernet or 100Gigabit Ethernet links. This sort of connectivity is known as “private peering” or PNI (Private Network Interconnect). It provides both parties with clear visibility into the interface utilization of traffic in both directions (inbound and outbound). The other form of peering is established via Internet Exchange switches, or IXs. With an Internet Exchange, multiple networks obtain direct connectivity into a set of Ethernet switches, and individual networks can establish BGP sessions across the exchange with other participants. The benefit of the Internet Exchange is that it allows multiple networks to connect to a common location and use it for one-to-many connectivity. A downside is that any given network does not have visibility into the network utilization of other participants.

Most networks will deploy their network equipment (routers, Dense Wave Division Multiplexing (DWDM) transport equipment) into colocation facilities where networks establish direct connectivity to each other. This can be via Internet Exchange switches (which are also found in these colocation facilities) or direct connections, which are fiber optic cables run between the individual suites/racks where the network gear is located.

Routing Policy

Networks will define their routing policy to prefer routing to other networks based upon a variety of items. The BGP best path decision process in a router’s operating system dictates how a router will prefer one BGP path over another. Network operators write their policy to influence that BGP best-path decision process based upon factors such as the cost to deliver traffic to a destination network, in addition to performance.

A typical routing policy within most networks will dictate that internal routes (their own) and routes learned from their own customers are preferred over all other paths. After that, most networks will prefer peering routes, since peering is typically free and can often provide a shorter or more optimal path to reach a destination. Finally, the least preferred routes to a destination are those over paid transit links. When it comes to transit paths, both cost and performance are typically factors in determining how to reach a destination network.

Routing policies themselves are defined on routers in a simple text-based policy language that is specific to the router operating system. They contain two types of functions: matching on one or multiple routes, and an action for that match. The matching can include a list of actual IP prefixes and subnet lengths, ASN origins, AS-Paths or other types of BGP attributes (communities, next-hop, etc). The actions can include resetting BGP attributes such as local-preference, Multi-Exit-Discriminators (MED) and various other values (communities, Origin, etc). Below is a simplified example of a routing policy on routes learned from a transit provider. It has multiple terms that permit an operator to match on specific Internet routes and set a different local-preference value to control what traffic should be forwarded through that provider. There are additional actions to set other BGP attributes related to classifying the routes so they can be easily identified and acted upon by other routers in the network.
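For illustration, a Junos-style sketch of that kind of policy (the prefixes, community names and preference values here are all made up):

    policy-options {
        policy-statement TRANSIT-IN {
            /* Prefer a specific set of destinations through this provider */
            term PREFER-THESE-ROUTES {
                from {
                    route-filter 198.51.100.0/24 orlonger;
                }
                then {
                    local-preference 90;
                    community add FROM-TRANSIT;
                    accept;
                }
            }
            /* Everything else learned from this provider gets a lower preference */
            term DEFAULT {
                then {
                    local-preference 70;
                    community add FROM-TRANSIT;
                    accept;
                }
            }
        }
        community FROM-TRANSIT members 64500:100;
    }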

Network operators will tune their routing policy to determine how to send traffic and how to receive traffic through adjacent autonomous systems. This practice is generally known as BGP traffic-engineering. Making outbound traffic changes is by far the easiest to implement because it involves identifying the particular routes you are interested in directing and increasing the routing preference to egress through a particular adjacency. Operators must take care to examine certain things before and after any policy change to understand the impact of their actions.

Inbound traffic-engineering is a bit more difficult, as it requires a network operator to alter the routing announcements leaving their network in order to influence how other autonomous systems on the Internet prefer to route to them. While influencing the directly adjacent networks is somewhat trivial, influencing networks beyond those directly connected can be tricky. This technique requires the use of features that a transit provider can grant via BGP. In the BGP protocol, there is a type of attribute known as communities. Communities are values you can pass in a routing update across BGP sessions. Most networks use communities to classify routes as transit vs. peer vs. customer. The transit-customer relationship usually gives the customer certain capabilities to control the further propagation of its routes to the provider’s adjacencies. This grants a network the ability to traffic-engineer further upstream, to networks it is not directly connected to.

Traffic-engineering is used for several reasons today on the Internet. The first reason might be to reduce bandwidth costs by preferring particular paths (different transit providers). The other is for performance reasons, where a particular transit provider may have a less-congested or lower-latency path to a destination network. Network operators will view a variety of metrics to determine if there is a problem, then make policy changes and examine the outcome. Of course, on the Internet, the scale of the traffic being moved around counts. Moving a few Gbps of traffic from one path to another may improve performance, but if you move tens of Gbps over you may encounter congestion on the newly selected path. The links between various networks on the Internet today are scaled in capacity based upon observed utilization. Even though you may be paying a transit provider for connectivity, this doesn’t mean every link to external networks is scaled for the amount of traffic you wish to push. As traffic grows, links will be added between individual networks. So causing a massive change in utilization on the Internet can result in congestion, as these newly selected paths are handling more traffic than they ever had before. The result is that network operators must pay attention when moving traffic, shifting it over in increments, and must communicate with other networks to gauge the impact of any traffic moves.

Complicating the above traffic engineering operations is that you are not the only one on the Internet trying to push traffic to certain destinations. Other networks are in a similar position, trying to deliver their own traffic, and will perform their own traffic-engineering. There are also many networks that will refuse to peer with other networks for various reasons. For example, some networks may cite an imbalance in inbound vs. outbound traffic (traffic ratios) or feel that traffic is being dumped on their network. In these cases, the only way to reach these destinations is via a transit provider. In some cases, these networks may offer a “paid peering” product to provide direct connectivity. That paid peering product may be priced lower than what you would pay for transit, or could offer an uncongested path compared to what you would normally observe over transit. Just because you have a path via transit doesn’t mean the path is uncongested at all hours of the day (such as during peak hours).

One way to eliminate the hops between networks is to do just that – eliminate them via direct connections. AWS provides a service to do this known as AWS Direct Connect. With Direct Connect, customers can connect their network directly into the AWS network infrastructure. This will enable bypassing the Internet via direct physical connectivity and remove any potential Internet routing or capacity issues.

Traceroute

In order to determine the paths traffic is taking, tools such as traceroute are very useful. Traceroute operates by sending packets to a given destination network with the initial IP TTL value set to one. The upstream device will generate an ICMP TTL Exceeded message back to the source, which reveals the first hop in your path to the destination. Subsequent packets are sent from the source with incrementing IP TTL values to show each hop along the way towards the destination. It is important to remember that Internet routing typically involves asymmetric paths – the traffic going towards a destination will take a separate set of hops on the return path. When performing traceroutes to diagnose routing issues, it is very useful to obtain the reverse path to help isolate a particular direction of traffic as the problem. With an understanding of the path traffic is taking in both directions, it is easier to understand what sort of traffic-engineering changes can be made.

When dealing with Network Operation Centers (NOCs) or support groups, it is important to provide the public IPs of the source and destination addresses involved in the communication. This gives individuals the information they need to reproduce the issue being encountered. It is also useful to include any specific details surrounding the communication, such as whether it was HTTP (TCP/80) or HTTPS (TCP/443). Some traceroute applications provide the ability to generate probes using a variety of protocols, such as ICMP Echo Request (ping), UDP or TCP packets to a particular port. Several traceroute programs by default will use ICMP Echo Request or UDP packets (destined to a particular port range). While these work most of the time, various networks on the Internet may filter these sorts of packets, and it is recommended to use a traceroute probe that replicates the type of traffic you intend to send to the destination network. For example, using traceroute with TCP/80 or TCP/443 can yield better results when dealing with firewalls or other packet filtering.

An example of a UDP-based traceroute (using the well-defined traceroute port range), where multiple routers will generate TTL Exceeded messages for packets bound for those destination ports:
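(The output below is a hypothetical illustration using documentation address ranges, not a capture from a real network.)

    $ traceroute -q 1 203.0.113.50
    traceroute to 203.0.113.50 (203.0.113.50), 30 hops max, 60 byte packets
     1  192.0.2.1 (192.0.2.1)  0.432 ms
     2  198.51.100.9 (198.51.100.9)  1.841 ms
     3  198.51.100.21 (198.51.100.21)  10.624 ms
     4  203.0.113.5 (203.0.113.5)  11.072 ms
     5  *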

Note that the last hop does not respond, since it most likely denies UDP packets destined to high ports.

With the same traceroute using TCP/443 (HTTPS), we find multiple routers do not respond but the destination does respond since it is listening on TCP/443:

TCP Traceroute to port 443 (HTTPS):
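(Again a hypothetical illustration with documentation addresses; on Linux, `traceroute -T -p 443` sends TCP SYN probes.)

    $ traceroute -T -p 443 -q 1 203.0.113.50
    traceroute to 203.0.113.50 (203.0.113.50), 30 hops max, 60 byte packets
     1  192.0.2.1 (192.0.2.1)  0.401 ms
     2  *
     3  *
     4  203.0.113.5 (203.0.113.5)  10.977 ms
     5  203.0.113.50 (203.0.113.50)  11.215 ms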

The hops revealed within traceroute provide some insight into the sort of network devices your packets are traversing. Many network operators will add descriptive information in the DNS reverse PTR records, though each network is going to be different. Typically the DNS entries will indicate the router name, some sort of geographical code and the physical or logical router interface the traffic has traversed. Each individual network names its own routers differently, so the information here will usually indicate whether a device is a “core” router (no external or customer interfaces) or an “edge” router (with external network connectivity). Of course, this is not a hard rule, and it is common to find multi-function devices within a network. The geographical identifier can vary between IATA airport codes, telecom CLLI codes (or a variation upon them) or internally generated identifiers that are unique to that particular network. Occasionally shortened versions of a physical address or city names will appear here as well. The interface name can indicate the interface type and speed, though these are only as accurate as the operator is willing to publicly reveal and keep up to date in DNS.

One important aspect of traceroute is that the data should be taken with some skepticism. Traceroute will display the round-trip time (RTT) of each individual hop as the packets traverse the network to their destination. While this value can provide some insight into the latency to these hops, the actual value can be influenced by a variety of factors. For instance, many modern routers treat packets that TTL-expire on them as low priority compared to the other functions the router is performing (forwarding packets, running routing protocols). As a result, the handling of the TTL-expired packet and the generation of the subsequent ICMP TTL Exceeded message can take some time. This is why it is common to occasionally see high RTT on intermediate hops within a traceroute (up to hundreds of milliseconds). This does not always indicate a network issue, and individuals should always measure the end-to-end latency (via ping or application tests). In situations where the RTT does increase at a particular hop and continues to increase on subsequent hops, this can be an indicator of an overall increase in latency at a particular point in the network.

Another item frequently observed in traceroutes is hops that do not respond, which are displayed as *’s. This means that the router(s) at this particular hop have either dropped the TTL-expired packet or have not generated the ICMP TTL Exceeded message. This is usually the result of two possible things. The first is that many modern routers implement Control-Plane Policing (CoPP), packet filters on the router that control how certain types of packets are handled. In many modern routers, the use of ASICs (Application-Specific Integrated Circuits) has improved packet lookup and forwarding functions. When a router ASIC receives a packet with a TTL value of one, it will punt the packet to another location within the router to handle the ICMP TTL Exceeded generation. On most routers, the ICMP TTL Exceeded generation is done on a CPU integrated on a linecard or on the main brain of the router itself (known as a route processor, routing engine or supervisor). Since the CPU of a linecard or routing engine is busy performing tasks such as forwarding-table programming and routing protocols, routers allow protections to be put in place to restrict the rate at which TTL-expired packets can be sent to these components. CoPP allows an operator to set limits such as restricting TTL Exceeded messages to a value like 100 packets per second. Additionally, the router itself may have a further rate-limiter governing how many ICMP TTL Exceeded messages can be generated. In this situation, you’ll find that hops in your traceroute may sometimes not reply at all because of CoPP. This is also why, when performing pings to individual hops (routers) in a traceroute, you will see packet loss: CoPP is dropping the packets. The other case is where CoPP is configured to simply deny all TTL-expired packets. Within traceroute, these hops will always respond with *’s no matter how many times you execute the traceroute.

A good presentation that explains using traceroute on the Internet and interpreting its results is found here: https://www.nanog.org/meetings/nanog45/presentations/Sunday/RAS_traceroute_N45.pdf

Troubleshooting issues on the Internet is no easy task, and it requires examining multiple sets of information (traceroute, BGP routing tables) to come to a conclusion about what may be occurring. The use of Internet looking glasses or route servers is useful for providing a different vantage point on the Internet when troubleshooting. The Looking Glass Wikipedia page has several links to sites you can use to perform pings, traceroutes and examine a BGP routing table from different spots around the world in various networks.

When reaching out to networks or posting in forums looking for support with Internet routing issues, it is important to provide useful information for troubleshooting. This includes the source IP address (the public IP, not a private/NAT-translated one), the destination IP (again, the public IP), the protocol and ports being used (TCP/80, for example) and the specific time/date when you observed the issue. Traceroutes in both directions are incredibly useful, since paths on the Internet can be asymmetric.

AWS and Compartmentalization

Post Syndicated from Colm MacCarthaigh original https://aws.amazon.com/blogs/architecture/aws-and-compartmentalization/

Practically every experienced driver has suffered a flat tire. It’s a real nuisance: you pull over, empty the trunk to get out your spare wheel, jack up the car and replace the punctured wheel before driving to a nearby repair shop. For a car that’s OK; we can tolerate the occasional nuisance, and as drivers we’re never that far from a safe place to pull over or a friendly repair shop.

Using availability terminology, a spare tire is a kind of standby: a component or system that sits idle, waiting to be deployed when needed. These are common in computer systems too. Many databases rely on standby failover, for example, and some of them even rely on personal intervention, with a human running a script as they might wind a car jack (though we’d recommend using Amazon Relational Database Service instead, which includes automated failover).

But when the stakes are higher, things are done a little differently. Take the systems in a modern passenger jet for example, which despite recent tragic events, have a stellar safety record. A flight can’t pull over, and in the event of a problem an airliner may have to make it several hours before being within range of a runway. For passenger jets it’s common for critical systems to use active redundancy. A twin-engine jet can fly with just one working engine, for example – so if one fails, the other can still easily keep the jet in the air.

This kind of model is also common in large web systems. There are many EC2 instances handling amazon.com for example, and when one occasionally fails there’s a buffer of capacity spread across the other servers ensuring that customers don’t even notice.

Jet engines don’t simply fail on their own though. Any one of dozens of components—digital engine controllers, fuel lines and pumps, gears and shafts, and so on–can cause the engine to stop working. For every one of these components, the aircraft designers could try to include some redundancy at the component level (and some do, such as avionics), but there are so many that it’s easier to re-frame the design in terms of fault isolation or compartmentalization: as long as each engine depends on separate instances of each component, then no one component can take out both engines. A fuel line may break, but it can only stop one engine from functioning, and the plane has already been designed to work with one engine out.

This kind of compartmentalization is particularly useful for complex computer systems. A large website or web service may depend on tens or even hundreds of sub-services. Only so many can themselves include robust active redundancy. By aligning instances of sub-services so that inter-dependencies never go across compartments we can make sure that a problem can be contained to the compartment it started in. It also means that we can try to resolve problems by quarantining whole compartments, without needing to find the root of the problem within the compartment.

AWS and Compartmentalization

Amazon Web Services includes some features and offerings that enable effective compartmentalization. Firstly, many Amazon Web Services—for example, Amazon S3 and Amazon RDS—are themselves internally compartmentalized and make use of active redundancy designs so that when failures occur they are hidden.

Secondly, we offer web services and resources in a range of sizes, along with automation in the form of auto-scaling, CloudFormation templates, and OpsWorks recipes that make it easy to manage a higher number of instances.

There is a subtle but important distinction between running a small number of large instances, and a large number of small instances. Four m3.xlarge instances cost as much as two m3.2xlarge instances and provide the same amount of CPU and storage; but for high availability configurations, using four instances requires only a 33% failover capacity buffer and any host-level problem may impact one quarter of your load, whereas using two instances means a 100% buffer and any problem may impact half of your load.
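The arithmetic behind those buffer numbers: to lose one instance out of N and still carry the full load, each of the remaining N − 1 instances needs headroom of

    \text{failover buffer} = \frac{1}{N - 1}, \qquad N = 4 \Rightarrow \tfrac{1}{3} \approx 33\%, \qquad N = 2 \Rightarrow 100\%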

Thirdly, Amazon Web Services has pre-made compartments: up to four availability zones per region. These availability zones are deeply compartmentalized down to the datacenter, network and power level.

Suppose that we create a web site or web service that utilizes four availability zones. This means we need a 25% failover capacity buffer per zone (which compares well to a 100% failover capacity buffer in a standard two data center model). Our service consists of a front end, two dependent backend services (“Foo” and “Bar”) and a data-store (for this example, we’ll use S3).

By constraining any sub-service calls to stay “within” the availability zone we make it easier to isolate faults. If backend service “Bar” fails (for example a software crash) in us-east-1b, this impacts 1/4th of our over-all capacity.

Initially this may not seem much better than if we had spread calls to the Bar service from all zones across all instances of the Bar service; after all, the failure rate would also be one quarter. But the difference is profound.

Firstly, experience has shown that small problems can often become amplified in complex systems. For example if it takes the “Foo” service longer to handle a failed call to the “Bar” service, then the initial problem with the “Bar” service begins to impact the behavior of “Foo” and in turn the frontends.

Secondly, by having a simple all-purpose mechanism to fail away from the infected availability zone, the problem can be reliably, simply, and quickly neutralized, just as a plane can be designed to fly on one engine and many types of failure handled with one procedure—if the engine is malfunctioning and a short checklist’s worth of actions don’t restore it to health, just shut it down and land at the next airport.

Route 53 Infima

Our suggested mechanism for handling this kind of failure is Amazon Route 53 DNS Failover. As DNS is the service that turns service/website names into the list of particular front-end IP addresses to connect to, it sits at the start of every request and is an ideal layer to neutralize problems.

With Route 53 health checks and DNS failover, each front-end is constantly health checked and automatically removed from DNS if there is a problem. Route 53 Health Check URLs are fully customizable and can point to a script that checks every dependency in the availability zone (“Is Foo working, Is Bar working, is S3 reachable, etc …”).
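As a sketch of what such a health check target could look like (the dependency URLs below are hypothetical placeholders for the “Foo”, “Bar” and S3 checks, not the implementation of any AWS service):

    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical in-zone dependency endpoints; replace with your own checks.
    DEPENDENCIES = [
        "http://foo.internal.example.com/health",                    # "Foo" sub-service in this zone
        "http://bar.internal.example.com/health",                    # "Bar" sub-service in this zone
        "https://example-healthcheck-bucket.s3.amazonaws.com/ping",  # hypothetical S3 object for reachability
    ]

    def dependencies_ok(timeout=2):
        for url in DEPENDENCIES:
            try:
                urllib.request.urlopen(url, timeout=timeout)
            except Exception:
                return False
        return True

    class HealthCheckHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Route 53 treats 2xx/3xx responses as healthy; anything else fails the check.
            self.send_response(200 if dependencies_ok() else 503)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), HealthCheckHandler).serve_forever()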

This brings us to Route 53 Infima. Infima is a library designed to model compartmentalization systematically and to help represent those kinds of configurations in DNS. With Infima, you assign endpoints to specific compartments, such as availability zones. For advanced configurations you may also layer in additional compartmentalization dimensions; for example, you may want to run two different software implementations of the same service (perhaps for blue/green deployments, or for application-level redundancy) in each availability zone.

Once the Infima library has been taught the layout of endpoints within the compartments, failures can be simulated in software and any gaps in capacity identified. But the real power of Infima comes in expressing these configurations in DNS. Our example service had 4 endpoints, in 4 availability zones. One option for expressing this in DNS is to return each endpoint one time in every four. Each answer could also depend on a health check, and when the health check fails, it could be removed from DNS. Infima supports this configuration.

However, there is a better option. DNS (and naturally Route 53) allows several endpoints to be represented in a single answer, for example:
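For instance, an answer for the four-endpoint example might carry all four addresses at once (the name and addresses here are illustrative, drawn from the documentation ranges):

    ; illustrative answer (documentation addresses)
    api.example.com.    60    IN    A    192.0.2.1
    api.example.com.    60    IN    A    192.0.2.2
    api.example.com.    60    IN    A    192.0.2.3
    api.example.com.    60    IN    A    192.0.2.4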

 

When clients (such as browsers or web services clients) receive these answers they generally try several endpoints until they find one that successfully connects. So by including all of the endpoints we gain some fault tolerance. When an endpoint is failing though, as we’ve seen before, the problem can spread and clients can incur retry timers and some delay, so it’s still desirable to remove IPs from DNS answers in a timely manner.

Infima can use the list of compartments, endpoints and their healthchecks to build what we call a RubberTree, a pre-computed decision tree of DNS answers that has answers pre-baked ready and waiting for potential failures: a single node failing, a whole compartment failing, combinations of each and so on. This decision tree is then stored as a Route 53 configuration and can automatically handle any failures. So if the 192.0.2.3 endpoint were to fail, then:
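    ; illustrative answer with the failed address withdrawn
    api.example.com.    60    IN    A    192.0.2.1
    api.example.com.    60    IN    A    192.0.2.2
    api.example.com.    60    IN    A    192.0.2.4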

 

will be returned. By having these decision trees pre-baked and always ready and waiting, Route 53 is able to react quickly to endpoint failures, which with compartmentalization means we are also ready to handle failures of any sub-service serving that endpoint.

The compartmentalization we’ve seen so far is most useful for certain kinds of errors; host-level problems, occasional crashes, application-lockups. But if the problem originates with front-end level requests themselves, for example a denial of service attack, or a “poison pill” request that triggers a calamitous bug then it can quickly infect all of your compartments. Infima also includes some neat functionality to assist in isolating even these kinds of faults, and that will be the topic of our next post.

Bonus Content: Busting Caches

I wrote that removing failing endpoints from DNS in a timely manner is important, even when there are multiple endpoints in an answer. One problem we respond to in this area is broken application-level DNS caching. Certain platforms, including many versions of Java, do not respect DNS cache lifetimes (the DNS time-to-live or TTL value), and once a DNS response has been resolved it will be used indefinitely.

One way to mitigate this problem is to use cache “busting”. Route 53 supports wildcard records (and wildcard ALIASes, CNAMEs and more). Instead of using a service name such as “api.example.com”, it is possible to use a wildcard name such as “*.api.example.com”, which will match requests for any name ending in “.api.example.com”.

An application may then be written in such a way as to resolve a partially random name, e.g. “sdsHdsk3.api.example.com”. This name, since it ends in api.example.com will still receive the right answer, but since it is a unique random name every time, it will defeat (or “bust”) any broken platform or OS DNS caching.
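A minimal sketch of the client side of this technique (assuming the wildcard record exists; the label-generation scheme is just an example):

    import socket
    import uuid

    # Resolve a unique, random name under the wildcard so that broken caches
    # can never serve a stale answer for it.
    def resolve_fresh(base="api.example.com"):
        random_label = uuid.uuid4().hex[:8]          # a throwaway label, similar to "sdsHdsk3"
        name = "%s.%s" % (random_label, base)
        return socket.gethostbyname_ex(name)[2]      # list of A record addresses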

– Colm MacCárthaigh