Tag Archives: software

Easily deploy SaaS products with new Quick Launch in AWS Marketplace

2023-11-30 Marcia Villalba

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/easily-deploy-saas-products-with-new-quick-launch-in-aws-marketplace/

Today we are excited to announce the general availability of SaaS Quick Launch, a new feature in AWS Marketplace that makes it easy and secure to deploy SaaS products.

Before SaaS Quick Launch, configuring and launching third-party SaaS products could be time-consuming and costly, especially in certain categories like security and monitoring. Some products require hours of engineering time to manually set up permissions policies and cloud infrastructure. Manual multistep configuration processes also introduce risks when buyers rely on unvetted deployment templates and instructions from third-party resources.

SaaS Quick Launch helps buyers make the deployment process easy, fast, and secure by offering step-by-step instructions and resource deployment using preconfigured AWS CloudFormation templates. The software vendor and AWS validate these templates to ensure that the configuration adheres to the latest AWS security standards.

Getting started with SaaS Quick Launch
It’s easy to find which SaaS products have Quick Launch enabled when you are browsing in AWS Marketplace. Products that have this feature configured have a Quick Launch tag in their description.

After completing the purchase process for a Quick Launch–enabled product, you will see a button to set up your account. That button will take you to the Configure and launch page, where you can complete the registration to set up your SaaS account, deploy any required AWS resources, and launch the SaaS product.

The first step ensures that your account has the required AWS permissions to configure the software.

The second step involves configuring the vendor account, either to sign in to an existing account or to create a new account on the vendor website. After signing in, the vendor site may pass essential keys and parameters that are needed in the next step to configure the integration.

The third step allows you to configure the software and AWS integration. In this step, the vendor provides one or more CloudFormation templates that provision the required AWS resources to configure and use the product.

The final step is to launch the software once everything is configured.

Availability
Sellers can enable this feature in their SaaS product. If you are a seller and want to learn how to set this up in your product, check the Seller Guide for detailed instructions.

To learn more about SaaS in AWS Marketplace, visit the service page and view all the available SaaS products currently in AWS Marketplace.

— Marcia

Intel X86-S Streamlined 64-bit Instruction Set A Perspective

2023-05-21 Patrick Kennedy

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/intel-x86-s-streamlined-64-bit-instruction-set-a-perspective/

Intel released an initial X86-S simplification white paper that would remove 16-bit and 32-bit support. We say do it and explain why

The post Intel X86-S Streamlined 64-bit Instruction Set A Perspective appeared first on ServeTheHome.

Streaming Android games from cloud to mobile with AWS Graviton-based Amazon EC2 G5g instances

2023-04-13 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/streaming-android-games-from-cloud-to-mobile-with-aws-graviton-based-amazon-ec2-g5g-instances/

This blog post is written by Vincent Wang, GCR EC2 Specialist SA, Compute.

Streaming games from the cloud to mobile devices is an emerging technology that allows less powerful and less expensive devices to play high-quality games with lower battery consumption and less storage capacity. This technology enables a wider audience to enjoy high-end gaming experiences from their existing devices, such as smartphones, tablets, and smart TVs.

To load games for streaming on AWS, it’s necessary to use Android environments that can utilize GPU acceleration for graphics rendering and optimize for network latency. Cloud-native products, such as the Anbox Cloud Appliance or Genymotion available on the AWS Marketplace, can provide a cost-effective containerized solution for game streaming workloads on Amazon Elastic Compute Cloud (Amazon EC2).

For example, Anbox Cloud’s virtual device infrastructure can run games with low latency and high frame rates. When combined with the AWS Graviton-based Amazon EC2 G5g instances, which offer a cost reduction of up to 30% per-game stream per-hour compared to x86-based GPU instances, it enables companies to serve millions of customers in a cost-efficient manner.

In this post, we chose the Anbox Cloud Appliance to demonstrate how you can use it to stream a resource-demanding game called Genshin Impact. We use a G5g instance along with a mobile phone to run the streamed game inside of a Firefox browser application.

Overview

Graviton-based instances utilize fewer compute resources than x86-based instances due to the 64-bit architecture of Arm processors used in AWS Graviton servers. As shown in the following diagram, Graviton instances eliminate the need for cross-compilation or Android emulation. This simplifies development efforts and reduces time-to-market, thereby lowering the cost-per-stream. With G5g instances, customers can now run their Android games natively, encode CPU or GPU-rendered graphics, and stream the game over the network to multiple mobile devices.

Figure 1: Architecture difference when running Android on X86-based instance and Graviton-based instance.

Real-time ray-traced rendering is required for most modern games to deliver photorealistic objects and environments with physically accurate shadows, reflections, and refractions. The G5g instance, which is powered by AWS Graviton2 processors and NVIDIA T4G Tensor Core GPUs, provides a cost-effective solution for running these resource-intensive games.

Architecture

Figure 2: Architecture of Android Streaming Game.

When streaming games from a mobile device, only input data (touchscreen, audio, etc.) is sent over the network to the game streaming server hosted on a G5g instance. Then, the input is directed to the appropriate Android container designated for that particular client. The game application running in the container processes the input and updates the game state accordingly. Then, the resulting rendered image frames are sent back to the mobile device for display on the screen. In certain games, such as multiplayer games, the streaming server must communicate with external game servers to reflect the full game state. In these cases, additional data is transferred to and from game servers and back to the mobile client. The communication between clients and the streaming server is performed using the WebRTC network protocol to minimize latency and make sure that users’ gaming experience isn’t affected.

The Graviton processor handles compute-intensive tasks, such as the Android runtime and I/O transactions on the streaming server. However, for resource-demanding games, the Nvidia GPU is utilized for graphics rendering. To scale effortlessly, the Anbox Cloud software can be utilized to manage and execute several game sessions on the same instance.

Prerequisites

First, you need an Ubuntu single sign-on (SSO) account. If you don’t have one yet, you may create one from Ubuntu One website. Then you need an Android mobile phone with Firefox or Chrome browser installed to play the streaming games.

Setup

We can install Anbox Cloud Appliance in the AWS Marketplace. Select the Arm variant so that it works on Graviton-based instances. If the subscription doesn’t work on the first try, then you receive an email which guides you to a page where you can try again.

Figure 3: Subscribe Anbox Cloud Appliance in AWS Marketplace.

In this demonstration, we select G5g.xlarge in the Instance type section and leave all settings with default values, except the storage as per the following:

A root disk with minimum 50 GB (required)
An additional Amazon Elastic Block Store (Amazon EBS) volume with at least 100 GB (recommended)

For the Genshin Impact demo, we recommend a specific amount of storage. However, when deploying your Android applications, you must select an appropriate storage size based on the package size. Additionally, you should choose an instance size based on the resources that you plan to utilize for your gaming sessions, such as CPU, memory, and networking. In our demo, we launched only one session from a single mobile device.

Launch the instance and wait until it reaches running status. Then you can secure shell (SSH) to the instance to configure the Android environment.

Install Anbox cloud

To make sure of the security and reliability of some of the package repositories used, we update the CUDA Linux GPG Repository Key. View this Nvidia blog post for more details on this procedure.

$ sudo apt-key del 7fa2af80

$ wget

https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/sbsa/cuda keyring_1.0-1_all.deb

$ sudo dpkg -i cuda-keyring_1.0-1_all.deb

As the Android in Anbox Cloud Appliance is running in an LXD container environment, upgrade LXD to the latest version.

$ sudo snap refresh –channel=5.0/stable lxd

Install the Anbox Cloud Appliance software using the following command and selecting the default answers:

$ sudo anbox-cloud-appliance init

Watch the status page at https://$(ec2_public_DNS_name) for progress information.

Figure 4: The status of deploying Anbox Cloud.

The initialization process takes approximately 20 minutes. After it’s complete, register the Ubuntu SSO account previously created, then follow the instructions provided to finalize the process.

$ anbox-cloud-appliance dashboard register <your Ubuntu SSO email address>

Stream an Android game application

Use the sample from the following repo to setup the service on the streaming server:

$ git clone https://github.com/anbox-cloud/cloud-gaming-demo.git

Build the Flutter web UI:

$ sudo snap install flutter –classic

$ cd cloud-gaming-demo/ui && flutter build web && cd ..

$ mkdir -p backend/service/static

$ cp -av ui/build/web/* backend/service/static

Then build the backend service which processes requests and interacts with the Anbox Stream Gateway to create instances of game applications. Start by preparing the environment:

$ sudo apt-get install python3-pip

$ sudo pip3 install virtualenv

$ cd backend && virtualenv venv

Create the configuration file for the backend service so that it can access the Anbox Stream Gateway. There are two parameters to set: gateway-URL and gateway-token. The gateway token can be obtained from the following command:

$ anbox-cloud-appliance gateway account create <account-name>

Create a file called config.yaml that contains the two values:

gateway-url: https:// <EC2 public DNS name>

gateway-token: <gateway_token>

Add the following line to the activate hook in the backend/venv/bin/ directory so that the backend service can read config.yaml on its startup:

$ export CONFIG_PATH=<path_to_config_yaml>

Now we can launch the backend service which will be served by default on TCP port 8002.

$./run.sh

In the next steps, we download a game and build it via Anbox Cloud. We need an Android APK and a configuration file. Create a folder under the HOME directory and create a manifest.yaml file in the folder. In this example, we must add the following details in the file. You can refer to the Anbox Cloud documentation for more information on the format.

name: genshin

instance-type: g10.3

resources:

cpus: 10

memory: 25GB

disk-size: 50GB

gpu-slots: 15

features: [“enable_virtual_keyboard”]

Select an APK for the arm64-v8a architecture which is natively supported on Graviton. In this example, we download Genshin Impact, an action role-playing game developed and published by miHoYo. You must supply your own Android APK if you want to try these steps. Download the APK into the folder and rename it to app.apk. Overall, the final layout of the game folder should look as follows:

├── app.apk

└── manifest.yaml

Run the following command from the folder to create the application:

$ amc application create .

Wait until the application status changes to ready. You can monitor the status with the following command:

$ amc application ls

Edit the following:

Update the gameids variable defined in the ui/lib/homepage.dart file to include the name of the game (as declared in the manifest file).
Insert a new key/value pair to the static appNameMap and appDesMap variables defined in the lib/api/application.dart file.
Provide a screenshot of the game (in jpeg format), rename it to <game-name>.jpeg, and put it into the ui/lib/assets directory.

Then, re-build the web UI, copy the contents from the ui/build/web folder to the backend/service/static directory, and refresh the webpage.

Test the game

Using your mobile phone, open the Firefox browser or another browser that supports WebRTC. Type the public DNS name of the G5g instance with the 8002 TCP port, and you should see something similar to the following:

Figure 5: The webpage of the Android streaming game portal.

Select the Play now button, wait a moment for the application to be setup on the server side, and then enjoy the game.

Figure 6: The screen capture of playing Android streaming game.

Clean-up

Please cancel the subscription of the Anbox Cloud Appliance in the AWS Marketplace, you can follow the AWS Marketplace Buyer Guide for more details, then terminate the G5g.xlarge instance to avoid incurring future costs.

Conclusion

In this post, we demonstrated how a resource-intensive Android game runs natively on a Graviton-based G5g instance and is streamed to an Arm-based mobile device. The benefits include better price-performance, reduced development effort, and faster time-to-market. One way to run your games efficiently on the cloud is through software available on the AWS Marketplace, such as the Anbox Cloud Appliance, which was showcased as an example method.

To learn more about AWS Graviton, visit the official product page and the technical guide.

Geekbench 6 Launched Big Benchmark Updates We Try It

2023-02-14 Patrick Kennedy

Post Syndicated from Patrick Kennedy original https://www.servethehome.com/geekbench-6-launched-big-benchmark-updates-we-try-it/

Geekbench 6 is out. We have had the chance to try a few runs of the Windows and Linux versions of the benchmark software prior to its release

The post Geekbench 6 Launched Big Benchmark Updates We Try It appeared first on ServeTheHome.

Let’s Architect! Architecting for sustainability

2023-02-08 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-architecting-for-sustainability/

Sustainability is an important topic in the tech industry, as well as society as a whole, and defined as the ability to continue to perform a process or function over an extended period of time without depletion of natural resources or the environment.

One of the key elements to designing a sustainable workload is software architecture. Think about how event-driven architecture can help reduce the load across multiple microservices, leveraging solutions like batching and queues. In these cases, the main traffic is absorbed at the entry-point of a cloud workload and ease inside your system. On top of architecture, think about data patterns, hardware optimizations, multi-environment strategies, and many more aspects of a software development lifecycle that can contribute to your sustainable posture in the Cloud.

The key takeaway: designing with sustainability in mind can help you build an application that is not only durable but also flexible enough to maintain the agility your business requires.

In this edition of Let’s Architect!, we share hands-on activities, case studies, and tips and tricks for making your Cloud applications more sustainable.

Architecting sustainably and reducing your AWS carbon footprint

Amazon Web Services (AWS) launched the Sustainability Pillar of the AWS Well-Architected Framework to help organizations evaluate and optimize their use of AWS services, and built the customer carbon footprint tool so organizations can monitor, analyze, and reduce their AWS footprint.

This session provides updates on these programs and highlights the most effective techniques for optimizing your AWS architectures. Find out how Amazon Prime Video used these tools to establish baselines and drive significant efficiencies across their AWS usage.

Take me to this re:Invent 2022 video!

Prime Video case study for understanding how the architecture can be designed for sustainability

Optimize your modern data architecture for sustainability

The modern data architecture is the foundation for a sustainable and scalable platform that enables business intelligence. This AWS Architecture Blog series provides tips on how to develop a modern data architecture with sustainability in mind.

Comprised of two posts, it helps you revisit and enhance your current data architecture without compromising sustainability.

Take me to Part 1! | Take me to Part 2!

An AWS data architecture; it’s now time to account for sustainability

AWS Well-Architected Labs: Sustainability

This workshop introduces participants to the AWS Well-Architected Framework, a set of best practices for designing and operating high-performing, highly scalable, and cost-efficient applications on AWS. The workshop also discusses how sustainability is critical to software architecture and how to use the AWS Well-Architected Framework to improve your application’s sustainability performance.

Take me to this workshop!

Sustainability implementation best practices and monitoring

Sustainability in the cloud with Rust and AWS Graviton

In this video, you can learn about the benefits of Rust and AWS Graviton to reduce energy consumption and increase performance. Rust combines the resource efficiency of programming languages, like C, with memory safety of languages, like Java. The video also explains the benefits deriving from AWS Graviton processors designed to deliver performance- and cost-optimized cloud workloads. This resource is very helpful to understand how sustainability can become a driver for cost optimization.

Take me to this re:Invent 2022 video!

Discover how Rust and AWS Graviton can help you make your workload more sustainable and performant

See you next time!

Thanks for joining us to discuss sustainability in the cloud! See you in two weeks when we’ll talk about tools for architects.

To find all the blogs from this series, you can check the Let’s Architect! list of content on the AWS Architecture Blog.

Let’s Architect! Designing event-driven architectures

2023-01-25 Luca Mezzalira

Post Syndicated from Luca Mezzalira original https://aws.amazon.com/blogs/architecture/lets-architect-designing-event-driven-architectures/

During the design of distributed systems, we have to identify a communication strategy to exchange information between different services while keeping the evolutionary nature of the architecture in mind. Event-driven architectures are based on events (facts that happened in a system), which are asynchronously exchanged to implement communication across different services while having a high degree of decoupling. This paradigm also allows us to run code in response to events, with benefits like cost optimization and sustainability for the entire infrastructure.

In this edition of Let’s Architect!, we share architectural resources to introduce event-driven architectures, how to build them on AWS, and how to approach the design phase.

AWS re:Invent 2022 – Keynote with Dr. Werner Vogels

re:Invent 2022 may be finished, but the keynote given by Amazon’s Chief Technology Officer, Dr. Werner Vogels, will not be forgotten. Vogels not only covered the announcements of new services but also event-driven architecture foundations in conjunction with customers’ stories on how this architecture helped to improve their systems.

Take me to this re:Invent 2022 video!

Dr. Werner Vogels presenting an example of architecture where Amazon EventBridge is used as event bus

Benefits of migrating to event-driven architecture

In this blog post, we enumerate clearly and concisely the benefits of event-driven architectures, such as scalability, fault tolerance, and developer velocity. This is a great post to start your journey into the event-driven architecture style, as it explains the difference from request-response architecture.

Take me to this Compute Blog post!

Two common options when building applications are request-response and event-driven architectures

Building next-gen applications with event-driven architectures

When we build distributed systems or migrate from a monolithic to a microservices architecture, we need to identify a communication strategy to integrate the different services. Teams who are building microservices often find that integration with other applications and external services can make their workloads tightly coupled.

In this re:Invent 2022 video, you learn how to use event-driven architectures to decouple and decentralize application components through asynchronous communication. The video introduces the differences between synchronous and asynchronous communications before drilling down into some key concepts for designing and building event-driven architectures on AWS.

Take me to this re:Invent 2022 video!

How to use choreography to exchange information across services plus implement orchestration for managing operations within the service boundaries

Designing events

When starting on the journey to event-driven architectures, a common challenge is how to design events: “how much data should an event contain?” is a typical first question we encounter.

In this pragmatic post, you can explore the different types of events, watch a video that explains even further how to use event-driven architectures, and also go through the new event-driven architecture section of serverlessland.com.

Take me to Serverless Land!

An example of events with sparse and full state description

See you next time!

Thanks for reading our first blog of 2023! Join us next time, when we’ll talk about architecture and sustainability.

To find all the blogs from this series, visit the Let’s Architect! section of the AWS Architecture Blog.

The tale of a single register value

2021-11-03 Jakub Sitnicki

Post Syndicated from Jakub Sitnicki original https://blog.cloudflare.com/the-tale-of-a-single-register-value/

“Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.” — Sherlock Holmes

Intro

The tale of a single register value

It’s not every day that you get to debug what may well be a packet of death. It was certainly the first time for me.

What do I mean by “a packet of death”? A software bug where the network stack crashes in reaction to a single received network packet, taking down the whole operating system with it. Like in the well known case of Windows ping of death.

Challenge accepted.

It starts with an oops

Around a year ago we started seeing kernel crashes in the Linux ipv4 stack. Servers were crashing sporadically, but we learned the hard way to never ignore cases like that — when possible we always trace crashes. We also couldn’t tie it to a particular kernel version, which could indicate a regression which hopefully could be tracked down to a single faulty change in the Linux kernel.

The crashed servers were leaving behind only a crash report, affectionately known as a “kernel oops”. Let’s take a look at it and go over what information we have there.

Parts of the oops, like offsets into functions, need to be decoded in order to be human-readable. Fortunately Linux comes with the decode_stacktrace.sh script that did the work for us.

All we need is to install a kernel debug and source packages before running the script. We will use the latest version of the script as it has been significantly improved since Linux v5.4 came out.

$ RELEASE=`uname -r`
$ apt install linux-image-$RELEASE-dbg linux-source-$RELEASE
$ curl -sLO https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/scripts/decode_stacktrace.sh
$ curl -sLO https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/scripts/decodecode
$ chmod +x decode_stacktrace.sh decodecode
$ ./decode_stacktrace.sh -r 5.4.14-cloudflare-2020.1.11 < oops.txt > oops-decoded.txt

When decoded, the oops report is even longer than before! But that is a good thing. There is new information there that can help us.

What has happened?

With this much input we can start sketching a picture of what could have happened. First thing to check is where exactly did we crash?

The report points at line 5160 in the skb_gso_transport_seglen() function. If we take a look at the source code, we can get a rough idea of what happens there. We are processing a Generic Segmentation Offload (GSO) packet carrying an encapsulated TCP packet. What is a GSO packet? In this context it’s a batch of consecutive TCP segments, travelling through the network stack together to amortize the processing cost. We will look more at the GSO later.

net/core/skbuff.c:
5150) static unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
5151) {
          …
5155)     if (skb->encapsulation) {
                  …
5159)             if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
5160)                     thlen += inner_tcp_hdrlen(skb); 👈
5161)     } else if (…) {
          …
5172)     return thlen + shinfo->gso_size;
5173) }

The exact line where we crashed belongs to an if-branch that handles tunnel traffic. It calculates the length of the TCP header of the inner packet, that is the encapsulated one. We do that to compute the length of the outer L4 segment, which accounts for the inner packet length:

To understand how the length of the inner TCP header is computed we have to peel off a few layers of inlined function calls:

inner_tcp_hdrlen(skb)
⇓
inner_tcp_hdr(skb)->doff * 4
⇓
((struct tcphdr *)skb_inner_transport_header(skb))->doff * 4
⇓
((struct tcphdr *)(skb->head + skb->inner_transport_header))->doff * 4

Now it is clear that inner_tcp_hdrlen(skb) simply reads the Data Offset field (doff) inside the inner TCP header. Because Data Offset carries the number of 32-bit words in the TCP header, we multiply it by 4 to get the TCP header length in bytes.

From the memory access point of view, to read the Data Offset value we need to:

load skb->head value from address skb + offsetof(struct sk_buff, head)
load skb->inner_transport_header value from address skb + offsetof(struct sk_buff, inner_transport_header),
load the TCP Data Offset from skb->head + skb->inner_transport_header + offsetof(struct tcphdr, doff)

Potentially, any of these loads could trigger a page fault. But it’s unlikely that skb contains an invalid address since we accessed the skb->encapsulation field without crashing just a few lines earlier. Our main suspect is the last load.

The invalid memory address we attempt to load from should be in one of the CPU registers at the time of the exception. And we have the CPU register snapshot in the oops report. Which register holds the address? That has been decided by the compiler. We will need to take a look at the instruction stream to discover that.

Remember the disassembly in the decoded kernel oops? Now is the time to go back to it. Hint, it’s in AT&T syntax. But to give everyone a fair chance to follow along, here’s the same disassembly but in Intel syntax. (Alright, alright. You caught me. I just can’t read the AT&T syntax.)

All code
========
   0:   c0 41 83 e0             rol    BYTE PTR [rcx-0x7d],0xe0
   4:   11 f6                   adc    esi,esi
   6:   87 81 00 00 00 20       xchg   DWORD PTR [rcx+0x20000000],eax
   c:   74 30                   je     0x3e
   e:   0f b7 87 aa 00 00 00    movzx  eax,WORD PTR [rdi+0xaa]
  15:   0f b7 b7 b2 00 00 00    movzx  esi,WORD PTR [rdi+0xb2]
  1c:   48 01 c1                add    rcx,rax
  1f:   48 29 f0                sub    rax,rsi
  22:   45 85 c0                test   r8d,r8d
  25:   48 89 c6                mov    rsi,rax
  28:   74 0d                   je     0x37
  2a:*  0f b6 41 0c             movzx  eax,BYTE PTR [rcx+0xc]           <-- trapping instruction
  2e:   c0 e8 04                shr    al,0x4
  31:   0f b6 c0                movzx  eax,al
  34:   8d 04 86                lea    eax,[rsi+rax*4]
  37:   0f b7 52 04             movzx  edx,WORD PTR [rdx+0x4]
  3b:   01 d0                   add    eax,edx
  3d:   c3                      ret
  3e:   45                      rex.RB
  3f:   85                      .byte 0x85

Code starting with the faulting instruction
===========================================
   0:   0f b6 41 0c             movzx  eax,BYTE PTR [rcx+0xc]
   4:   c0 e8 04                shr    al,0x4
   7:   0f b6 c0                movzx  eax,al
   a:   8d 04 86                lea    eax,[rsi+rax*4]
   d:   0f b7 52 04             movzx  edx,WORD PTR [rdx+0x4]
  11:   01 d0                   add    eax,edx
  13:   c3                      ret
  14:   45                      rex.RB
  15:   85                      .byte 0x85

When the trapped page fault happened, we tried to load from address %rcx + 0xc, or 12 bytes from whatever memory location %rcx held. Which is hardly a coincidence since the Data Offset field is 12 bytes into the TCP header.

This means that %rcx holds the computed skb->head + skb->inner_transport_header address. Let’s take a look at it:

RSP: 0018:ffffa4740d344ba0 EFLAGS: 00010202
RAX: 000000000000feda RBX: ffff9d982becc900 RCX: ffff9d9624bbaffc
RDX: ffff9d9624babec0 RSI: 000000000000feda RDI: ffff9d982becc900
…

The RCX value doesn’t look particularly suspicious. We can say that:

it’s in a kernel virtual address space because it is greater than 0xffff000000000000 – expected, and
it is very close to the 4 KiB page boundary (0xffff9d9624bbb000 – 4),

… and not much more.

We must go back further in the instruction stream. Where did the value in %rcx come from? What I like to do is try to correlate the machine code leading up to the crash with pseudo source code:

<function entry>                # %rdi = skb
…
movzx  eax,WORD PTR [rdi+0xaa]  # %eax = skb->inner_transport_header
movzx  esi,WORD PTR [rdi+0xb2]  # %esi = skb->transport_header
add    rcx,rax                  # %rcx = skb->head + skb->inner_transport_header
sub    rax,rsi                  # %rax = skb->inner_transport_header - skb->transport_header
test   r8d,r8d
mov    rsi,rax                  # %rsi = skb->inner_transport_header - skb->transport_header
je     0x37
movzx  eax,BYTE PTR [rcx+0xc]   # %eax = *(skb->head + skb->inner_transport_header + offsetof(struct tcphdr, doff))

How did I decode that assembly snippet? We know that the skb address was passed to our function in the %rdi register because the System V AMD64 ABI calling convention dictates that. If the %rdi register hasn’t been clobbered by any function calls, or reused because the compiler decided so, then maybe, just maybe, it still holds the skb address.

If 0xaa and 0xb2 are offsets into an sk_buff structure, then pahole tool can tell us which fields they correspond to:

$ pahole --hex -C sk_buff /usr/lib/debug/vmlinux-5.4.14-cloudflare-2020.1.11 | grep '\(head\|inner_transport_header\|transport_header\);'
        __u16                      inner_transport_header; /*  0xaa   0x2 */
        __u16                      transport_header;     /*  0xb2   0x2 */
        unsigned char *            head;                 /*  0xc0   0x8 */

To confirm our guesswork, we can disassemble the whole function in gdb.

It would be great to find out the value of the inner_transport_header and transport_header offsets. But the registers that were holding them, %rax and %rsi, respectively, were reused after the offset values were loaded.

However, we can still examine the difference between inner_transport_header and transport_header that both %rax and %rsi hold. Let’s take a look.

The suspicious offset

Here are the register values from the oops as a reminder:

RAX: 000000000000feda RBX: ffff9d982becc900 RCX: ffff9d9624bbaffc
RDX: ffff9d9624babec0 RSI: 000000000000feda RDI: ffff9d982becc900

From the register snapshot we can tell that:

%rax = %rsi = skb->inner_transport_header - skb->transport_header = 0xfeda = 65242

That is clearly suspicious. We expect that skb->transport_header < skb->inner_transport_header, so either

skb->inner_transport_header > 0xfeda, which would mean that between outer and inner L4 packets there is 65k+ bytes worth of headers – unlikely, or
0xfeda is a garbage value, perhaps an effect of an underflow if skb->inner_transport_header < skb->transport_header.

Let’s entertain the theory that an underflow has occurred.

Any other scenario, be it an out-of-bounds write or a use-after-free that corrupted the memory, is a scary prospect where we don’t stand much chance of debugging it without help from tools like KASAN report.

But if we assume for a moment that it’s an underflow, then the task is simple 😉. We “just” need to audit all places where skb->inner_transport_header or skb->transport_header offsets could have been updated while the skb buffer travelled through the network stack.

That raises the question — what path did the packet take through the network stack before it brought the machine down?

Packet path

It is time to take a look at the call trace in the oops report. If we walk through it, it is apparent that a veth device received a packet. The packet then got routed and forwarded to some other network device. The kernel crashed before the egress device transmitted the packet out.

What immediately draws our attention is the veth_poll() function in the call trace. Polling inside a virtual device that acts as a simple pipe joining two network namespaces together? Puzzling!

The regular operation mode of a veth device is that transmission of a packet from one side of a veth-pair results in immediate, in-line, on the same CPU, reception of the packet by the other side of the pair. There shouldn’t be any polling, context switches or such.

However, in Linux v4.19 veth driver gained support for native mode eXpress Data Path (XDP). XDP relies on NAPI, an interface between the network drivers and the Linux network stack. NAPI requires that drivers register a poll() callback for fetching received packets.

The NAPI receive path in the veth driver is taken only when there is an XDP program attached. The fork occurs in veth_forward_skb, where the TX path ends and a RX path on the other side begins.

This is an important observation because only on the NAPI/XDP path in the veth driver, received packets might get aggregated by the Generic Receive Offload.

Super-packets

Early on we’ve noted that the crash happens when processing a GSO packet. I’ve promised we will get back to it and now is the time.

Generic Segmentation Offload (GSO) is all about delaying the L4 segmentation process until the very last moment. So called super-packets, that exceed the egress route MTU in size, travel all the way through the network stack, only to be cut into MTU-sized segments just before handing the data over to the network driver for transmission. This way we process just one big packet on the transmit path, instead of a few smaller ones and save on CPU cycles in all the IP-level stack functions like routing, nftables, traffic control

Where do these super-packets come from? They can be a result of large write to a socket, or as is our case, they can be received from one network and forwarded to another network.

The latter case, that is forwarding a super-packet, happens when Generic Receive Offload (GRO) kicks in during receive. GRO is the opposite process of GSO. Smaller, MTU-sized packets get merged to form a super-packet early on the receive path. The goal is the same — process less by pushing just one packet through the network stack layers.

Not just any packets can be fused together by GRO. Loosely speaking, any two packets to be merged must form a logical sequence in the network flow, and carry the same metadata in protocol headers. It is critical that no information is lost in the aggregation process. Otherwise, GSO won’t be able to reconstruct the segment stream when serializing packets in the network card transmission code.

To this end, each network protocol that supports GRO provides a callback which signals whether the above conditions hold true. GRO implementation (dev_gro_receive()) then walks through the packet headers, the outer as well as the inner ones, and delegates the pre-merge check to the right protocol callback. If all stars align, the packets get spliced at the end of the callback chain (skb_gro_receive()).

I will be frank. The code that performs GRO is pretty complex, and I spent a significant amount of time staring into it. Hat tip to its authors. However, for our little investigation it will be enough to understand that a TCP stream encapsulated with GRE¹ would trigger callback chain like so:

Armed with basic GRO/GSO understanding we are ready to take a shot at reproducing the crash.

The reproducer

Let’s recap what we know:

a super-packet was received from a veth device,
the veth device had an XDP program attached,
the packet was forwarded to another device,
the egress device was transmitting a GSO super-packet,
the packet was encapsulated,
the super-packet must have been produced by GRO on ingress.

This paints a pretty clear picture on what the setup should look like:

We can work with that. A simple shell script will be our setup machinery.

We will be sending traffic from 10.1.1.1 to 10.2.2.2. Our traffic pattern will be a TCP stream consisting of two consecutive segments so that GRO can merge something. A Scapy script will be great for that. Let’s call it send-a-pair.py and give it a run:

$ { sleep 5; sudo ip netns exec A ./send-a-pair.py; } &
[1] 1603
$ sudo ip netns exec B tcpdump -i BA -n -nn -ttt 'ip and not arp'
…
 00:00:00.020506 IP 10.1.1.1 > 10.2.2.2: GREv0, length 1480: IP 192.168.1.1.12345 > 192.168.2.2.443: Flags [.], seq 0:1436, ack 1, win 8192, length 1436
 00:00:00.000082 IP 10.1.1.1 > 10.2.2.2: GREv0, length 1480: IP 192.168.1.1.12345 > 192.168.2.2.443: Flags [.], seq 1436:2872, ack 1, win 8192, length 1436

Where is our super-packet? Look at the packet sizes, the GRO didn’t merge anything.

Turns out NAPI is just too fast at fetching the packets from the Rx ring. We need a little buffering on transmit to increase our chances of GRO batching:

# Help GRO
ip netns exec A tc qdisc add dev AB root netem delay 200us slot 5ms 10ms packets 2 bytes 64k

With the delay in place, things look better:

 00:00:00.016972 IP 10.1.1.1 > 10.2.2.2: GREv0, length 2916: IP 192.168.1.1.12345 > 192.168.2.2.443: Flags [.], seq 0:2872, ack 1, win 8192, length 2872

8192 bytes shown by tcpdump clearly indicate GRO in action. And we are even hitting the crash point:

$ sudo bpftrace -e 'kprobe:skb_gso_transport_seglen { print(kstack()); }' -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 1 probe...

        skb_gso_transport_seglen+1
        skb_gso_validate_network_len+17
        __ip_finish_output+293
        ip_output+113
        ip_forward+876
        ip_rcv+188
        __netif_receive_skb_one_core+128
        netif_receive_skb_internal+47
        napi_gro_flush+151
        napi_complete_done+183
        veth_poll+1697
        net_rx_action+314
        …

^C

…but we are not crashing. We will need to dig deeper.

We know what packet metadata skb_gso_transport_seglen() looks at — the header offsets, then encapsulation flag, and GSO info. Let’s dump all of it:

$ sudo bpftrace ./why-no-crash.bt -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 2 probes...
DEV  LEN  NH  TH  ENC INH ITH GSO SIZE SEGS TYPE FUNC
sink 2936 270 290 1   294 254  |  1436 2    0x41 skb_gso_transport_seglen

Since the skb->encapsulation flag (ENC) is set, both outer and inner header offsets should be valid. Are they?

The outer network / L3 header (NH) looks sane. When XDP is enabled, it reserves 256 bytes of headroom before the headers. 14 byte long Ethernet header follows the headroom. The IPv4 header should then start at 270 bytes into the packet buffer.

The outer transport / L4 header offset is as expected as well. The IPv4 header takes 20 bytes, and the GRE header follows it.

The inner network header (INH) begins at the offset of 294 bytes. This makes sense because the GRE header in its most basic form is 4 bytes long.

The surprise comes last. The inner transport header offset points somewhere near the end of headroom which XDP reserves. Instead, it should start at 314, following the inner IPv4 header.

Is this the smoking gun we were looking for?

The bug

skb_gso_transport_seglen() calculates the length of the outer L4 segment when given a GSO packet. If the inner_transport_header offset is off, then the result of the calculation might be off as well. Worth checking.

We know that our segments are 1500 bytes long. That makes the L4 part 1480 bytes long. What does skb_gso_transport_seglen() say though?

$ sudo bpftrace -e 'kretprobe:skb_gso_transport_seglen { print(retval); }' -c …
Attaching 1 probe...
1460

Seems that we don’t agree. But if skb_gso_transport_seglen() is getting garbage on input we can’t really blame it.

If inner_transport_header is not correct, that TCP Data Offset read that we know happens inside the function cannot end well.

If we map it out, it looks like we are loading part of the source MAC address (upper 4 bits of the 5th byte, to be precise) and interpreting it as TCP Data Offset.

Are we? There is an easy way to check.

If we ask nicely, tcpdump will tell us what the MAC addresses are:

Plugging that into the calculations that skb_gso_transport_seglen() …

thlen = inner_transport_header(skb) - transport_header(skb) = 254 - 290 = -36
thlen += inner_transport_header(skb)->doff * 4 = -36 + (0xf * 4) = -36 + 60 = 24
retval = gso_size + thlen = 1436 + 24 = 1460

…checks out!

Does this mean that I can control the return value by setting source MAC address?!

                                               👇
$ sudo ip -n A link set AB address be:d6:07:5e:05:11 # Change the MAC address 
$ sudo bpftrace -e 'kretprobe:skb_gso_transport_seglen { print(retval); }' -c …
Attaching 1 probe...
1400

Yes! 1436 + (-36) + (0 * 4) = 1400. This is it.

However, how does all this tie it to the original crash? The badly calculated L4 segment length will make GSO emit shorter segments on egress. But that’s all.

Remember the suspicious offset from the crash report?

%rax = %rsi = skb->inner_transport_header - skb->transport_header = 0xfeda = 65242

We now know that skb->transport_header should be 290. That makes skb->inner_transport_header = 65242 + 290 = 65532 = 0xfffc.

Which means that when we triggered the page fault we were trying to load memory from a location at

skb->head + skb->inner_transport_header + offsetof(tcphdr, doff) = skb->head + 0xfffc + 12 = 0xffff9d9624bbb008

Solving it for skb->head yields 0xffff9d9624bbb008 - 0xfffc - 12 = 0xffff9d9624bab000.

And this makes sense. The skb->head buffer is page-aligned, meaning it’s a multiple of 4 KiB on x86-64 platforms — the 12 least significant bits the address are 0.

However, the address we were trying to read was (0xfffc+12)/4096 ~= 16 pages (or 64 KiB) past the skb->head page boundary (0xffff9d9624babfff).

Who knows if there was memory mapped to this address?! Looks like from time to time there wasn’t anything mapped there and the kernel page fault handling code went “oops!”.

The fix

It is finally time to understand who sets the header offsets in a super-packet.

Once GRO is done merging segments, it flushes the super-packet down the pipe by kicking off a chain of gro_complete callbacks:

napi_gro_complete → inet_gro_complete → gre_gro_complete → inet_gro_complete → tcp4_gro_complete → tcp_gro_complete

These callbacks are responsible for updating the header offsets and populating the GSO-related fields in skb_shared_info struct. Later on the transmit side will consume this data.

Let’s see how the packet metadata changes as it travels through the gro_complete callbacks² by adding a few more tracepoint to our bpftrace script:

$ sudo bpftrace ./why-no-crash.bt -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 7 probes...
DEV  LEN  NH  TH  ENC INH ITH GSO SIZE SEGS TYPE FUNC
BA   2936 294 314 0   254 254  |  1436 0    0x00 napi_gro_complete
BA   2936 294 314 0   254 254  |  1436 0    0x00 inet_gro_complete
BA   2936 294 314 0   254 254  |  1436 0    0x00 gre_gro_complete
BA   2936 294 314 1   254 254  |  1436 0    0x40 inet_gro_complete
BA   2936 294 314 1   294 254  |  1436 0    0x40 tcp4_gro_complete
BA   2936 294 314 1   294 254  |  1436 0    0x41 tcp_gro_complete
sink 2936 270 290 1   294 254  |  1436 2    0x41 skb_gso_transport_seglen

As the packet travels through the gro_complete callbacks, the inner network header (INH) offset gets updated after we have processed the inner IPv4 header.

However, the same thing did not happen to the inner transport header (ITH) that is causing trouble. We need to fix that.

--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -298,6 +298,9 @@ int tcp_gro_complete(struct sk_buff *skb)
        if (th->cwr)
                skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;

+       if (skb->encapsulation)
+               skb->inner_transport_header = skb->transport_header;
+
        return 0;
 }
 EXPORT_SYMBOL(tcp_gro_complete);

With the patch in place, the header offsets are finally all sane and skb_gso_transport_seglen() return value is as expected:

$ sudo bpftrace ./why-no-crash.bt -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 2 probes...
DEV  LEN  NH  TH  ENC INH ITH GSO SIZE SEGS TYPE FUNC
sink 2936 270 290 1   294 314  |  1436 2    0x41 skb_gso_transport_seglen

$ sudo bpftrace -e 'kretprobe:skb_gso_transport_seglen { print(retval); }' -c …
Attaching 1 probe...
1480

Don’t worry, though. The fix is already likely in your kernel long time ago. Patch d51c5907e980 (“net, gro: Set inner transport header offset in tcp/udp GRO hook”) has been merged into Linux v5.14, and backported to v5.10.58 and v5.4.140 LTS kernels. The Linux kernel community has got you covered. But please, keep on updating your production kernels.

Outro

What a journey! We have learned a ton and fixed a real bug in the Linux kernel. In the end it was not a Packet of Death. Maybe next time we can find one 😉

Enjoyed the read? Why not join Cloudflare and help us fix the remaining bugs in the Linux kernel? We are hiring in Lisbon, London, and Austin.

And if you would like to see more kernel blog posts, please let us know!

…
¹Why GRE and not some other type of encapsulation? If you follow our blog closely, you might already know that Cloudflare Magic Transit uses veth pairs to route traffic into and out of network namespaces. It also happens to use GRE encapsulation. If you are curious why we chose network namespaces linked with veth pairs, be sure to watch the How we built Magic Transit talk.
²Just turn off GRO on all other network devices in use to get a clean output (sudo ethtool -K enp0s5 gro off).

How we build software at Cloudflare

2021-11-02 Nick Wood

Post Syndicated from Nick Wood original https://blog.cloudflare.com/building-software-at-cloudflare/

How we build software at Cloudflare

Cloudflare provides a broad range of products — ranging from security, to performance and serverless compute — which are used by millions of Internet properties worldwide. Often, these products are built by multiple teams in close collaboration and delivering them can be a complex task. So ever wondered how we do so consistently and safely at scale?

Software delivery consists of all the activities to get working software into the hands of customers. It’s usual to talk about software delivery with reference to a model, or framework. These provide the scaffolding for most modern software delivery models, although in order to minimise operational friction it’s usual for a company to tailor their approach to suit their business context and culture.

For example, a company that designs the autopilot systems for passenger aircraft will require very strict tolerances, as a failure could cost hundreds of lives. They would want a different process to a cutting edge tech startup, who may value time to market over system uptime or stability.

Before outlining the approach we use at Cloudflare it’s worth quickly running through a couple of commonly used delivery models.

The Waterfall Approach

Waterfall has its foundations (pun intended) in construction and manufacturing. It breaks a project up into phases and presumes that each phase is completed before the next begins. Each phase “cascades” into the next bit like a waterfall, hence the name.

The main criticism of waterfall approaches arises when flaws are discovered downstream, which may necessitate a return to earlier phases — though this can be managed through governance processes that allows for adjusting scope, budgets or timelines.

More recently there are a number of modified waterfall models which have been developed as a response to its perceived inflexibility. Some notable examples are the Rational Unified Process (RUP), which encourages iteration within phases, and Sashimi which provides partial overlap between phases.

Despite falling out of favour in recent years, waterfall still has a place in modern technology companies. It tends to be reserved for projects where the scope and requirements can be defined upfront and are unlikely to change. At Cloudflare, we use it for infrastructure rollouts, for example. It also has a place in very large projects with complex dependencies to manage.

Agile Approaches

Agile isn’t a single well-defined process, rather a family of approaches which share similar philosophies — those of the agile manifesto. Implementations vary, but most agile flavours tend to share a number of common traits:

Short release cycles, such that regular feedback (ideally from real users) can be incorporated.
Teams maintain a prioritized to-do list of upcoming work (often called a ‘backlog’), with the most valuable items are at the top.
Teams should be self-organizing, and work at a sustainable pace.
A philosophy of Continuous Improvement, where teams seek to improve their ways of working over time.

Continuous improvement is very much the heart of agile, meaning these approaches are less about nailing down “the correct process” and focus more on regular reflection and change. This means variances between any two teams is expected, and encouraged.

Agile approaches can be divided into two main branches — iterative and flow-based. Scrum is probably the most prevalent of the iterative agile methods. In Scrum a team aims to build shippable increments of code at regular intervals called sprints (or “iterations”). Flow-based approaches on the other hand (such as Kanban) instead pick up new items from their backlog on an ad hoc basis. They use a number of techniques to try and minimise work in progress across the team.

The main differences between the two branches can be typified by looking at two example teams:

The “Green” Team has a set of products they support and wants to update them regularly, production issues for them are rare and there is very little ad-hoc work. An iterative approach allows them to make long term plans whilst also being able to incorporate feedback from users with some regularity.
The “Blue” Team meanwhile is an operational team, where a big part of their role is to monitor production systems and investigate issues as they arise. For them, a flow based approach is much more appropriate, so they can update their plans on the fly as new items arise.

Which approach does Cloudflare follow?

Cloudflare comprises dozens of globally distributed engineering teams each with their own unique challenges and contexts. A team usually has an Engineering Manager, a Product Manager and less than 10 engineers, who all focus on a singular product or mission. The DDoS team for example is one such team.

A team that supports a newly released product will likely want to rapidly incorporate feedback from customers, whereas a team that manages shared internal platforms will prize platform stability over speed of innovation. There is a spectrum of different contexts within Cloudflare which makes it impossible to define a single software delivery method for all teams to follow.

Instead, we take a more nuanced approach where we allow teams to decide which methodology they wish to follow within the team, whilst also defining a number of high-level concepts and language that are common to all teams. In other words, we are more concerned with macro-management than micro-management.

“SHIP”s and “Epic”s

At the highest level, our unit of work is called a “SHIP” — this is a change to a service or product which we intend to ship to customers, hence the name. All live SHIPs are published on our internal roadmap, called our “SHIP-board”. Transparency and collaboration are part of our DNA at Cloudflare, so for us, it’s important that anyone in Cloudflare can view the SHIP-board.

Individual SHIPs are sized such that they can be comfortably delivered within a month or two, though we have a strong preference towards shorter timescales. We’d much rather deliver three small feature sets monthly than one big launch every quarter.

A single SHIP might need work from multiple teams in order for customers to use it. We manage this by ensuring there is an EPIC within the SHIP for each team contributing. To prevent circular dependencies, a SHIP can’t contain another SHIP. SHIPs are owned by their Product Manager, and EPICs are usually owned by an Engineering Manager. We also allow for EPICs to be created that don’t deliver against SHIPs — this is where technical improvement initiatives are typically managed.

Below the level of EPICs we don’t enforce any strict delivery model on teams, though teams will usually link their contributory work to the EPIC for ease of tracking. Teams are free to use whichever delivery framework they wish.

Within the Product Engineering organisation, all Product Managers and Engineering managers meet weekly to discuss progress and blockers of their live SHIPs/EPICs. Due to the number of people involved, this is a very rapid fire meeting facilitated by our automated “SHIP-board”. This has a built-in linter to highlight potential issues, to be updated prior to the meeting. We run through each team one by one, starting with the team with the most outstanding lints.

There’s also a few icons which let us visualise the status of a SHIP or EPIC at a glance. For example, a monkey means the target date for an item moved in the last week. Bananas count the total number of times the date has “slipped”, i.e. changed. A typical fragment of the SHIP-board is shown below.

Planning

Planning takes place every quarter. This lets us deliver aggressively, without having to change plans too frequently. It also forces us to make conscious choices about what to include and exclude from a SHIP so that extraneous work is minimised.

About a month before quarter-end, product managers will begin to compile the SHIPs that would deliver the most value to customers. They’ll work with their engineering teams to understand how the work might be done, and what work is required of other teams (e.g. the UI team might need to build a frontend whilst another team builds the API).

The team will likely estimate the work at this stage (though the exact mechanism is left up to them). Where work is required of other teams we’ll also begin to reach out to them, so they can factor it into their work for the quarter, and estimate their effort too. It’s also important at this stage to understand what kind of dependency this is — do we need one piece to fully complete before the other, or can they be done in parallel and integrated towards the end? Usually the latter.

The final aspect of planned work are unlinked EPICs — these are things that don’t necessarily contribute meaningfully to a SHIP, but the team would still like to get them completed. Examples of this are performance improvements, or changes/fixes to backend tooling.

We deliver continuously through the quarter to avoid a scramble of deployment at once, and our target dates will reflect that. We also allow anything delivered in the first two weeks of the following quarter to still count as being on-time — stability of the network is more important than hitting arbitrary dates.

We also take a fairly pragmatic approach towards target dates. A natural part of software delivery is that as we begin to explore the solution space we may uncover additional complexity. As long as we can justify a change of date it’s perfectly acceptable to amend the dates of SHIPs/EPICs to reflect the latest information. The exception to this is where we’ve made an external commitment to deliver something, so changing the delivery dates is subject to greater scrutiny.

Keeping us safe

You might think that letting teams set their own process might lead to chaos, but in my experience the opposite is true. By allowing teams to define their own methods we are empowering them to make better decisions and understand their own context within Cloudflare. We explicitly define the interfaces we use between teams, and that allows teams the flexibility to do what works best for them.

We don’t go as far as to say “there are no rules”. Last quarter Cloudflare blocked an average of 87 billion cyber threats each day, and in July 2021 we blocked the largest DDoS attack ever recorded. If we have an outage, our customers feel it, and we feel it too. To manage this we have strict, though simple, rules governing how code reaches our data centers. For example, we mandate a minimum number of reviews for each piece of code, and our deployments are phased so that changes are tested on a subset of live traffic, so any issues can be localised.

The main takeaway is to find the right balance between freedom and rules, and appreciate that this may vary for different teams within the organisation. Enforcing an unnecessarily strict process can cause a lot of friction in teams, and that’s a shortcut to losing great people. Our ideal process is one that minimises red tape, such that our team can focus on the hard job of protecting our customers.

P.S. — we’re hiring!

Do you want to come and work on advanced technologies where every code push impacts millions of Internet properties? Join our team!

How we built Instant Logs

2021-09-14 Ben Yule

Post Syndicated from Ben Yule original https://blog.cloudflare.com/how-we-built-instant-logs/

How we built Instant Logs

As a developer, you may be all too familiar with the stress of responding to a major service outage, becoming aware of an ongoing security breach, or simply dealing with the frustration of setting up a new service for the first time. When confronted with these situations, you want a real-time view into the events flowing through your network, so you can receive, process, and act on information as quickly as possible.

If you have a UNIX mindset, you’ll be familiar with tailing web service logs and searching for patterns using grep. With distributed systems like Cloudflare’s edge network, this task becomes much more complex because you’ll either need to log in to thousands of servers, or ship all the logs to a single place.

This is why we built Instant Logs. Instant Logs removes all barriers to accessing your Cloudflare logs, giving you a complete platform to view your HTTP logs in real time, with just a single click, right from within Cloudflare’s dashboard. Powerful filters then let you drill into specific events or search for patterns, and act on them instantly.

The Challenge

Today, Cloudflare’s Logpush product already gives customers the ability to ship their logs to a third-party analytics or storage provider of their choosing. While this system is already exceptionally fast, delivering logs in about 15s on average, it is optimized for completeness and the utmost certainty that your data is reliably making it to its destination. It is the ideal solution for after things have settled down, and you want to perform a forensic deep dive or retrospective.

We originally aimed to extend this system to provide our real-time logging capabilities, but we soon realized the objectives were inherently at odds with each other. In order to get all of your data, to a single place, all the time, the laws of the universe require that latencies be introduced into the system. We needed a complementary solution, with its own unique set of objectives.

This ultimately boiled down to the following

It has to be extremely fast, in human terms. This means average latencies between an event occurring at the edge and being received by the client should be under three seconds.
We wanted the system design to be simple, and communication to be as direct to the client as possible. This meant operating the dataplane entirely at the edge, eliminating unnecessary round trips to a core data center.
The pipeline needs to provide sensible results on properties of all sizes, ranging from a few requests per day to hundreds of thousands of requests per second.
The pipeline must support a broad set of user-definable filters that are applied before any sampling occurs, such that a user can target and receive exactly what they want.

Workers and Durable Objects

Our existing Logpush pipeline relies heavily on Kafka to provide sharding, buffering, and aggregation at a single, central location. While we’ve had excellent results using Kafka for these pipelines, the clusters are optimized to run only within our core data centers. Using Kafka would require extra hops to far away data centers, adding a latency penalty we were not willing to incur.

In order to keep the data plane running on the edge, we needed primitives that would allow us to perform some of the same key functions we needed out of Kafka. This is where Workers and the recently released Durable Objects come in. Workers provide an incredibly simple to use, highly elastic, edge-native, compute platform we can use to receive events, and perform transformations. Durable Objects, through their global uniqueness, allow us to coordinate messages streaming from thousands of servers and route them to a singular object. This is where aggregation and buffering are performed, before finally pushing to a client over a thin WebSocket. We get all of this, without ever having to leave the edge!

Let’s walk through what this looks like in practice.

A Simple Start

Imagine a simple scenario in which we have a single web server which produces log messages, and a single client which wants to consume them. This can be implemented by creating a Durable Object, which we will refer to as a Durable Session, that serves as the point of coordination between the server and client. In this case, the client initiates a WebSocket connection with the Durable Object, and the server sends messages to the Durable Object over HTTP, which are then forwarded directly to the client.

This model is quite quick and introduces very little additional latency other than what would be required to send a payload directly from the web server to the client. This is thanks to the fact that Durable Objects are generally located at or near the data center where they are first requested. At least in human terms, it’s instant. Adding more servers to our model is also trivial. As the additional servers produce events, they will all be routed to the same Durable Object, which merges them into a single stream, and sends them to the client over the same WebSocket.

Durable Objects are inherently single threaded. As the number of servers in our simple example increases, the Durable Object will eventually saturate its CPU time and will eventually start to reject incoming requests. And even if it didn’t, as data volumes increase, we risk overwhelming a client’s ability to download and render log lines. We’ll handle this in a few different ways.

Honing in on specific events

Filtering is the most simple and obvious way to reduce data volume before it reaches the client. If we can filter out the noise, and stream only the events of interest, we can substantially reduce volume. Performing this transformation in the Durable Object itself will provide no relief from CPU saturation concerns. Instead, we can push this filtering out to an invoking Worker, which will run many filter operations in parallel, as it elastically scales to process all the incoming requests to the Durable Object. At this point, our architecture starts to look a lot like the MapReduce pattern!

Scaling up with shards

Ok, so filtering may be great in some situations, but it’s not going to save us under all scenarios. We still need a solution to help us coordinate between potentially thousands of servers that are sending events every single second. Durable Objects will come to the rescue, yet again. We can implement a sharding layer consisting of Durable Objects, we will call them Durable Shards, that effectively allow us to reduce the number of requests being sent to our primary object.

But how do we implement this layer if Durable Objects are globally unique? We first need to decide on a shard key, which is used to determine which Durable Object a given message should first be routed to. When the Worker processes a message, the key will be added to the name of the downstream Durable Object. Assuming our keys are well-balanced, this should effectively reduce the load on the primary Durable Object by approximately 1/N.

Reaching the moon by sampling

But wait, there’s more to do. Going back to our original product requirements, “The pipeline needs to provide sensible results on properties of all sizes, ranging from a few requests per day to hundreds of thousands of requests per second.” With the system as designed so far, we have the technical headroom to process an almost arbitrary number of logs. However, we’ve done nothing to reduce the absolute volume of messages that need to be processed and sent to the client, and at high log volumes, clients would quickly be overwhelmed. To deliver the interactive, instant user experience customers expect, we need to roll up our sleeves one more time.

This is where our final trick, sampling, comes into play.

Up to this point, when our pipeline saturates, it still makes forward progress by dropping excess data as the Durable Object starts to refuse connections. However, this form of ‘uncontrolled shedding’ is dangerous because it causes us to lose information. When we drop data in this way, we can’t keep a record of the data we dropped, and we cannot infer things about the original shape of the traffic from the messages that we do receive. Instead, we implement a form of ‘controlled’ sampling, which still preserves the statistics, and information about the original traffic.

For Instant Logs, we implement a sampling technique called Reservoir Sampling. Reservoir sampling is a form of dynamic sampling that has this amazing property of letting us pick a specific k number of items from a stream of unknown length n, with a single pass through the data. By buffering data in the reservoir, and flushing it on a short (sub second) time interval, we can output random samples to the client at the maximum data rate of our choosing. Sampling is implemented in both layers of Durable Objects.

Information about the original traffic shape is preserved by assigning a sample interval to each line, which is equivalent to the number of samples that were dropped for this given sample to make it through, or 1/probability. The actual number of requests can then be calculated by taking the sum of all sample intervals within a time window. This technique adds a slight amount of latency to the pipeline to account for buffering, but enables us to point an event source of nearly any size at the pipeline, and we can expect it will be handled in a sensible, controlled way.

Putting it all together

What we are left with is a pipeline that sensibly handles wildly different volumes of traffic, from single digits to hundreds of thousands of requests a second. It allows the user to pinpoint an exact event in a sea of millions, or calculate summaries over every single one. It delivers insight within seconds, all without ever having to do more than click a button.

Best of all? Workers and Durable Objects handle this workload with aplomb and no tuning, and the available developer tooling allowed me to be productive from my first day writing code targeting the Workers ecosystem.

How to get involved?

We’ll be starting our Beta for Instant Logs in a couple of weeks. Join the waitlist to get notified about when you can get access!

If you want to be part of building the future of data at Cloudflare, we’re hiring engineers for our data team in Lisbon, London, Austin, and San Francisco!

How to execute an object file: Part 3

2021-09-10 Ignat Korchagin

Post Syndicated from Ignat Korchagin original https://blog.cloudflare.com/how-to-execute-an-object-file-part-3/

Dealing with external libraries

How to execute an object file: Part 3

In the part 2 of our series we learned how to process relocations in object files in order to properly wire up internal dependencies in the code. In this post we will look into what happens if the code has external dependencies — that is, it tries to call functions from external libraries. As before, we will be building upon the code from part 2. Let’s add another function to our toy object file:

obj.c:

#include <stdio.h>
 
...
 
void say_hello(void)
{
    puts("Hello, world!");
}

In the above scenario our say_hello function now depends on the puts function from the C standard library. To try it out we also need to modify our loader to import the new function and execute it:

loader.c:

...
 
static void execute_funcs(void)
{
    /* pointers to imported functions */
    int (*add5)(int);
    int (*add10)(int);
    const char *(*get_hello)(void);
    int (*get_var)(void);
    void (*set_var)(int num);
    void (*say_hello)(void);
 
...
 
    say_hello = lookup_function("say_hello");
    if (!say_hello) {
        fputs("Failed to find say_hello function\n", stderr);
        exit(ENOENT);
    }
 
    puts("Executing say_hello...");
    say_hello();
}
...

Let’s run it:

$ gcc -c obj.c
$ gcc -o loader loader.c
$ ./loader
No runtime base address for section

Seems something went wrong when the loader tried to process relocations, so let’s check the relocations table:

$ readelf --relocs obj.o
 
Relocation section '.rela.text' at offset 0x3c8 contains 7 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000020  000a00000004 R_X86_64_PLT32    0000000000000000 add5 - 4
00000000002d  000a00000004 R_X86_64_PLT32    0000000000000000 add5 - 4
00000000003a  000500000002 R_X86_64_PC32     0000000000000000 .rodata - 4
000000000046  000300000002 R_X86_64_PC32     0000000000000000 .data - 4
000000000058  000300000002 R_X86_64_PC32     0000000000000000 .data - 4
000000000066  000500000002 R_X86_64_PC32     0000000000000000 .rodata - 4
00000000006b  001100000004 R_X86_64_PLT32    0000000000000000 puts - 4
...

The compiler generated a relocation for the puts invocation. The relocation type is R_X86_64_PLT32 and our loader already knows how to process these, so the problem is elsewhere. The above entry shows that the relocation references 17th entry (0x11 in hex) in the symbol table, so let’s check that:

$ readelf --symbols obj.o
 
Symbol table '.symtab' contains 18 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS obj.c
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    3
     4: 0000000000000000     0 SECTION LOCAL  DEFAULT    4
     5: 0000000000000000     0 SECTION LOCAL  DEFAULT    5
     6: 0000000000000000     4 OBJECT  LOCAL  DEFAULT    3 var
     7: 0000000000000000     0 SECTION LOCAL  DEFAULT    7
     8: 0000000000000000     0 SECTION LOCAL  DEFAULT    8
     9: 0000000000000000     0 SECTION LOCAL  DEFAULT    6
    10: 0000000000000000    15 FUNC    GLOBAL DEFAULT    1 add5
    11: 000000000000000f    36 FUNC    GLOBAL DEFAULT    1 add10
    12: 0000000000000033    13 FUNC    GLOBAL DEFAULT    1 get_hello
    13: 0000000000000040    12 FUNC    GLOBAL DEFAULT    1 get_var
    14: 000000000000004c    19 FUNC    GLOBAL DEFAULT    1 set_var
    15: 000000000000005f    19 FUNC    GLOBAL DEFAULT    1 say_hello
    16: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND _GLOBAL_OFFSET_TABLE_
    17: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND puts

Oh! The section index for the puts function is UND (essentially 0 in the code), which makes total sense: unlike previous symbols, puts is an external dependency, and it is not implemented in our obj.o file. Therefore, it can’t be a part of any section within obj.o.
So how do we resolve this relocation? We need to somehow point the code to jump to a puts implementation. Our loader actually already has access to the C library puts function (because it is written in C and we’ve used puts in the loader code itself already), but technically it doesn’t have to be the C library puts, just some puts implementation. For completeness, let’s implement our own custom puts function in the loader, which is just a decorator around the C library puts:

loader.c:

...
 
/* external dependencies for obj.o */
static int my_puts(const char *s)
{
    puts("my_puts executed");
    return puts(s);
}
...

Now that we have a puts implementation (and thus its runtime address) we should just write logic in the loader to resolve the relocation by instructing the code to jump to the correct function. However, there is one complication: in part 2 of our series, when we processed relocations for constants and global variables, we learned we’re mostly dealing with 32-bit relative relocations and that the code or data we’re referencing needs to be no more than 2147483647 (0x7fffffff in hex) bytes away from the relocation itself. R_X86_64_PLT32 is also a 32-bit relative relocation, so it has the same requirements, but unfortunately we can’t reuse the trick from part 2 as our my_puts function is part of the loader itself and we don’t have control over where in the address space the operating system places the loader code.

Luckily, we don’t have to come up with any new solutions and can just borrow the approach used in shared libraries.

Exploring PLT/GOT

Real world ELF executables and shared libraries have the same problem: often executables have dependencies on shared libraries and shared libraries have dependencies on other shared libraries. And all of the different pieces of a complete runtime program may be mapped to random ranges in the process address space. When a shared library or an ELF executable is linked together, the linker enumerates all the external references and creates two or more additional sections (for a refresher on ELF sections check out the part 1 of our series) in the ELF file. The two mandatory ones are the Procedure Linkage Table (PLT) and the Global Offset Table (GOT).

We will not deep-dive into specifics of the standard PLT/GOT implementation as there are many other great resources online, but in a nutshell PLT/GOT is just a jumptable for external code. At the linking stage the linker resolves all external 32-bit relative relocations with respect to a locally generated PLT/GOT table. It can do that, because this table would become part of the final ELF file itself, so it will be "close" to the main code, when the file is mapped into memory at runtime. Later, at runtime the dynamic loader populates PLT/GOT tables for every loaded ELF file (both the executable and the shared libraries) with the runtime addresses of all the dependencies. Eventually, when the program code calls some external library function, the CPU "jumps" through the local PLT/GOT table to the final code:

How to execute an object file: Part 3

Why do we need two ELF sections to implement one jumptable you may ask? Well, because real world PLT/GOT is a bit more complex than described above. Turns out resolving all external references at runtime may significantly slow down program startup time, so symbol resolution is implemented via a "lazy approach": a reference is resolved by the dynamic loader only when the code actually tries to call a particular function. If the main application code never calls a library function, that reference will never be resolved.

Implementing a simplified PLT/GOT

For learning and demonstrative purposes though we will not be reimplementing a full-blown PLT/GOT with lazy resolution, but a simple jumptable, which resolves external references when the object file is loaded and parsed. First of all we need to know the size of the table: for ELF executables and shared libraries the linker will count the external references at link stage and create appropriately sized PLT and GOT sections. Because we are dealing with raw object files we would have to do another pass over the .rela.text section and count all the relocations, which point to an entry in the symbol table with undefined section index (or 0 in code). Let’s add a function for this and store the number of external references in a global variable:

loader.c:

...
 
/* number of external symbols in the symbol table */
static int num_ext_symbols = 0;
...
static void count_external_symbols(void)
{
    const Elf64_Shdr *rela_text_hdr = lookup_section(".rela.text");
    if (!rela_text_hdr) {
        fputs("Failed to find .rela.text\n", stderr);
        exit(ENOEXEC);
    }
 
    int num_relocations = rela_text_hdr->sh_size / rela_text_hdr->sh_entsize;
    const Elf64_Rela *relocations = (Elf64_Rela *)(obj.base + rela_text_hdr->sh_offset);
 
    for (int i = 0; i < num_relocations; i++) {
        int symbol_idx = ELF64_R_SYM(relocations[i].r_info);
 
        /* if there is no section associated with a symbol, it is probably
         * an external reference */
        if (symbols[symbol_idx].st_shndx == SHN_UNDEF)
            num_ext_symbols++;
    }
}
...

This function is very similar to our do_text_relocations function. Only instead of actually performing relocations it just counts the number of external symbol references.

Next we need to decide the actual size in bytes for our jumptable. num_ext_symbols has the number of external symbol references in the object file, but how many bytes per symbol to allocate? To figure this out we need to design our jumptable format. As we established above, in its simple form our jumptable should be just a collection of unconditional CPU jump instructions — one for each external symbol. However, unfortunately modern x64 CPU architecture does not provide a jump instruction, where an address pointer can be a direct operand. Instead, the jump address needs to be stored in memory somewhere "close" — that is within 32-bit offset — and the offset is the actual operand. So, for each external symbol we need to store the jump address (64 bits or 8 bytes on a 64-bit CPU system) and the actual jump instruction with an offset operand (6 bytes for x64 architecture). We can represent an entry in our jumptable with the following C structure:

loader.c:

...
 
struct ext_jump {
    /* address to jump to */
    uint8_t *addr;
    /* unconditional x64 JMP instruction */
    /* should always be {0xff, 0x25, 0xf2, 0xff, 0xff, 0xff} */
    /* so it would jump to an address stored at addr above */
    uint8_t instr[6];
};
 
struct ext_jump *jumptable;
...

We’ve also added a global variable to store the base address of the jumptable, which will be allocated later. Notice that with the above approach the actual jump instruction will always be constant for every external symbol. Since we allocate a dedicated entry for each external symbol with this structure, the addr member would always be at the same offset from the end of the jump instruction in instr: -14 bytes or 0xfffffff2 in hex for a 32-bit operand. So instr will always be {0xff, 0x25, 0xf2, 0xff, 0xff, 0xff}: 0xff and 0x25 is the encoding of the x64 jump instruction and its modifier and 0xfffffff2 is the operand offset in little-endian format.

Now that we have defined the entry format for our jumptable, we can allocate and populate it when parsing the object file. First of all, let’s not forget to call our new count_external_symbols function from the parse_obj to populate num_ext_symbols (it has to be done before we allocate the jumptable):

loader.c:

...
 
static void parse_obj(void)
{
...
 
    count_external_symbols();
 
    /* allocate memory for `.text`, `.data` and `.rodata` copies rounding up each section to whole pages */
    text_runtime_base = mmap(NULL, page_align(text_hdr->sh_size)...
...
}

Next we need to allocate memory for the jumptable and store the pointer in the jumptable global variable for later use. Just a reminder that in order to resolve 32-bit relocations from the .text section to this table, it has to be "close" in memory to the main code. So we need to allocate it in the same mmap call as the rest of the object sections. Since we defined the table’s entry format in struct ext_jump and have num_ext_symbols, the size of the table would simply be sizeof(struct ext_jump) * num_ext_symbols:

loader.c:

...
 
static void parse_obj(void)
{
...
 
    count_external_symbols();
 
    /* allocate memory for `.text`, `.data` and `.rodata` copies and the jumptable for external symbols, rounding up each section to whole pages */
    text_runtime_base = mmap(NULL, page_align(text_hdr->sh_size) + \
                                   page_align(data_hdr->sh_size) + \
                                   page_align(rodata_hdr->sh_size) + \
                                   page_align(sizeof(struct ext_jump) * num_ext_symbols),
                                   PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (text_runtime_base == MAP_FAILED) {
        perror("Failed to allocate memory");
        exit(errno);
    }
 
...
    rodata_runtime_base = data_runtime_base + page_align(data_hdr->sh_size);
    /* jumptable will come after .rodata */
    jumptable = (struct ext_jump *)(rodata_runtime_base + page_align(rodata_hdr->sh_size));
 
...
}
...

Finally, because the CPU will actually be executing the jump instructions from our instr fields from the jumptable, we need to mark this memory readonly and executable (after do_text_relocations earlier in this function has completed):

loader.c:

...
 
static void parse_obj(void)
{
...
 
    do_text_relocations();
 
...
 
    /* make the jumptable readonly and executable */
    if (mprotect(jumptable, page_align(sizeof(struct ext_jump) * num_ext_symbols), PROT_READ | PROT_EXEC)) {
        perror("Failed to make the jumptable executable");
        exit(errno);
    }
}
...

At this stage we have our jumptable allocated and usable — all is left to do is to populate it properly. We’ll do this by improving the do_text_relocations implementation to handle the case of external symbols. The No runtime base address for section error from the beginning of this post is actually caused by this line in do_text_relocations:

loader.c:

...
 
static void do_text_relocations(void)
{
...
    for (int i = 0; i < num_relocations; i++) {
...
        /* symbol, with respect to which the relocation is performed */
        uint8_t *symbol_address = = section_runtime_base(&sections[symbols[symbol_idx].st_shndx]) + symbols[symbol_idx].st_value;
...
}
...

Currently we try to determine the runtime symbol address for the relocation by looking up the symbol’s section runtime address and adding the symbol’s offset. But we have established above that external symbols do not have an associated section, so their handling needs to be a special case. Let’s update the implementation to reflect this:

loader.c:

...
 
static void do_text_relocations(void)
{
...
    for (int i = 0; i < num_relocations; i++) {
...
        /* symbol, with respect to which the relocation is performed */
        uint8_t *symbol_address;
        
        /* if this is an external symbol */
        if (symbols[symbol_idx].st_shndx == SHN_UNDEF) {
            static int curr_jmp_idx = 0;
 
            /* get external symbol/function address by name */
            jumptable[curr_jmp_idx].addr = lookup_ext_function(strtab +  symbols[symbol_idx].st_name);
 
            /* x64 unconditional JMP with address stored at -14 bytes offset */
            /* will use the address stored in addr above */
            jumptable[curr_jmp_idx].instr[0] = 0xff;
            jumptable[curr_jmp_idx].instr[1] = 0x25;
            jumptable[curr_jmp_idx].instr[2] = 0xf2;
            jumptable[curr_jmp_idx].instr[3] = 0xff;
            jumptable[curr_jmp_idx].instr[4] = 0xff;
            jumptable[curr_jmp_idx].instr[5] = 0xff;
 
            /* resolve the relocation with respect to this unconditional JMP */
            symbol_address = (uint8_t *)(&jumptable[curr_jmp_idx].instr);
 
            curr_jmp_idx++;
        } else {
            symbol_address = section_runtime_base(&sections[symbols[symbol_idx].st_shndx]) + symbols[symbol_idx].st_value;
        }
...
}
...

If a relocation symbol does not have an associated section, we consider it external and call a helper function to lookup the symbol’s runtime address by its name. We store this address in the next available jumptable entry, populate the x64 jump instruction with our fixed operand and store the address of the instruction in the symbol_address variable. Later, the existing code in do_text_relocations will resolve the .text relocation with respect to the address in symbol_address in the same way it does for local symbols in part 2 of our series.

The only missing bit here now is the implementation of the newly introduced lookup_ext_function helper. Real world loaders may have complicated logic on how to find and resolve symbols in memory at runtime. But for the purposes of this article we’ll provide a simple naive implementation, which can only resolve the puts function:

loader.c:

...
 
static void *lookup_ext_function(const char *name)
{
    size_t name_len = strlen(name);
 
    if (name_len == strlen("puts") && !strcmp(name, "puts"))
        return my_puts;
 
    fprintf(stderr, "No address for function %s\n", name);
    exit(ENOENT);
}
...

Notice though that because we control the loader logic we are free to implement resolution as we please. In the above case we actually "divert" the object file to use our own "custom" my_puts function instead of the C library one. Let’s recompile the loader and see if it works:

$ gcc -o loader loader.c
$ ./loader
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52
Executing get_hello...
get_hello() = Hello, world!
Executing get_var...
get_var() = 5
Executing set_var(42)...
Executing get_var again...
get_var() = 42
Executing say_hello...
my_puts executed
Hello, world!

Hooray! We not only fixed our loader to handle external references in object files — we have also learned how to "hook" any such external function call and divert the code to a custom implementation, which might be useful in some cases, like malware research.

As in the previous posts, the complete source code from this post is available on GitHub.

How to execute an object file: Part 1

2021-03-02 Ignat Korchagin

Post Syndicated from Ignat Korchagin original https://blog.cloudflare.com/how-to-execute-an-object-file-part-1/

Calling a simple function without linking

How to execute an object file: Part 1

When we write software using a high-level compiled programming language, there are usually a number of steps involved in transforming our source code into the final executable binary:

How to execute an object file: Part 1

First, our source files are compiled by a compiler translating the high-level programming language into machine code. The output of the compiler is a number of object files. If the project contains multiple source files, we usually get as many object files. The next step is the linker: since the code in different object files may reference each other, the linker is responsible for assembling all these object files into one big program and binding these references together. The output of the linker is usually our target executable, so only one file.

However, at this point, our executable might still be incomplete. These days, most executables on Linux are dynamically linked: the executable itself does not have all the code it needs to run a program. Instead it expects to "borrow" part of the code at runtime from shared libraries for some of its functionality:

How to execute an object file: Part 1

This process is called runtime linking: when our executable is being started, the operating system will invoke the dynamic loader, which should find all the needed libraries, copy/map their code into our target process address space, and resolve all the dependencies our code has on them.

One interesting thing to note about this overall process is that we get the executable machine code directly from step 1 (compiling the source code), but if any of the later steps fail, we still can’t execute our program. So, in this series of blog posts we will investigate if it is possible to execute machine code directly from object files skipping all the later steps.

Why would we want to execute an object file?

There may be many reasons. Perhaps we’re writing an open-source replacement for a proprietary Linux driver or an application, and want to compare if the behaviour of some code is the same. Or we have a piece of a rare, obscure program and we can’t link to it, because it was compiled with a rare, obscure compiler. Maybe we have a source file, but cannot create a full featured executable, because of the missing build time or runtime dependencies. Malware analysis, code from a different operating system etc – all these scenarios may put us in a position, where either linking is not possible or the runtime environment is not suitable.

A simple toy object file

For the purposes of this article, let’s create a simple toy object file, so we can use it in our experiments:

obj.c:

int add5(int num)
{
    return num + 5;
}

int add10(int num)
{
    return num + 10;
}

Our source file contains only 2 functions, add5 and add10, which adds 5 or 10 respectively to the only input parameter. It’s a small but fully functional piece of code, and we can easily compile it into an object file:

$ gcc -c obj.c 
$ ls
obj.c  obj.o

Loading an object file into the process memory

Now we will try to import the add5 and add10 functions from the object file and execute them. When we talk about executing an object file, we mean using an object file as some sort of a library. As we learned above, when we have an executable that utilises external shared libraries, the dynamic loader loads these libraries into the process address space for us. With object files, however, we have to do this manually, because ultimately we can’t execute machine code that doesn’t reside in the operating system’s RAM. So, to execute object files we still need some kind of a wrapper program:

loader.c:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static void load_obj(void)
{
    /* load obj.o into memory */
}

static void parse_obj(void)
{
    /* parse an object file and find add5 and add10 functions */
}

static void execute_funcs(void)
{
    /* execute add5 and add10 with some inputs */
}

int main(void)
{
    load_obj();
    parse_obj();
    execute_funcs();

    return 0;
}

Above is a self-contained object loader program with some functions as placeholders. We will be implementing these functions (and adding more) in the course of this post.

First, as we established already, we need to load our object file into the process address space. We could just read the whole file into a buffer, but that would not be very efficient. Real-world object files might be big, but as we will see later, we don’t need all of the object’s file contents. So it is better to mmap the file instead: this way the operating system will lazily read the parts from the file we need at the time we need them. Let’s implement the load_obj function:

loader.c:

...
/* for open(2), fstat(2) */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

/* for close(2), fstat(2) */
#include <unistd.h>

/* for mmap(2) */
#include <sys/mman.h>

/* parsing ELF files */
#include <elf.h>

/* for errno */
#include <errno.h>

typedef union {
    const Elf64_Ehdr *hdr;
    const uint8_t *base;
} objhdr;

/* obj.o memory address */
static objhdr obj;

static void load_obj(void)
{
    struct stat sb;

    int fd = open("obj.o", O_RDONLY);
    if (fd <= 0) {
        perror("Cannot open obj.o");
        exit(errno);
    }

    /* we need obj.o size for mmap(2) */
    if (fstat(fd, &sb)) {
        perror("Failed to get obj.o info");
        exit(errno);
    }

    /* mmap obj.o into memory */
    obj.base = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (obj.base == MAP_FAILED) {
        perror("Maping obj.o failed");
        exit(errno);
    }
    close(fd);
}
...

If we don’t encounter any errors, after load_obj executes we should get the memory address, which points to the beginning of our obj.o in the obj global variable. It is worth noting we have created a special union type for the obj variable: we will be parsing obj.o later (and peeking ahead – object files are actually ELF files), so will be referring to the address both as Elf64_Ehdr (ELF header structure in C) and a byte pointer (parsing ELF files involves calculations of byte offsets from the beginning of the file).

A peek inside an object file

To use some code from an object file, we need to find it first. As I’ve leaked above, object files are actually ELF files (the same format as Linux executables and shared libraries) and luckily they’re easy to parse on Linux with the help of the standard elf.h header, which includes many useful definitions related to the ELF file structure. But we actually need to know what we’re looking for, so a high-level understanding of an ELF file is needed.

ELF segments and sections

Segments (also known as program headers) and sections are probably the main parts of an ELF file and usually a starting point of any ELF tutorial. However, there is often some confusion between the two. Different sections contain different types of ELF data: executable code (which we are most interested in in this post), constant data, global variables etc. Segments, on the other hand, do not contain any data themselves – they just describe to the operating system how to properly load sections into RAM for the executable to work correctly. Some tutorials say "a segment may include 0 or more sections", which is not entirely accurate: segments do not contain sections, rather they just indicate to the OS where in memory a particular section should be loaded and what is the access pattern for this memory (read, write or execute):

How to execute an object file: Part 1

Furthermore, object files do not contain any segments at all: an object file is not meant to be directly loaded by the OS. Instead, it is assumed it will be linked with some other code, so ELF segments are usually generated by the linker, not the compiler. We can check this by using the readelf command:

$ readelf --segments obj.o

There are no program headers in this file.

Object file sections

The same readelf command can be used to get all the sections from our object file:

$ readelf --sections obj.o
There are 11 section headers, starting at offset 0x268:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         0000000000000000  00000040
       000000000000001e  0000000000000000  AX       0     0     1
  [ 2] .data             PROGBITS         0000000000000000  0000005e
       0000000000000000  0000000000000000  WA       0     0     1
  [ 3] .bss              NOBITS           0000000000000000  0000005e
       0000000000000000  0000000000000000  WA       0     0     1
  [ 4] .comment          PROGBITS         0000000000000000  0000005e
       000000000000001d  0000000000000001  MS       0     0     1
  [ 5] .note.GNU-stack   PROGBITS         0000000000000000  0000007b
       0000000000000000  0000000000000000           0     0     1
  [ 6] .eh_frame         PROGBITS         0000000000000000  00000080
       0000000000000058  0000000000000000   A       0     0     8
  [ 7] .rela.eh_frame    RELA             0000000000000000  000001e0
       0000000000000030  0000000000000018   I       8     6     8
  [ 8] .symtab           SYMTAB           0000000000000000  000000d8
       00000000000000f0  0000000000000018           9     8     8
  [ 9] .strtab           STRTAB           0000000000000000  000001c8
       0000000000000012  0000000000000000           0     0     1
  [10] .shstrtab         STRTAB           0000000000000000  00000210
       0000000000000054  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)

There are different tutorials online describing the most popular ELF sections in detail. Another great reference is the Linux manpages project. It is handy because it describes both sections’ purpose as well as C structure definitions from elf.h, which makes it a one-stop shop for parsing ELF files. However, for completeness, below is a short description of the most popular sections one may encounter in an ELF file:

.text: this section contains the executable code (the actual machine code, which was created by the compiler from our source code). This section is the primary area of interest for this post as it should contain the add5 and add10 functions we want to use.
.data and .bss: these sections contain global and static local variables. The difference is: .data has variables with an initial value (defined like int foo = 5;) and .bss just reserves space for variables with no initial value (defined like int bar;).
.rodata: this section contains constant data (mostly strings or byte arrays). For example, if we use a string literal in the code (for example, for printf or some error message), it will be stored here. Note, that .rodata is missing from the output above as we didn’t use any string literals or constant byte arrays in obj.c.
.symtab: this section contains information about the symbols in the object file: functions, global variables, constants etc. It may also contain information about external symbols the object file needs, like needed functions from the external libraries.
.strtab and .shstrtab: contain packed strings for the ELF file. Note, that these are not the strings we may define in our source code (those go to the .rodata section). These are the strings describing the names of other ELF structures, like symbols from .symtab or even section names from the table above. ELF binary format aims to make its structures compact and of a fixed size, so all strings are stored in one place and the respective data structures just reference them as an offset in either .shstrtab or .strtab sections instead of storing the full string locally.

The `.symtab` section

At this point, we know that the code we want to import and execute is located in the obj.o‘s .text section. But we have two functions, add5 and add10, remember? At this level the .text section is just a byte blob – how do we know where each of these functions is located? This is where the .symtab (the "symbol table") comes in handy. It is so important that it has its own dedicated parameter in readelf:

$ readelf --symbols obj.o

Symbol table '.symtab' contains 10 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS obj.c
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    2
     4: 0000000000000000     0 SECTION LOCAL  DEFAULT    3
     5: 0000000000000000     0 SECTION LOCAL  DEFAULT    5
     6: 0000000000000000     0 SECTION LOCAL  DEFAULT    6
     7: 0000000000000000     0 SECTION LOCAL  DEFAULT    4
     8: 0000000000000000    15 FUNC    GLOBAL DEFAULT    1 add5
     9: 000000000000000f    15 FUNC    GLOBAL DEFAULT    1 add10

Let’s ignore the other entries for now and just focus on the last two lines, because they conveniently have add5 and add10 as their symbol names. And indeed, this is the info about our functions. Apart from the names, the symbol table provides us with some additional metadata:

The Ndx column tells us the index of the section, where the symbol is located. We can cross-check it with the section table above and confirm that indeed these functions are located in .text (section with the index 1).
Type being set to FUNC confirms that these are indeed functions.
Size tells us the size of each function, but this information is not very useful in our context. The same goes for Bind and Vis.
Probably the most useful piece of information is Value. The name is misleading, because it is actually an offset from the start of the containing section in this context. That is, the add5 function starts just from the beginning of .text and add10 is located from 15th byte and onwards.

So now we have all the pieces on how to parse an ELF file and find the functions we need.

Finding and executing a function from an object file

Given what we have learned so far, let’s define a plan on how to proceed to import and execute a function from an object file:

Find the ELF sections table and .shstrtab section (we need .shstrtab later to lookup sections in the section table by name).
Find the .symtab and .strtab sections (we need .strtab to lookup symbols by name in .symtab).
Find the .text section and copy it into RAM with executable permissions.
Find add5 and add10 function offsets from the .symtab.
Execute add5 and add10 functions.

Let’s start by adding some more global variables and implementing the parse_obj function:

loader.c:

...

/* sections table */
static const Elf64_Shdr *sections;
static const char *shstrtab = NULL;

/* symbols table */
static const Elf64_Sym *symbols;
/* number of entries in the symbols table */
static int num_symbols;
static const char *strtab = NULL;

...

static void parse_obj(void)
{
    /* the sections table offset is encoded in the ELF header */
    sections = (const Elf64_Shdr *)(obj.base + obj.hdr->e_shoff);
    /* the index of `.shstrtab` in the sections table is encoded in the ELF header
     * so we can find it without actually using a name lookup
     */
    shstrtab = (const char *)(obj.base + sections[obj.hdr->e_shstrndx].sh_offset);

...
}

...

Now that we have references to both the sections table and the .shstrtab section, we can lookup other sections by their name. Let’s create a helper function for that:

loader.c:

...

static const Elf64_Shdr *lookup_section(const char *name)
{
    size_t name_len = strlen(name);

    /* number of entries in the sections table is encoded in the ELF header */
    for (Elf64_Half i = 0; i < obj.hdr->e_shnum; i++) {
        /* sections table entry does not contain the string name of the section
         * instead, the `sh_name` parameter is an offset in the `.shstrtab`
         * section, which points to a string name
         */
        const char *section_name = shstrtab + sections[i].sh_name;
        size_t section_name_len = strlen(section_name);

        if (name_len == section_name_len && !strcmp(name, section_name)) {
            /* we ignore sections with 0 size */
            if (sections[i].sh_size)
                return sections + i;
        }
    }

    return NULL;
}

...

Using our new helper function, we can now find the .symtab and .strtab sections:

loader.c:

...

static void parse_obj(void)
{
...

    /* find the `.symtab` entry in the sections table */
    const Elf64_Shdr *symtab_hdr = lookup_section(".symtab");
    if (!symtab_hdr) {
        fputs("Failed to find .symtab\n", stderr);
        exit(ENOEXEC);
    }

    /* the symbols table */
    symbols = (const Elf64_Sym *)(obj.base + symtab_hdr->sh_offset);
    /* number of entries in the symbols table = table size / entry size */
    num_symbols = symtab_hdr->sh_size / symtab_hdr->sh_entsize;

    const Elf64_Shdr *strtab_hdr = lookup_section(".strtab");
    if (!strtab_hdr) {
        fputs("Failed to find .strtab\n", stderr);
        exit(ENOEXEC);
    }

    strtab = (const char *)(obj.base + strtab_hdr->sh_offset);
    
...
}

...

Next, let’s focus on the .text section. We noted earlier in our plan that it is not enough to just locate the .text section in the object file, like we did with other sections. We would need to copy it over to a different location in RAM with executable permissions. There are several reasons for that, but these are the main ones:

Many CPU architectures either don’t allow execution of the machine code, which is unaligned in memory (4 kilobytes for x86 systems), or they execute it with a performance penalty. However, the .text section in an ELF file is not guaranteed to be positioned at a page aligned offset, because the on-disk version of the ELF file aims to be compact rather than convenient.
We may need to modify some bytes in the .text section to perform relocations (we don’t need to do it in this case, but will be dealing with relocations in future posts). If, for example, we forget to use the MAP_PRIVATE flag, when mapping the ELF file, our modifications may propagate to the underlying file and corrupt it.
Finally, different sections, which are needed at runtime, like .text, .data, .bss and .rodata, require different memory permission bits: the .text section memory needs to be both readable and executable, but not writable (it is considered a bad security practice to have memory both writable and executable). The .data and .bss sections need to be readable and writable to support global variables, but not executable. The .rodata section should be readonly, because its purpose is to hold constant data. To support this, each section must be allocated on a page boundary as we can only set memory permission bits on whole pages and not custom ranges. Therefore, we need to create new, page aligned memory ranges for these sections and copy the data there.

To create a page aligned copy of the .text section, first we actually need to know the page size. Many programs usually just hardcode the page size to 4096 (4 kilobytes), but we shouldn’t rely on that. While it’s accurate for most x86 systems, other CPU architectures, like arm64, might have a different page size. So hard coding a page size may make our program non-portable. Let’s find the page size and store it in another global variable:

loader.c:

...

static uint64_t page_size;

static inline uint64_t page_align(uint64_t n)
{
    return (n + (page_size - 1)) & ~(page_size - 1);
}

...

static void parse_obj(void)
{
...

    /* get system page size */
    page_size = sysconf(_SC_PAGESIZE);

...
}

...

Notice, we have also added a convenience function page_align, which will round up the passed in number to the next page aligned boundary. Next, back to the .text section. As a reminder, we need to:

Find the .text section metadata in the sections table.
Allocate a chunk of memory to hold the .text section copy.
Actually copy the .text section to the newly allocated memory.
Make the .text section executable, so we can later call functions from it.

Here is the implementation of the above steps:

loader.c:

...

/* runtime base address of the imported code */
static uint8_t *text_runtime_base;

...

static void parse_obj(void)
{
...

    /* find the `.text` entry in the sections table */
    const Elf64_Shdr *text_hdr = lookup_section(".text");
    if (!text_hdr) {
        fputs("Failed to find .text\n", stderr);
        exit(ENOEXEC);
    }

    /* allocate memory for `.text` copy rounding it up to whole pages */
    text_runtime_base = mmap(NULL, page_align(text_hdr->sh_size), PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (text_runtime_base == MAP_FAILED) {
        perror("Failed to allocate memory for .text");
        exit(errno);
    }

    /* copy the contents of `.text` section from the ELF file */
    memcpy(text_runtime_base, obj.base + text_hdr->sh_offset, text_hdr->sh_size);

    /* make the `.text` copy readonly and executable */
    if (mprotect(text_runtime_base, page_align(text_hdr->sh_size), PROT_READ | PROT_EXEC)) {
        perror("Failed to make .text executable");
        exit(errno);
    }
}

...

Now we have all the pieces we need to locate the address of a function. Let’s write a helper for it:

loader.c:

...

static void *lookup_function(const char *name)
{
    size_t name_len = strlen(name);

    /* loop through all the symbols in the symbol table */
    for (int i = 0; i < num_symbols; i++) {
        /* consider only function symbols */
        if (ELF64_ST_TYPE(symbols[i].st_info) == STT_FUNC) {
            /* symbol table entry does not contain the string name of the symbol
             * instead, the `st_name` parameter is an offset in the `.strtab`
             * section, which points to a string name
             */
            const char *function_name = strtab + symbols[i].st_name;
            size_t function_name_len = strlen(function_name);

            if (name_len == function_name_len && !strcmp(name, function_name)) {
                /* st_value is an offset in bytes of the function from the
                 * beginning of the `.text` section
                 */
                return text_runtime_base + symbols[i].st_value;
            }
        }
    }

    return NULL;
}

...

And finally we can implement the execute_funcs function to import and execute code from an object file:

loader.c:

...

static void execute_funcs(void)
{
    /* pointers to imported add5 and add10 functions */
    int (*add5)(int);
    int (*add10)(int);

    add5 = lookup_function("add5");
    if (!add5) {
        fputs("Failed to find add5 function\n", stderr);
        exit(ENOENT);
    }

    puts("Executing add5...");
    printf("add5(%d) = %d\n", 42, add5(42));

    add10 = lookup_function("add10");
    if (!add10) {
        fputs("Failed to find add10 function\n", stderr);
        exit(ENOENT);
    }

    puts("Executing add10...");
    printf("add10(%d) = %d\n", 42, add10(42));
}

...

Let’s compile our loader and make sure it works as expected:

$ gcc -o loader loader.c 
$ ./loader 
Executing add5...
add5(42) = 47
Executing add10...
add10(42) = 52

Voila! We have successfully imported code from obj.o and executed it. Of course, the example above is simplified: the code in the object file is self-contained, does not reference any global variables or constants, and does not have any external dependencies. In future posts we will look into more complex code and how to handle such cases.

Security considerations

Processing external inputs, like parsing an ELF file from the disk above, should be handled with care. The code from loader.c omits a lot of bounds checking and additional ELF integrity checks, when parsing the object file. The code is simplified for the purposes of this post, but most likely not production ready, as it can probably be exploited by specifically crafted malicious inputs. Use it only for educational purposes!

The complete source code from this post can be found here.

Managed Entitlements in AWS License Manager Streamlines License Tracking and Distribution for Customers and ISVs

2020-12-03 Harunobu Kameda

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/managed-entitlements-for-aws-license-manager-streamlines-license-management-for-customers-and-isvs/

AWS License Manager is a service that helps you easily manage software licenses from vendors such as Microsoft, SAP, Oracle, and IBM across your Amazon Web Services (AWS) and on-premises environments. You can define rules based on your licensing agreements to prevent license violations, such as using more licenses than are available. You can set the rules to help prevent licensing violations or notify you of breaches. AWS License Manager also offers automated discovery of bring your own licenses (BYOL) usage that keeps you informed of all software installations and uninstallations across your environment and alerts you of licensing violations.

License Manager can manage licenses purchased in AWS Marketplace, a curated digital catalog where you can easily find, purchase, deploy, and manage third-party software, data, and services to build solutions and run your business. Marketplace lists thousands of software listings from independent software vendors (ISVs) in popular categories such as security, networking, storage, machine learning, business intelligence, database, and DevOps.

Managed entitlements for AWS License Manager
Starting today, you can use managed entitlements, a new feature of AWS License Manager that lets you distribute licenses across your AWS Organizations, automate software deployments quickly and track licenses – all from a single, central account. Previously, each of your users would have to independently accept licensing terms and subscribe through their own individual AWS accounts. As your business grows and scales, this becomes increasingly inefficient.

Customers can use managed entitlements to manage more than 8,000 listings available for purchase from more than 1600 vendors in the AWS Marketplace. Today, AWS License Manager automates license entitlement distribution for Amazon Machine Image, Containers and Machine Learning products purchased in the Marketplace with a variety of solutions.

How It Works
Managed entitlements provides built-in controls that allow only authorized users and workloads to consume a license within vendor-defined limits. This new license management mechanism also eliminates the need for ISVs to maintain their own licensing systems and conduct costly audits.

Each time a customer purchases licenses from AWS Marketplace or a supported ISV, the license is activated based on AWS IAM credentials, and the details are registered to License Manager.

Administrators distribute licenses to AWS accounts. They can manage a list of grants for each license.

Benefits for ISVs
AWS License Manager managed entitlements provides several benefits to ISVs to simplify the automatic license creation and distribution process as part of their transactional workflow. License entitlements can be distributed to end users with and without AWS accounts. Managed entitlements streamlines upgrades and renewals by removing expensive license audits and provides customers with a self-service tracking tool with built-in license tracking capabilities. There are no fees for this feature.

Managed entitlements provides the ability to distribute licenses to end users who do not have AWS accounts. In conjunction with the AWS License Manager, ISVs create a unique long-term token to identify the customer. The token is generated and shared with the customer. When the software is launched, the customer enters the token to activate the license. The software exchanges the long-term customer token for a short-term token that is passed to the API and the setting of the license is completed. For on-premises workloads that are not connected to the Internet, ISVs can generate a host-specific license file that customers can use to run the software on that host.

Now Available
This new enhancement to AWS License Manager is available today for US East (N. Virginia), US West (Oregon), and Europe (Ireland) with other AWS Regions coming soon.

Licenses purchased on AWS Marketplace are automatically created in AWS License Manager and no special steps are required to use managed entitlements. For more details about the new feature, see the managed entitlement pages on AWS Marketplace, and the documentation. For ISVs to use this new feature, please visit our getting started guide.

Get started with AWS License Manager and the new managed entitlements feature today.

– Kame

За блогърстването и още нещо

2020-11-25 Yovko Lambrev

Post Syndicated from Yovko Lambrev original https://yovko.net/za-blogarstvaneto/

Този текст е провокиран от Иван, който обзет от носталгия по доброто, старо блогърстване ме сръчка да споделя и аз някакви мисли по темата. А вероятно и да имам повод напиша нещо тук. Oт последният ми пост са минали три месеца. Вероятно до края на тази особена 2020 година няма да напиша друг, но пък бих използвал време от почивните дни около Коледа да реорганизирам сайта си и своето online присъствие. За пореден път. Иначе казано, щом виждам смисъл да го правя, значи не съм се отказал окончателно.

Макар и да съзнавам напълно, че няма как да е като преди.

Интересно съвпадение е, че някъде точно по това време миналата година убих най-накрая Facebook профила си. И не просто го замразих – изтрих го. С искрена и неприкрита ненавист към тази зловеща платформа.

Разбира се, това не значи, че съм се разписал повече тук.

Блогването имаше повече стойност, когато Google Reader беше жив и беше платформа за комуникация и диалог – кой какво чете, кой какво споделя от своите прочетени неща. Днес това все още е възможно – например в Inoreader. Но вече е много по-трудно да събереш всички на едно място с подобна цел (ако не си от hi-tech гигантите), а друга е темата дали това изобщо е добра идея. Един дистрибутиран RSS-четец с подобна функционалност ще е най-доброто възможно решение.

Уви, за щастие RSS е още жив, но натискът да бъде убит този чуден протокол за споделяне на съдържание е огромен. От една страна големият враг са всички, които искат да обвързват потребители, данни и съдържание със себе си, а от друга немарливостта на web-разработчиците и дигиталните маркетолози, които проповядват, че RSS е мъртъв и не си струва усилията. Днес все по-често сайтовете са с нарочно премахнат или спрян RSS.

Без такива неща, блоговете в момента са като мегафон, на който са извадени батериите – островчета, които продължават да носят смисъл и значение, но като в легендата за хан Кубрат – няма я силата на съчките събрани в сноп.

Всъщност вината ни е обща. Защото се подхлъзваме като малки деца на шаренко по всяка заигралка в мрежата, без да оценяваме плюсовете и минусите, още по-малко щетите, които би могла да нанесе. Интернет вече е платформа с по-голямо социално, отколкото технологично значение – оценка на въздействието би трябвало да бъде задължителна стъпка в тестването на всяка нова идея или проект в мрежата. И то не толкова от този, който я пуска, защото обикновено авторът е заслепен от мечти и жажда за слава (а често и неприкрита алчност), а от първите, че и последващите потребители.

Блогърите предадохме блогването, защото „глей к'во е яко да се пра'иш на интересен в 140 знака в твитър“ или „как ши стана мега-хипер-секси инфлуенцър с facebook и instagram в три куци стъпки“. Нищо, че социалките може и да те усилват (до едно време), но после само ти крадат трафика. И се превърщат в посредник със съмнителна добавена стойност, но пък взимащ задължително своето си.

Това всъщност беше само върхът на айсберга, защото от самото начало технооптимизмът беше в повече. В много посоки и отношения. И ако това в романтичните времена на съзиданието бе оправдано, и дори полезно – сега ни е нужен технореализъм и критично отношение за да поправим счупеното.

Интернет днес не е това, което трябваше да бъде. Гравитацията на гигантите засмука всичко. И продължава. И ако не искаме да се озовем в черна дупка са нужни общи усилия за децентрализирането му. Това няма да е лесно, защото този път създателите няма да имат много съюзници от страна на бизнеса. Предстои трудна война с гигантите и влиянието им. Единственият могъщ съюзник сме си ние хората, ако успеем да си обясним помежду си защо е нужно да се съпротивляваме.

Засега не успяваме.

P.S. Т.е. псст, подхвърлям топката към Ясен и Мария.

Заглавна снимка: Annie Spratt

За блогърстването и още нещо

2020-11-25 Йовко Ламбрев

Post Syndicated from Йовко Ламбрев original https://yovko.net/za-blogarstvaneto/

Макар и да съзнавам напълно, че няма как да е като преди.

Разбира се, това не значи, че съм се разписал повече тук.

Засега не успяваме.

P.S. Т.е. псст, подхвърлям топката към Ясен и Мария.

Заглавна снимка: Annie Spratt

За блогърстването и още нещо

2020-11-25 Йовко Ламбрев

Post Syndicated from Йовко Ламбрев original https://yovko.net/za-blogarstvaneto/

Макар и да съзнавам напълно, че няма как да е като преди.

Разбира се, това не значи, че съм се разписал повече тук.

Засега не успяваме.

P.S. Т.е. псст, подхвърлям топката към Ясен и Мария.

Заглавна снимка: Annie Spratt

За блогърстването и още нещо

2020-11-25 Йовко Ламбрев

Post Syndicated from Йовко Ламбрев original https://yovko.net/za-blogarstvaneto/

Макар и да съзнавам напълно, че няма как да е като преди.

Разбира се, това не значи, че съм се разписал повече тук.

Засега не успяваме.

P.S. Т.е. псст, подхвърлям топката към Ясен и Мария.

Заглавна снимка: Annie Spratt

За блогърстването и още нещо

2020-11-25 Yovko Lambrev

Post Syndicated from Yovko Lambrev original https://yovko.net/za-blogarstvaneto/

Този текст е провокиран от публикация на Иван, който обзет от носталгия по доброто, старо блогърстване ме сръчка да споделя и аз някакви мисли по темата. А вероятно и да имам повод напиша нещо тук. Oт последният ми пост са минали три месеца. Вероятно до края на тази особена 2020 година няма да напиша друг, но пък бих използвал време от почивните дни около Коледа да реорганизирам сайта си и своето online присъствие. За пореден път. Иначе казано, щом виждам смисъл да го правя, значи не съм се отказал окончателно.

Макар и да съзнавам напълно, че няма как да е като преди.

Разбира се, това не значи, че съм се разписал повече тук.

Блогърите предадохме блогването, защото „глей к’во е яко да се пра’иш на интересен в 140 знака в твитър“ или „как ши стана мега-хипер-секси инфлуенцър с facebook и instagram в три куци стъпки“. Нищо, че социалките може и да те усилват (до едно време), но после само ти крадат трафика. И се превърщат в посредник със съмнителна добавена стойност, но пък взимащ задължително своето си.

Засега не успяваме.

P.S. Т.е. псст, подхвърлям топката към Ясен и Мария.

Last phase of the desktop wars?

2020-09-25 Armed and Dangerous

Post Syndicated from Armed and Dangerous original http://esr.ibiblio.org/?p=8764

The two most intriguing developments in the recent evolution of the Microsoft Windows operating system are Windows System for Linux (WSL) and the porting of their Microsoft Edge browser to Ubuntu.

For those of you not keeping up, WSL allows unmodified Linux binaries to run under Windows 10. No emulation, no shim layer, they just load and go.

Microsoft developers are now landing features in the Linux kernel to improve WSL. And that points in a fascinating technical direction. To understand why, we need to notice how Microsoft’s revenue stream has changed since the launch of its cloud service in 2010.

Ten years later, Azure makes Microsoft most of its money. The Windows monopoly has become a sideshow, with sales of conventional desktop PCs (the only market it dominates) declining. Accordingly, the return on investment of spending on Windows development is falling. As PC volume sales continue to fall off , it’s inevitably going to stop being a profit center and turn into a drag on the business.

Looked at from the point of view of cold-blooded profit maximization, this means continuing Windows development is a thing Microsoft would prefer not to be doing. Instead, they’d do better putting more capital investment into Azure – which is widely rumored to be running more Linux instances than Windows these days.

Our third ingredient is Proton. Proton is the emulation layer that allows Windows games distributed on Steam to run over Linux. It’s not perfect yet, but it’s getting close. I myself use it to play World of Warships on the Great Beast.

The thing about games is that they are the most demanding possible stress test for a Windows emulation layer, much more so than business software. We may already be at the point where Proton-like technology is entirely good enough to run Windows business software over Linux. If not, we will be soon.

So, you’re a Microsoft corporate strategist. What’s the profit-maximizing path forward given all these factors?

It’s this: Microsoft Windows <em>becomes</em> a Proton-like emulation layer over a Linux kernel, with the layer getting thinner over time as more of the support lands in the mainline kernel sources. The economic motive is that Microsoft sheds an ever-larger fraction of its development costs as less and less has to be done in-house.

If you think this is fantasy, think again. The best evidence that it’s already the plan is that Microsoft has already ported Edge to run under Linux. There is only one way that makes any sense, and that is as a trial run for freeing the rest of the Windows utility suite from depending on any emulation layer.

So, the end state this all points at is: New Windows is mostly a Linux kernel, there’s an old-Windows emulation over it, but Edge and the rest of the Windows user-land utilities <em>don’t use the emulation.</em> The emulation layer is there for games and other legacy third-party software.

Economic pressure will be on Microsoft to deprecate the emulation layer. Partly because it’s entirely a cost center. Partly because they want to reduce the complexity cost of running Azure. Every increment of Windows/Linux convergence helps with that – reduces administration and the expected volume of support traffic.

Eventually, Microsoft announces upcoming end-of-life on the Windows emulation. The OS itself , and its userland tools, has for some time already been Linux underneath a carefully preserved old-Windows UI. Third-party software providers stop shipping Windows binaries in favor of ELF binaries with a pure Linux API…

…and Linux finally wins the desktop wars, not by displacing Windows but by co-opting it. Perhaps this is always how it had to be.

Хей

2020-08-15 Йовко Ламбрев

Post Syndicated from Йовко Ламбрев original https://yovko.net/hey/

Хей

Идната година електронната поща ще навърши половин век. И до днес тя е от фундаменталните мрежови услуги, без които комуникацията в Интернет нямаше да бъде същата.

През последните 25 години, през които аз лично използвам електронна поща, се нагледах на какви ли не „революционни“ идеи за развитие на имейла. Куцо и сакато се напъваше да поправя технологията. Какви ли не маркетингови и други усилия се потрошиха да обясняват на света колко счупена и неадекватна била електронната поща и как именно поредното ново изобретение щяло да бъде убиецът на имейла.

Таратанци! Имейлът си е жив и здрав. И продължава да надживява мимолетните изпръцквания на всичките му конкуренти дотук.

Факт… имейлът не е съвършен. На този свят живеят твърде много амбициозни човечета, чийто интелект не стига за много повече от това да се пънат да злоупотребяват с всевъзможни неща. Част от тях пълнят пощенските ни кутии със СПАМ (и виртуалните, и физическите ни пощи). Други едни гадинки прекаляват с маркетинговите си стремежи да ни дебнат и профилират и ни пускат писъмца с „невидими“ вградени пиксели и всевъзможни подобни подслушвачки, за да знаят дали и кога сме отворили безценните им скучни и шаблонни бюлетини и дали са ни прилъгали да щракнем някъде из тях преди да ги запратим в кошчето. Трети тип досадници често са скъпите ни колеги, които копират половината фирма за щяло или не щяло. Уви, много често така е наредил шефът или го изисква корпоративната „култура“.

Да, четенето (и отговарянето) на имейли се е превърнало в тегоба заради твърде много спам, маркетинг и глупости, които циркулират насам-натам (много често за всеки случай). И защото почти никой не полага усилия да спазва някакъв имейл етикет.

Аз съм абсолютен фен на имейла, в есенциалния му вид. Не споделям идеята, че той е счупен или неудобен. Ако човек положи някакви усилия да организира входящата си поща с малко филтри и поддиректории и си създаде навик да чете поща не повече от един-два пъти дневно, животът става малко по-светъл. И да – спрял съм всички автоматични уведомления за нови писма по смартфони, декстопи, таблети и др. Електронната поща е за асинхронна комуникация, а именно да чета и пиша, когато мога и искам. Това не е чат!

Затова изцвилих от удоволствие, когато научих, че Basecamp работят по своя email услуга, с идеята да се преборят с част от досадните неща около имейла. И веднага се записах на опашката за желаещи да я пробват. Вече два месеца я използвам активно и смея да твърдя, че за първи път е постигнато нещо смислено по темата да направим имейла по-добър. Нещото се казва Hey.com и е платена персонална услуга за личен email. Планира се в бъдеще да се предлага и бизнес опция със собствен домейн, но засега може да се получи само личен адрес от типа [email protected]

Няма да скрия, че някои неща в началото не ми се понравиха съвсем. И най-вече това, че няма как да се ползва стандартен клиент за поща като Thunderbird или Apple Mail, защото услугата е недостъпна по IMAPS/SMTPS. Но смисълът да се ползват само и единствено специалните клиенти на Hey е в напълно различния подход към електронната поща и специфичните за услугата функционалности, които няма как да се ползват през стандартен mail клиент. Така че цената на този компромис си струва.

Hey не прилича на нищо друго, което досега съм ползвал за поща. Има нужда от малко свикване, но само след ден или два аз лично не искам да виждам и чувам за никаква друга пощенска услуга или mail клиент.

Философията е следната – аз имам пълен контрол върху това каква поща искам да получвам. Всичката входяща поща минава първоначално през едно нещо, наречено screener, който спира всяко писмо, което пристига за първи път от адрес, който досега не ми е писал. Ако там попадне писмо от някой досадник или спамър аз просто отбелязвам, че не желая да чувам за него и… край. Никога повече няма да чуя и видя писмо от него. Е, освен ако не реши да ми пише от друг адрес, но така пак ще попадне в скрийнъра и пак мога да го маркирам като нежелан кореспондент.

Всъщност такова писмо не се връща обратно, изпращачът му не получава никаква обратна връзка какво се случва, но аз просто никога няма да го видя. Кеф! 🙂 Перфектно оръжие срещу досадници!

На теория това работи и срещу спамъри, но обикновено голяма част от спама се филтрира предварително и дори не достига до скрийнъра, макар че от време на време се случва. Случва се също и някое полезно писмо да се маркира като SPAM, но доста рядко. В крайна сметка, никой антиспам филтър не е съвършен, особено когато някои писма наистина приличат на СПАМ заради платформата, през която са изпратени, или поради немарливостта на изпращача им.

Иначе казано, поне еднократно, аз трябва да позволя на всеки, който би искал да ми пише, да може да го прави. И това се случва при първото получено от него писмо. Демек нямате втори шанс да направите първоначално добро впечатление. 🙂

С времето мога да размисля и да заглуша такива, които съм допуснал да ми пишат, или обратно – да позволя на такива, които съм бил заглушил.

Писмата, които съм се съгласил да получвам, се разпределят най-общо в три категории – едната се нарича Paper Trail и обикновено е за писма, които са за някакви чисто информативни цели, без да изискват отговор. Най-често потвърждения за платени сметки, за сменени пароли и други такива неща. Другата е наречена The Feed и е предназначена за бюлетини, маркетингови послания, неща за четене, когато имам време за тях. И третата, която всъщност е основната, се нарича Imbox (от important box), където са всички тези писма, които не са спам, не са само за информация, не са маркетингови или други читанки. Тези писма обикновено изискват внимание, а най-често и отговор. Те са истинската ми поща и важните неща. Та, цялата идея е в Imbox-а ми да достигат само такива писма.

За всеки кореспондент мога да определям дали писмата му да остават в Imbox, или да ходят в The Feed или Paper Trail. На всеки email мога да реша да отговоря веднага или да си отбележа някои от тях за по-късно (reply later). Да маркирам някаква важна информация в тях (clip) или да ги заделя настрани (aside) по някаква своя причина. Разбира се, мога и да си създавам свои етикети и да си отбелязвам различни писма с тях по някакви лични критерии и съображения.

Идеята за inbox zero тук я няма. Няма го и безполезното архивиране, което реално е преместване на писма. Важната ви поща си се трупа в Imbox – етикетирана или не (по ваше желание). Всичко, което ви е нужно, е една добра търсачка. Е, имате я. Както и пространство от цели 100GB.

Има и други глезотии, които могат да се прегледат тук. А може и да се направи тестов акаунт за вкусване на услугата и интерфейса ѝ. Всичко може да се използва директно през браузър, но има приложения за мобилни телефони, както и за десктоп операционни системи.

И още нещо, което просто не мога да не отбележа, защото ме спечели окончателно и ми доставя перверзно удоволствие. Всички вградени в писмата пиксели за проследяване, парчета код, маркетингови шитни и въобще всякакви такива номера биват обезвреждани автоматично, а писмата – маркирани, че са съдържали такива гадинки. Няма не искам, няма недей! Kill 'em all! Плачете, маркетинг гурута! Ронете кървави сълзи! Не работят вашите hubspot буби, salesforce бози, facebook пиксели, mailchimp вендузи и прочие маркетингови кърлежи. Сега в hey.com, а скоро и по други интернет ширини! 🙂

С две думи казано, Hey е за всички, които ценят времето си. За тези, за които е важно имейлът да е полезен инструмент, а не пиявица на ценно време. Hey ще се хареса на всички, които ползват интензивно електронна поща като основна комуникация, защото ще внесе спокойствие в преживяването им с личния им имейл. Такова, че понякога се чудя дали наистина нямам нова поща и дали нещо не се е счупило.

Hey не е за тези, които търсят к'ва да е ефтинка пощичка в abv.bg, gmail.com и прочие. Цената е $99 на година. Няма отстъпки, по-базови или по-премиални планове, нито някакви сложни опции. Услугата е само една и толкова. Струва си обаче всеки цент. А адресът, който сте си взели, си остава завинаги ваш (пренасочва се), дори да се откажете след време.

P.S. Хей! Нямам никакви взаимоотношения с Basecamp или екипа им. Просто съм фен на продуктите, но и на философията им за бизнеса и живота. Платил съм за услугата с лични средства и по собствено желание.

Documentation as knowledge capture

2020-07-14 esr

Post Syndicated from esr original http://esr.ibiblio.org/?p=8741

Maybe you’re one of the tiny minority of programmers that, like me, already enjoys writing documentation and works hard at doing it right. If so,the rest of this essay is not for you and you can skip it.

Otherwise, you might want to re-read (or at least re-skim) Ground-Truth Documents before continuing. Because ground-truth documents are a special case of a more general reason why you might want to try to change your mindset about documentation.

In that earlier essay I used the term “knowledge capture” in passing. This is a term of art from AI; it refers to the process of extracting domain knowledge from the heads of human experts into a form that can be expressed as an algorithm executable by the literalistic logic of a computer.

What I invite you to think about now is how writing documentation for software you are working on can save you pain and effort by (a) capturing knowledge you have but don’t know you have, and (b) eliciting knowledge that you have not yet developed.

Humans, including me and you, are sloppy and analogical thinkers who tend to solve problems by pattern-matching against noisy data first and checking our intuitions with logic after the fact (if we actually get that far). There’s no point in protesting that it shouldn’t be that way, that we should use rigorous logic all the way down, because our brains simply aren’t wired for that. Evolved cognition is a kludge – more properly, multiple stacks of kludges – developed under selection to be just barely adequate at coping.

This kludginess is revealed by, for example, optical illusions. And by the famous 7±2 result about the very limited sized of the human working set. And the various well-documented ways that human beings are extremely bad at statistical reasoning. And in many other ways…

When you do work that is as demanding of rigor as software engineering, one of your central challenges is hacking around the limitations of your own brain. Sometimes this develops in very obvious ways; the increasing systematization of testing during development during the last couple of decades, for example.

Other brain hacks are more subtle. Which is why I am here to suggest that you try to stop thinking of documentation as a chore you do for others, and instead think of it as a way to explore your problem space. and the space in your head around your intuitions about the problem, so you can shine light into the murkier corners of both. Writing documentation can function as valuable knowledge capture about your problem domain even when you are the only expert about what you are trying to do.

This is why my projects often have a file called “designer’s notes” or “hacking guide”. Early in a project these may just be random jottings that are an aid to my own memory about why things are the way they are. They tend to develop into a technical briefing about the code internals for future contributors. This is a good habit to form if you want to have future contributors!

But even though the developed version of a “designer’s notes” looks other-directed, it’s really a thing I do to reduce my own friction costs. And not just in the communication-to-my-future-self way either. Yes, it’s tremendously valuable to have a document that, months or years after I wrote it, reminds me of my assumptions when I have half-forgotten them. And yes, a “designer’s notes” file is good practice for that reason alone. But its utility does not even start there, let alone end there.

Earlier, I wrote of (a) capturing knowledge you have but don’t know you have, and (b) eliciting knowledge that you have not yet developed. The process of writing your designer’s notes can be powerful and catalytic that way even if they’re never communicated. The thing you have to do in your brain to narratize your thoughts so they can be written down is itself an exploratory tool.

As with “designer’s notes” so with every other form of documentation from the one-line code comment to a user-oriented HOWTO. When you achieve right mindset about these they are no longer burdens; instead they become an integral part of your creative process, enabling you to design better and write better code with less total effort.

I understand that to a lot of programmers who now experience writing prose as difficult work this might seem like impossible advice. But I think there is a way from where you are to right mindset. That way is to let go of the desire for perfection in your prose, at least early on. Sentence fragments are OK. Misspellings are OK. Anything you write that explores the space is OK, no matter how barbarous it would look to your third-grade grammar teacher or the language pedants out there (including me).

It is more important to do the discovery process implied by writing down your ideas than it is for the result to look polished. If you hold on to that thought, get in the habit of this kind of knowledge capture, and start benefiting from it, then you might find that over time your standards rise and it gets easier to put more effort into polishing.

If that happens, sure; let it happen – but it’s not strictly necessary. The only thing that is necessary is that you occasionally police what you’ve recorded so it doesn’t drift into reporting something the software no longer does. That sort of thing is a land-mine for anyone else who might read your notes and very bad form.

Other than that, though, the way to get to where you do the documentation-as-knowledge-capture thing well is by starting small; allow a low bar for polish and completeness and grow the capability organically. You will know you have won when it starts being fun.

Overview

Architecture

Prerequisites

Setup

Install Anbox cloud

Stream an Android game application

Test the game

Clean-up

Conclusion

See you next time!

See you next time!

Intro

It starts with an oops

What has happened?

The suspicious offset

Packet path

Super-packets

The reproducer

The bug

The fix

Outro

The Waterfall Approach

Agile Approaches

Which approach does Cloudflare follow?

“SHIP”s and “Epic”s

Planning

Keeping us safe

P.S. — we’re hiring!

The Challenge

Workers and Durable Objects

A Simple Start

Honing in on specific events

Scaling up with shards

Reaching the moon by sampling

Putting it all together

How to get involved?

Dealing with external libraries

Exploring PLT/GOT

Implementing a simplified PLT/GOT

Calling a simple function without linking

Why would we want to execute an object file?

A simple toy object file

Loading an object file into the process memory

A peek inside an object file

ELF segments and sections

Object file sections

The .symtab section

Finding and executing a function from an object file

Security considerations

The collective thoughts of the interwebz

The `.symtab` section