Tag Archives: Engineering

Debugging network stalls on Kubernetes

Post Syndicated from Theo Julienne original https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/

We’ve talked about Kubernetes before, and over the last couple of years it’s become the standard deployment pattern at GitHub. We now run a large portion of both internal and public-facing services on Kubernetes. As our Kubernetes clusters have grown, and our targets on the latency of our services have become more stringent, we began to notice that certain services running on Kubernetes in our environment were experiencing sporadic latency that couldn’t be attributed to the performance characteristics of the application itself.

Essentially, applications running on our Kubernetes clusters would observe seemingly random latency of up to and over 100ms on connections, which would cause downstream timeouts or retries. Services were expected to be able to respond to requests in well under 100ms, which wasn’t feasible when the connection itself was taking so long. Separately, we also observed very fast MySQL queries, which we expected to take a matter of milliseconds and that MySQL observed taking only milliseconds, were being observed taking 100ms or more from the perspective of the querying application.

The problem was initially narrowed down to communications that involved a Kubernetes node, even if the other side of a connection was outside Kubernetes. The most simple reproduction we had was a Vegeta benchmark that could be run from any internal host, targeting a Kubernetes service running on a node port, and would observe the sporadically high latency. In this post, we’ll walk through how we tracked down the underlying issue.

Removing complexity to find the path at fault

Using an example reproduction, we wanted to narrow down the problem and remove layers of complexity. Initially, there were too many moving parts in the flow between Vegeta and pods running on Kubernetes to determine if this was a deeper network problem, so we needed to rule some out.

vegeta to kube pod via NodePort

The client, Vegeta, creates a TCP connection to any kube-node in the cluster. Kubernetes runs in our data centers as an overlay network (a network that runs on top of our existing datacenter network) that uses IPIP (which encapsulates the overlay network’s IP packet inside the datacenter’s IP packet). When a connection is made to that first kube-node, it performs stateful Network Address Translation (NAT) to convert the kube-node’s IP and port to an IP and port on the overlay network (specifically, of the pod running the application). On return, it undoes each of these steps. This is a complex system with a lot of state, and a lot of moving parts that are constantly updating and changing as services deploy and move around.

As part of running a tcpdump on the original Vegeta benchmark, we observed the latency during a TCP handshake (between SYN and SYN-ACK). To simplify some of the complexity of HTTP and Vegeta, we can use hping3 to just “ping” with a SYN packet and see if we observe the latency in the response packet—then throw away the connection. We can filter it to only include packets over 100ms and get a simpler reproduction case than a full Layer 7 Vegeta benchmark or attack against the service. The following “pings” a kube-node using TCP SYN/SYN-ACK on the “node port” for the service (30927) with an interval of 10ms, filtered for slow responses:

[email protected] ~ $ sudo hping3 172.16.47.27 -S -p 30927 -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms

Our first new observation from the sequence numbers and timings is that this isn’t a one-off, but is often grouped, like a backlog that eventually gets processed.

Next up, we want to narrow down which component(s) were potentially at fault. Is it the kube-proxy iptables NAT rules that are hundreds of rules long? Is it the IPIP tunnel and something on the network handling them poorly? One way to validate this is to test each step of the system. What happens if we remove the NAT and firewall logic and only use the IPIP part:

hping 3 on the IPIP tunnel alone

Linux thankfully lets you just talk directly to an overlay IP when you’re on a machine that’s part of the same network, so that’s pretty easy to do:

[email protected] ~ $ sudo hping3 10.125.20.64 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7346 win=0 rtt=127.3 ms

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7347 win=0 rtt=117.3 ms

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7348 win=0 rtt=107.2 ms

Based on our results, the problem still remains! That rules out iptables and NAT. Is it TCP that’s the problem? Let’s see what happens when we perform a normal ICMP ping:

[email protected] ~ $ sudo hping3 10.125.20.64 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=28 ip=10.125.20.64 ttl=64 id=42594 icmp_seq=104 rtt=110.0 ms

len=28 ip=10.125.20.64 ttl=64 id=49448 icmp_seq=4022 rtt=141.3 ms

len=28 ip=10.125.20.64 ttl=64 id=49449 icmp_seq=4023 rtt=131.3 ms

len=28 ip=10.125.20.64 ttl=64 id=49450 icmp_seq=4024 rtt=121.2 ms

len=28 ip=10.125.20.64 ttl=64 id=49451 icmp_seq=4025 rtt=111.2 ms

len=28 ip=10.125.20.64 ttl=64 id=49452 icmp_seq=4026 rtt=101.1 ms

len=28 ip=10.125.20.64 ttl=64 id=50023 icmp_seq=4343 rtt=126.8 ms

len=28 ip=10.125.20.64 ttl=64 id=50024 icmp_seq=4344 rtt=116.8 ms

len=28 ip=10.125.20.64 ttl=64 id=50025 icmp_seq=4345 rtt=106.8 ms

len=28 ip=10.125.20.64 ttl=64 id=59727 icmp_seq=9836 rtt=106.1 ms

Our results show that the problem still exists. Is it the IPIP tunnel that’s causing the problem? Let’s simplify things further:

hping 3 directly between hosts

Is it possible that it’s every packet between these two hosts?

[email protected] ~ $ sudo hping3 172.16.47.27 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=46 ip=172.16.47.27 ttl=61 id=41127 icmp_seq=12564 rtt=140.9 ms

len=46 ip=172.16.47.27 ttl=61 id=41128 icmp_seq=12565 rtt=130.9 ms

len=46 ip=172.16.47.27 ttl=61 id=41129 icmp_seq=12566 rtt=120.8 ms

len=46 ip=172.16.47.27 ttl=61 id=41130 icmp_seq=12567 rtt=110.8 ms

len=46 ip=172.16.47.27 ttl=61 id=41131 icmp_seq=12568 rtt=100.7 ms

len=46 ip=172.16.47.27 ttl=61 id=9062 icmp_seq=31443 rtt=134.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9063 icmp_seq=31444 rtt=124.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9064 icmp_seq=31445 rtt=114.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9065 icmp_seq=31446 rtt=104.2 ms

Behind the complexity, it’s as simple as two kube-node hosts sending any packet, even ICMP pings, to each other. They’ll still see the latency, if the target host is a “bad” one (some are worse than others).

Now there’s one last thing to question: we clearly don’t observe this everywhere, so why is it just on kube-node servers? And does it occur when the kube-node is the sender or the receiver? Luckily, this is also pretty easy to narrow down by using a host outside Kubernetes as a sender, but with the same “known bad” target host (from a staff shell host to the same kube-node). We can observe this is still an issue in that direction:

[email protected] ~ $ sudo hping3 172.16.47.27 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=312 win=0 rtt=108.5 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=5903 win=0 rtt=119.4 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=6227 win=0 rtt=139.9 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=7929 win=0 rtt=131.2 ms

And then perform the same from the previous source kube-node to a staff shell host (which rules out the source host, since a ping has both an RX and TX component):

[email protected] ~ $ sudo hping3 172.16.33.44 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
^C
--- 172.16.33.44 hping statistic ---
22352 packets transmitted, 22350 packets received, 1% packet loss
round-trip min/avg/max = 0.2/7.6/1010.6 ms

Looking into the packet captures of the latency we observed, we get some more information. Specifically, that the “sender” host (bottom) observes this timeout while the “receiver” host (top) does not—see the Delta column (in seconds):

packet captures of the latency

Additionally, by looking at the difference between the ordering of the packets (based on the sequence numbers) on the receiver side of the TCP and ICMP results above, we can observe that ICMP packets always arrive in the same sequence they were sent, but with uneven timing, while TCP packets are sometimes interleaved, but a subset of them stall. Notably, we observe that if you count the ports of the SYN packets, the ports are not in order on the receiver side, while they’re in order on the sender side.

There is a subtle difference between how modern server NICs—like we have in our data centers—handle packets containing TCP vs ICMP. When a packet arrives, the NIC hashes the packet “per connection” and tries to divvy up the connections across receive queues, each (approximately) delegated to a given CPU core. For TCP, this hash includes both source and destination IP and port. In other words, each connection is hashed (potentially) differently. For ICMP, just the IP source and destination are hashed, since there are no ports.

Another new observation is that we can tell that ICMP observes stalls on all communications between the two hosts during this period from the sequence numbers in ICMP vs TCP, while TCP does not. This tells us that the RX queue hashing is likely in play, almost certainly indicating the stall is in processing RX packets, not in sending responses.

This rules out kube-node transmits, so we now know that it’s a stall in processing packets, and that it’s on the receive side on some kube-node servers.

Deep dive into Linux kernel packet processing

To understand why the problem could be on the receiving side on some kube-node servers, let’s take a look at how the Linux kernel processes packets.

Going back to the simplest traditional implementation, the network card receives a packet and sends an interrupt to the Linux kernel stating that there’s a packet that should be handled. The kernel stops other work, switches context to the interrupt handler, processes the packet, then switches back to what it was doing.

traditional interrupt-driven approach

This context switching is slow, which may have been fine on a 10Mbit NIC in the 90s, but on modern servers where the NIC is 10G and at maximal line rate can bring in around 15 million packets per second, on a smaller server with eight cores that could mean the kernel is interrupted millions of times per second per core.

Instead of constantly handling interrupts, many years ago Linux added NAPI, the networking API that modern drivers use for improved performance at high packet rates. At low rates, the kernel still accepts interrupts from the NIC in the method we mentioned. Once enough packets arrive and cross a threshold, it disables interrupts and instead begins polling the NIC and pulling off packets in batches. This processing is done in a “softirq”, or software interrupt context. This happens at the end of syscalls and hardware interrupts, which are times that the kernel (as opposed to userspace) is already running.

NAPI polling in softirq at end of syscalls

This is much faster, but brings up another problem. What happens if we have so many packets to process that we spend all our time processing packets from the NIC, but we never have time to let the userspace processes actually drain those queues (read from TCP connections, etc.)? Eventually the queues would fill up, and we’d start dropping packets. To try and make this fair, the kernel limits the amount of packets processed in a given softirq context to a certain budget. Once this budget is exceeded, it wakes up a separate thread called ksoftirqd (you’ll see one of these in ps for each core) which processes these softirqs outside of the normal syscall/interrupt path. This thread is scheduled using the standard process scheduler, which already tries to be fair.

NAPI polling crossing threshold in schedule ksoftirqd

With an overview of the way the kernel is processing packets, we can see there is definitely opportunity for this processing to become stalled. If the time between softirq processing calls grows, packets could sit in the NIC RX queue for a while before being processed. This could be something deadlocking the CPU core, or it could be something slow preventing the kernel from running softirqs.

Narrow down processing to a core or method

At this point, it makes sense that this could happen, and we know we’re observing something that looks a lot like it. The next step is to confirm this theory, and if we do, understand what’s causing it.

Let’s revisit the slow round trip packets we saw before:

len=46 ip=172.16.53.32 ttl=61 id=29573 icmp_seq=1953 rtt=99.3 ms

len=46 ip=172.16.53.32 ttl=61 id=29574 icmp_seq=1954 rtt=89.3 ms

len=46 ip=172.16.53.32 ttl=61 id=29575 icmp_seq=1955 rtt=79.2 ms

len=46 ip=172.16.53.32 ttl=61 id=29576 icmp_seq=1956 rtt=69.1 ms

len=46 ip=172.16.53.32 ttl=61 id=29577 icmp_seq=1957 rtt=59.1 ms

len=46 ip=172.16.53.32 ttl=61 id=29790 icmp_seq=2070 rtt=75.7 ms

len=46 ip=172.16.53.32 ttl=61 id=29791 icmp_seq=2071 rtt=65.6 ms

len=46 ip=172.16.53.32 ttl=61 id=29792 icmp_seq=2072 rtt=55.5 ms

As discussed previously, these ICMP packets are hashed to a single NIC RX queue and processed by a single CPU core. If we want to understand what the kernel is doing, it’s helpful to know where (cpu core) and how (softirq, ksoftirqd) it’s processing these packets so we can catch them in action.

Now it’s time to use the tools that allow live tracing of a running Linux kernel—bcc is what was used here. This allows you to write small C programs that hook arbitrary functions in the kernel, and buffer events back to a userspace Python program which can summarize and return them to you. The “hook arbitrary functions in the kernel” is the difficult part, but it actually goes out of its way to be as safe as possible to use, because it’s designed for tracing exactly this type of production issue that you can’t simply reproduce in a testing or dev environment.

The plan here is simple: we know the kernel is processing those ICMP ping packets, so let’s hook the kernel function icmp_echo which takes an incoming ICMP “echo request” packet and initiates sending the ICMP “echo response” reply. We can identify the packet using the incrementing icmp_seq shown by hping3 above.

The code for this bcc script looks complex, but breaking it down it’s not as scary as it sounds. The icmp_echo function is passed a struct sk_buff *skb, which is the packet containing the ICMP echo request. We can delve into this live and pull out the echo.sequence (which maps to the icmp_seq shown by hping3 above), and send that back to userspace. Conveniently, we can also grab the current process name/id as well. This gives us results like the following, live as the kernel processes these packets:

TGID    PID     PROCESS NAME    ICMP_SEQ
0       0       swapper/11      770
0       0       swapper/11      771
0       0       swapper/11      772
0       0       swapper/11      773
0       0       swapper/11      774
20041   20086   prometheus      775
0       0       swapper/11      776
0       0       swapper/11      777
0       0       swapper/11      778
4512    4542   spokes-report-s  779

One thing to note about this process name is that in a post-syscall softirq context, you see the process that made the syscall show as the “process”, even though really it’s the kernel processing it safely within the kernel context.

With that running, we can now correlate back from the stalled packets observed with hping3 to the process that’s handling it. A simple grep on that capture for the icmp_seq values with some context shows what happened before these packets were processed. The packets that line up with the above hping3 icmp_seq values have been marked along with the rtt’s we observed above (and what we’d have expected if <50ms rtt’s weren’t filtered out):

TGID    PID     PROCESS NAME    ICMP_SEQ ** RTT
--
10137   10436   cadvisor        1951
10137   10436   cadvisor        1952
76      76      ksoftirqd/11    1953 ** 99ms
76      76      ksoftirqd/11    1954 ** 89ms
76      76      ksoftirqd/11    1955 ** 79ms
76      76      ksoftirqd/11    1956 ** 69ms
76      76      ksoftirqd/11    1957 ** 59ms
76      76      ksoftirqd/11    1958 ** (49ms)
76      76      ksoftirqd/11    1959 ** (39ms)
76      76      ksoftirqd/11    1960 ** (29ms)
76      76      ksoftirqd/11    1961 ** (19ms)
76      76      ksoftirqd/11    1962 ** (9ms)
--
10137   10436   cadvisor        2068
10137   10436   cadvisor        2069
76      76      ksoftirqd/11    2070 ** 75ms
76      76      ksoftirqd/11    2071 ** 65ms
76      76      ksoftirqd/11    2072 ** 55ms
76      76      ksoftirqd/11    2073 ** (45ms)
76      76      ksoftirqd/11    2074 ** (35ms)
76      76      ksoftirqd/11    2075 ** (25ms)
76      76      ksoftirqd/11    2076 ** (15ms)
76      76      ksoftirqd/11    2077 ** (5ms)

The results tells us a few things. First, these packets are being processed by ksoftirqd/11 which conveniently tells us this particular pair of machines have their ICMP packets hashed to core 11 on the receiving side. We can also see that every time we see a stall, we always see some packets processed in cadvisor’s syscall softirq context, followed by ksoftirqd taking over and processing the backlog, exactly the number we’d expect to work through the backlog.

The fact that cadvisor is always running just prior to this immediately also implicates it in the problem. Ironically, cadvisor “analyzes resource usage and performance characteristics of running containers”, yet it’s triggering this performance problem. As with many things related to containers, it’s all relatively bleeding-edge tooling which can result in some somewhat expected corner cases of bad performance.

What is cadvisor doing to stall things?

With the understanding of how the stall can happen, the process causing it, and the CPU core it’s happening on, we now have a pretty good idea of what this looks like. For the kernel to hard block and not schedule ksoftirqd earlier, and given we see packets processed under cadvisor’s softirq context, it’s likely that cadvisor is running a slow syscall which ends with the rest of the packets being processed:

slow syscall causing stalled packet processing on NIC RX queue

That’s a theory but how do we validate this is actually happening? One thing we can do is trace what’s running on the CPU core throughout this process, catch the point where the packets are overflowing budget and processed by ksoftirqd, then look back a bit to see what was running on the CPU core. Think of it like taking an x-ray of the CPU every few milliseconds. It would look something like this:

tracing of cpu to catch bad syscall and preceding work

Conveniently, this is something that’s already mostly supported. The perf record tool samples a given CPU core at a certain frequency and can generate a call graph of the live system, including both userspace and the kernel. Taking that recording and manipulating it using a quick fork of a tool from Brendan Gregg’s FlameGraph that retained stack trace ordering, we can get a one-line stack trace for each 1ms sample, then get a sample of the 100ms before ksoftirqd is in the trace:

# record 999 times a second, or every 1ms with some offset so not to align exactly with timers
sudo perf record -C 11 -g -F 999
# take that recording and make a simpler stack trace.
sudo perf script 2>/dev/null | ./FlameGraph/stackcollapse-perf-ordered.pl | grep ksoftir -B 100

This results in the following:
(hundreds of traces that look similar)

cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_iter
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;ixgbe_poll;ixgbe_clean_rx_irq;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;bond_handle_frame;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;ipip_tunnel_xmit;ip_tunnel_xmit;iptunnel_xmit;ip_local_out;dst_output;__ip_local_out;nf_hook_slow;nf_iterate;nf_conntrack_in;generic_packet;ipt_do_table;set_match_v4;ip_set_test;hash_net4_kadt;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;hash_net4_test
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;gro_cell_poll;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;dev_queue_xmit_nit;packet_rcv;tpacket_rcv;sch_direct_xmit;validate_xmit_skb_list;validate_xmit_skb;netif_skb_features;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;__dev_queue_xmit;dev_hard_start_xmit;__bpf_prog_run;__bpf_prog_run

There’s a lot there, but looking through it you can see it’s the cadvisor-then-ksoftirqd pattern we saw from the ICMP tracer above. What does it mean?

Each line is a trace of the CPU at a point in time. Each call down the stack is separated by ; on that line. Looking at the middle of the lines we can see the syscall being called is read(): .... ;do_syscall_64;sys_read; ... So cadvisor is spending a lot of time in a read() syscall relating to mem_cgroup_* functions (the top of the call stack / end of line).

The call stack trace isn’t convenient to see what’s being read, so let’s use strace to see what cadvisor is doing and find 100ms-or-slower syscalls:

[email protected] ~ $ sudo strace -p 10137 -T -ff 2>&1 | egrep '<0\.[1-9]'
[pid 10436] <... futex resumed> )       = 0 <0.156784>
[pid 10432] <... futex resumed> )       = 0 <0.258285>
[pid 10137] <... futex resumed> )       = 0 <0.678382>
[pid 10384] <... futex resumed> )       = 0 <0.762328>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 658 <0.179438>
[pid 10384] <... futex resumed> )       = 0 <0.104614>
[pid 10436] <... futex resumed> )       = 0 <0.175936>
[pid 10436] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.228091>
[pid 10427] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.207334>
[pid 10411] <... epoll_ctl resumed> )   = 0 <0.118113>
[pid 10382] <... pselect6 resumed> )    = 0 (Timeout) <0.117717>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 660 <0.159891>
[pid 10417] <... futex resumed> )       = 0 <0.917495>
[pid 10436] <... futex resumed> )       = 0 <0.208172>
[pid 10417] <... futex resumed> )       = 0 <0.190763>
[pid 10417] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 576 <0.154442>

Sure enough, we see the slow read() calls. From the content being read and mem_cgroup context above, these read() calls are to a memory.stat file which shows the memory usage and limits of a cgroup (the resource isolation technology used by Docker). cadvisor is polling this file to get resource utilization details for the containers. Let’s see if it’s the kernel or cadvisor that’s doing something unexpected by attempting the read ourselves:

[email protected] ~ $ time cat /sys/fs/cgroup/memory/memory.stat >/dev/null

real    0m0.153s
user    0m0.000s
sys    0m0.152s
[email protected] ~ $

Since we can reproduce it, this indicates that it’s the kernel hitting a pathologically bad case.

What causes this read to be so slow

At this point it’s much more simple to find similar issues reported by others. As it turns out, this has been reported to cadvisor as an excessive CPU usage problem, it just hadn’t been observed that latency was also being introduced to the network stack randomly as well. In fact, some folks internally had noticed cadvisor was consuming more CPU than expected, but it didn’t seem to be causing an issue since our servers had plenty of CPU capacity, and so the CPU usage hadn’t yet been investigated.

The overview of the issue is that the memory cgroup is accounting for memory usage inside a namespace (container). When all processes in that cgroup exit, the memory cgroup is released by Docker. However, “memory” isn’t just process memory, and although processes memory usage itself is gone, it turns out the kernel also assigns cached content like dentries and inodes (directory and file metadata) that are cached to the memory cgroup. From that issue:

“zombie” cgroups: cgroups that have no processes and have been deleted but still have memory charged to them (in my case, from the dentry cache, but it could also be from page cache or tmpfs).

Rather than the kernel iterating over every page in the cache at cgroup release time, which could be very slow, they choose to wait for those pages to be reclaimed and then finally clean up the cgroup once all are reclaimed when memory is needed, lazily. In the meantime, the cgroup still needs to be counted during stats collection.

From a performance perspective, they are trading off time on a slow process by amortizing it over the reclamation of each page, opting to make the initial cleanup fast in return for leaving some cached memory around. That’s fine, when the kernel reclaims the last of the cached memory, the cgroup eventually gets cleaned up, so it’s not really a “leak”. Unfortunately the search that memory.stat performs, the way it’s implemented on the kernel version (4.9) we’re running on some servers, combined with the huge amount of memory on our servers, means it can take a significantly long time for the last of the cached data to be reclaimed and for the zombie cgroup to be cleaned up.

It turns out we had nodes that had such a large number of zombie cgroups that some had reads/stalls of over a second.

The workaround on that cadvisor issue, to immediately free the dentries/inodes cache systemwide, immediately stopped the read latency, and also the network latency stalls on the host, since the dropping of the cache included the cached pages in the “zombie” cgroups and so they were also freed. This isn’t a solution, but it does validate the cause of the issue.

As it turns out newer kernel releases (4.19+) have improved the performance of the memory.stat call and so this is no longer a problem after moving to that kernel. In the interim, we had existing tooling that was able to detect problems with nodes in our Kubernetes clusters and gracefully drain and reboot them, which we used to detect the cases of high enough latency that would cause issues, and treat them with a graceful reboot. This gave us breathing room while OS and kernel upgrades were rolled out to the remainder of the fleet.

Wrapping up

Since this problem manifested as NIC RX queues not being processed for hundreds of milliseconds, it was responsible for both high latency on short connections and latency observed mid-connection such as between MySQL query and response packets. Understanding and maintaining performance of our most foundational systems like Kubernetes is critical to the reliability and speed of all services that build on top of them. As we invest in and improve on this performance, every system we run benefits from those improvements.

The post Debugging network stalls on Kubernetes appeared first on The GitHub Blog.

How we implemented domain-driven development in Golang

Post Syndicated from Grab Tech original https://engineering.grab.com/domain-driven-development-in-golang

Partnerships have always been core to Grab’s super app strategy. We believe in collaborating with partners who are the best in what they do – combining their expertise with what we’re good at so that we can bring high-quality new services to our customers, at the same time create new opportunities for the merchant and driver-partners in our ecosystem.

That’s why we launched GrabPlatform last year. To make it easier for partners to either integrate Grab into their services, or integrate their services into Grab.

In view of that, part of the GrabPlatform’s team mission is to make it easy for partners to integrate with Grab services. These partners are external companies that would like to offer Grab’s services such as ride-booking through their own websites or applications. To do that, we decided to build a website that will serve as a one-stop-shop that would allow them to self-service these integrations.

The challenges we faced with the conventional approach

In the process of building this website, our team noticed that the majority of the functions and responsibilities were added to files without proper segregation. A single file would contain more than 500 lines of code. Each of these files were imported from different collections of source codes, resulting in an unstructured codebase. Any changes to the existing functions risked breaking existing functionality; we realized then that we needed to proactively plan for the future. Hence, we decided to use the principles of Domain-Driven Design (DDD) and idiomatic Go. This blog aims to demonstrate the process of how we leveraged those concepts to design a modern application.

How we implemented DDD in our codebase

Here’s how we went about solving our unstructured codebase using DDD principles.

Step 1: Gather domain (business) knowledge

We collaborated closely with our domain experts (in our case, this was our product team) to identify functionality and flow. From them, we discovered the following key points:

  • After creating a project, developers are added to the project.
  • The domain experts wanted an ability to add other products (e.g. Pricing service, ETA service, GrabPay service) to their projects.
  • They wanted the ability to create multiple authentication clients to access the above products.

Step 2: Break down domain knowledge into bounded context

Now that we had gathered the required domain knowledge (i.e. what our code needed to reflect to our partners), it was time to use the DDD strategic tool Bounded Context to break down problems into subcontexts. Here is a graphical representation of how we converted the problem into smaller units.

Bounded Context

We identified several dependencies on each of the units involved in the project. Take some of these examples:

  • The project domain overlapped with the product and developer domains.
  • Our RideBooking project can only exist if it has some products like Ridebooking APIs and not the other way around.

What this means is a product can exist independent of the project, but a project will have no significance without any product. In the same way, a project is dependent on the developers, but developers can exist whether or not they belong to a project.

Step 3: Identify value objects or entity (lowest layer)

Looking at the above bounded contexts, we figured out the building blocks (i.e. value objects or entity) to break down the above functionality and flow.

// ProjectDAO ...
type ProjectDAO struct {
  ID            int64
  UUID          string
  Status        ProjectStatus
  CreatedAt     time.Time
}

// DeveloperDAO ...
type DeveloperDAO struct {
  ID            int64
  UUID          string
  PhoneHash     *string
  Status        Status
  CreatedAt     time.Time
}

// ProductDAO ...
type ProductDAO struct {
  ID            int64
  UUID          string
  Name          string
  Description   *string
  Status        ProductStatus
  CreatedAt     time.Time
}

// DeveloperProjectDAO to map developer's to a project
type DeveloperProjectDAO struct {
  ID            int64
  DeveloperID   int64
  ProjectID     int64
  Status        DeveloperProjectStatus
}

// ProductProjectDAO to map product's to a project
type ProductProjectDAO struct {
  ID            int64
  ProjectID     int64
  ProductID     int64
  Status        ProjectProductStatus
}

All the objects shown above have ID as a field and can be identifiable, hence they are identified as entities and not as value objects. But if we apply domain knowledge, DeveloperProjectDAO and ProductProjectDAO are actually not independent entities. Project object is the aggregate root since it must exist before the child fields, DevProjectDAO and ProdcutProjectDAO, can exist.

Step 4: Create the repositories

As stated above, we created an interface to abstract the working logic of a particular domain (i.e. Repository). Here is an example of how we designed the repositories:

// ProductRepositoryImpl responsible for product functionality
type ProductRepositoryImpl struct {
  productDao storage.IProductDao // private field
}

type ProductRepository interface {
  GetProductsByIDs(ctx context.Context, ids []int64) ([]IProduct, error)
}

// DeveloperRepositoryImpl
type DeveloperRepositoryImpl struct {
  developerDAO storage.IDeveloperDao // private field
}

type DeveloperRepository interface {
  FindActiveAllowedByDeveloperIDs(ctx context.Context, developerIDs []interface{}) ([]*Developer, error)
  GetDeveloperDetailByProfile(ctx context.Context, developerProfile *appdto.DeveloperProfile) (IDeveloper, error)
}

Here is a look at how we designed our repository for aggregate root project:

// Unexported Struct
type productProjectRepositoryImpl struct {
  productProjectDAO storage.IProjectProductDao // private field
}

type ProductProjectRepository interface {
  GetAllProjectProductByProjectID(ctx context.Context, projectID int64) ([]*ProjectProduct, error)
}

// Unexported Struct
type developerProjectRepositoryImpl struct {
  developerProjectDAO storage.IDeveloperProjectDao // private field
}

type DeveloperProjectRepository interface {
  GetDevelopersByProjectIDs(ctx context.Context, projectIDs []interface{}) ([]*DeveloperProject, error)
  UpdateMappingWithRole(ctx context.Context, developer IDeveloper, project IProject, role string) (*DeveloperProject, error)
}

// Unexported Struct
type projectRepositoryImpl struct {
  projectDao storage.IProjectDao // private field
}

type ProjectRepository interface {
  GetProjectsByIDs(ctx context.Context, projectIDs []interface{}) ([]*Project, error)
  GetActiveProjectByUUID(ctx context.Context, uuid string) (IProject, error)
  GetProjectByUUID(ctx context.Context, uuid string) (*Project, error)
}

type ProjectAggregatorImpl struct {
  projectRepositoryImpl           // private field
  developerProjectRepositoryImpl  // private field
  productProjectRepositoryImpl    // private field
}

type ProjectAggregator interface {
  GetProjects(ctx context.Context) ([]*dto.Project, error)
  AddDeveloper(ctx context.Context, request *appdto.AddDeveloperRequest) (*appdto.AddDeveloperResponse, error)
  GetProjectWithProducts(ctx context.Context, uuid string) (IProject, error)
}

Step 5: Identify Domain Events

The functions described in Step 4 only returns the ID of the developer and product, which conveys no information to the users. In order to provide developer and product information, we use the domain-event technique to return the actual product and developer attributes.

A domain event is something that happened in a bounded context that you want another context of a domain to be aware of. For example, if there are new updates to the developer domain, it’s important to convey these updates to the project domain. This propagation technique is termed as domain event. Domain events enable independence between different classes.

One way to implement it is seen here:

// file: project\_aggregator.go
func (p *ProjectAggregatorImpl) GetProjects(ctx context.Context) ([]*dto.Project, error) {
  ....
  ....
  developers := p.EventHandler.Handle(DomainEvent.FindDeveloperByDeveloperIDs{DeveloperIDs})
  ....
}

// file: event\_type.go
type FindDeveloperByDeveloperIDs struct{ developerID []interface{} }

// file: event\_handler.go
func (e *EventHandler) Handle(event interface{}) interface{} {
  switch op := event.(type) {
      case FindDeveloperByDeveloperIDs:
            developers, _ := e.developerRepository.FindDeveloperByDeveloperIDs(op.developerIDs)
            return developers
      case ....
      ....
    }
}
Domain Event

Some common mistakes to avoid when implementing DDD in your codebase:

  • Not engaging with domain experts. Not interacting with domain experts is a common mistake when using DDD. Talking to domain experts to get an understanding of the problem domain from their perspective is at the core of DDD. Starting with schemas or data modelling instead of talking to domain experts may create code based on a relational model instead of it built around a domain model.
  • Ignoring the language of the domain experts. Creating a ubiquitous language shared with domain experts is also a core DDD practice. This common language must be used in all discussions as well as in the code, e.g. in class and method names.
  • Not identifying bounded contexts. A common approach to solving a complex problem is breaking it down into smaller parts. Creating bounded contexts is breaking down a large domain into smaller ones, each handling one cohesive part of the domain.
  • Using an anaemic domain model. This is a common sign that a team is not doing DDD and often a symptom of a failure in the modelling process. At first, an anaemic domain model often looks like a real domain model with correct names, but the classes lack functionalities. They contain only the Get and Set methods.

How the DDD model improved our software development

Thanks to this brand new clean up, we achieved the following:

  • Core functionalities are evenly distributed to the overall codebase and not limited to just a few files.
  • The developers are aware of what each folder is responsible for by simply looking at the file naming and folder structure.
  • The risk of breaking major functionalities by merely making small changes is greatly reduced. Changing a feature is now more efficient.

The team now finds the code well structured and we require less hand-holding for onboarders, thanks to the simplicity of the structure.

Finally, the most important thing, we now have a system oriented towards our business necessities. Everyone ends up using the same language and terms. Developers communicate better with the business team. The work is more efficient when it comes to establishing solutions for the models that reflect how the business operates, instead of how the software operates.

Lessons Learnt

  • Use DDD to collaborate among all project disciplines (product, business, partner, and so on) and clearly understand the business requirements.
  • Establish a ubiquitous language to discuss domain-related concepts.
  • Use bounded contexts to break down complex domains into manageable parts.
  • Implement a layered architecture (i.e. DDD building blocks) to focus on particular aspects of the application.
  • To simplify your dependency, use domain event to communicate with sub-bounded context.

Griffin, an anti-fraud risk rule engine making billions of predictions daily

Post Syndicated from Grab Tech original https://engineering.grab.com/griffin

Introduction

At Grab, the scale and fast-moving nature of our business means we need to be vigilant about potential risks to our customers and to our business. Some of the things we watch for include promotion abuse, or passenger safety on late-night ride allocations. To overcome these issues, the TIS (Trust/Identity/Safety) taskforce was formed with a group of AI developers dedicated to fraud detection and prevention.

The team’s mission is:

  • to keep fraudulent users away from our app or services
  • ensure our customers’ safety, and
  • Manage user identities to securely login to the Grab app.

The TIS team’s scope covers not just transport, but also our food, deliver and other Grab verticals.

How we prevented fraudulent transactions in the earlier days

In our early days when Grab was smaller, we used a rules-based approach to block potentially fraudulent transactions. Rules are like boolean conditions that determines if the result will be true or false. These rules were very effective in mitigating fraud risk, and we used to create them manually in the code.

We started with very simple rules. For example:

Rule 1:

 IF a credit card has been declined today

 THEN this card cannot be used for booking

To quickly incorporate rules in our app or service, we integrated them in our backend service code and deployed our service frequently to use the latest rules.

It worked really well in the beginning. Our logic was relatively simple, and only one developer managed the changes regularly. It was very lightweight to trigger the rule deployment and enforce the rules.

However, as the business rapidly expanded, we had to exponentially increase the rule complexity. For example, consider these two new rules:

Rule 2:

IF a credit card has been declined today but this passenger has good booking history

THEN we would still allow this booking to go through, but precharge X amount

Rule 3:

IF a credit card has been declined(but paid off) more than twice in the last 3-months

THEN we would still not allow this booking

The system scans through the rules, one by one, and if it determines that any rule is tripped it will check the other rules. In the example above, if a credit card has been declined more than twice in the last 3-months, the passenger will not be allowed to book even though he has a good booking history.

Though all rules follow a similar pattern, there are subtle differences in the logic and they enable different decisions. Maintaining these complex rules was getting harder and harder.

Now imagine we added more rules as shown in the example below. We first check if the device used by the passenger is a high-risk one. e.g using an emulator for booking. If not, we then check the payment method to evaluate the risk (e.g. any declined booking from the credit card), and then make a decision on whether this booking should be precharged or not. If passenger is using a low-risk  device but is in some risky location where we traditionally see a lot of fraud bookings, we would then run some further checks about the passenger booking history to decide if a pre-charge is also needed.

Now consider that instead of a single passenger, we have thousands of passengers. Each of these passengers can have a large number of rules for review. While not impossible to do, it can be difficult and time-consuming, and it gets exponentially more difficult the more rules you have to take into consideration. Time has to be spent carefully curating these rules.

Rules flow

The more rules you add to increase accuracy, the more difficult it becomes to take them all into consideration.

Our rules were getting 10X more complicated than the example shown above. Consequently, developers had to spend long hours understanding the logic of our rules, and also be very careful to avoid any interference with new rules.

In the beginning, we implemented rules through a three-step process:

  1. Data Scientists and Analysts dived deep into our transaction data, and discovered patterns.
  2. They abstracted these patterns and wrote rules in English (e.g. promotion based booking should be limited to 5 bookings and total finished bookings should be greater than 6, otherwise unallocate current ride)
  3. Developers implemented these rules and deployed the changes to production

Sometimes, the use of English between steps 2 and 3 caused inaccurate rule implementation (e.g. for “X should be limited to 5”, should the implementation be X < 5 or  X <= 5?)

Once a new rule is deployed, we monitored the performance of the rule. For example,

  • How often does the rule fire (after minutes, hours, or daily)?
  • Is it over-firing?
  • Does it conflict with other rules?

Based on implementation, each rule had dependency with other rules. For example, if Rule 1 is fired, we should not continue with Rule 2 and Rule 3.

As a result, we couldn’t  keep each rule evaluation independent.  We had no way to observe the performance of a rule with other rules interfering. Consider an example where we change Rule 1:

From IF a credit card has been declined today

To   IF a credit card has been declined this week

As Rules 2 and 3 depend on Rule 1, their trigger-rate would drop significantly. It means we would have unstable performance metrics for Rule 2 and Rule 3 even though the logic of Rule 2 and Rule 3 does not change. It is very hard for a rule owner to monitor the performance of Rules 2 and Rule 3.

When it comes to the of A/B testing of a new rule, Data Scientists need to put a lot of effort into cleaning up noise from other rules, but most of the time, it is mission-impossible.

After several misfiring events (wrong implementation of rules) and ever longer rule development time (weekly), we realized “No one can handle this manually.“

Birth of Griffin Rule Engine

We decided to take a step back, sit down and closely review our daily patterns. We realized that our daily patterns fall into two categories:

  1. Fetching new data:  e.g. “what is the credit card risk score”, or “how many food bookings has this user ordered in last 7 days”, and transform this data for easier consumption.
  2. Updating/creating rules: e.g. if a credit card risk score is high, decline a booking.

These two categories are essentially divided into two independent components:

  1. Data orchestration – collecting/transforming the data from different data sources.
  2. Rule-based prediction

Based on these findings, we got started with our Data Orchestrator (open sourced at https://github.com/grab/symphony) and Griffin projects.

The intent of Griffin is to provide data scientists and analysts with a way to add new rules to monitor, prevent, and detect fraud across Grab.

Griffin allows technical novices to apply their fraud expertise to add very complex rules that can automate the review of rules without manual intervention.

Griffin  now predicts billions of events every day with 100K+ Queries per second(QPS) at peak time (on only 6 regular EC2s).

Data scientists and analysts can self-service rule changes on the web portal directly, deploy rules with just a few clicks, experiment and monitor performance in real time.

Why we came up with Griffin instead of using third-party tools in the market

Before we decided to create our in-built tool, we did some research for common business rule engines available in the market such as Drools and checked if we should use them. In that process, we found:

  1. Drools has its own Java-based DSL with a non-trivial learning curve (whereas our major users are from Python background).
  2. Limited [expressive power](https://en.wikipedia.org/wiki/Expressive_power_(computer_science),
  3. Limited support for some common math functions (e.g. factorial/ Greatest Common Divisor).
  4. Our nature of business needed dynamic dataset for predictions (for example, a rule may need only passenger booking history on Day 1, but it may use passenger booking history, passenger credit balance, and passenger favorite places on Day 2). On the other hand, Drools usually works well with a static list of dataset instead of dynamic dataset.

Given the above constraints, we decided to build our own rule engine which can better fit our needs.

Griffin Architecture

The diagram depicts the high-level flow of making a prediction through Griffin.

High-level flow of making a prediction through Griffin

Components

  • Data Orchestration: a service that collects all data needed for predictions
  • Rule Engine: a service that makes prediction based on rules
  • Rule Editor: the portal through which users can create/update rules

Workflow

  1. Users create/update rules in the Rule Editor web portal, and save the rules in the database.
  2. Griffin Rule Engine reloads rules immediately as long as it detects any rule changes.
  3. Data Orchestrator sends all dataset (features) needed for a prediction (e.g. whether to block a ride based on passenger past ride pattern, credit card risk) to the Rule Engine
  4. Griffin Rule Engine makes a prediction.

How you can create rules using Griffin

In an abstract view, a rule inside Griffin is defined as:

Rule:

Input:JSON => Result:Boolean

We allow users (analysts, data scientists) to write Python-based rules on WebUI to accommodate some very complicated rules like:

len(list(filter(lambdax: x \>7, (map(lambdax: math.factorial(x), \[1,2,3,4,5,6\]))))) \>2

This significantly optimizes the expressive power of rules.

To match and evaluate a rule more efficiently, we also have other key components associated:

Scenarios

  • Here are some examples: PreBooking, PostBookingCompletion, PostFoodDelivery

Actions

  • Actions such as NotAllowBooking, AuthCapture, SendNotification
  • If a rule result is True, it returns a list of treatments as selected by users, e.g. AuthCapture and SendNotification (the example below is treatments for one Safety-related rule).The one below is for a checkpoint to detect credit-card risk.
Treatments: AuthCapture
  • Each checkpoint has a default treatment. If no rule inside this checkpoint is hit, the rule engine would return the default one (in most cases, it is just “do nothing”).
  • A treatment can only belong to one checkpoint, but one checkpoint can have multiple treatments.

For example, the graph below demonstrates a checkpoint PaxPreRide associated with three treatments: Pass, Decline, Hold

Treatments: Adding

Segments

  • The scope/dimension of a rule. Based on the sample segments below, a rule can be applied only to countries=\[MY,PH\] and verticals=\[GrabBus, GrabCar\]
  • It can be changed at any time on WebUI as well.
Segments

Values of a rule

 

When a rule is hit, more than just treatments, users also want some dynamic values returned. E.g. a max distance of the ride allowed if we believe this booking is medium risk.

Does Python make Griffin run slow?

We picked Python to enjoy its great expressive power and neatness of syntax, but some people ask: Python is slow, would this cause a latency bottleneck?

Our answer is No.

The below graph shows the Latency P99 of Prediction Request from load balancer side(actually the real latency for each prediction is < 6ms, the metrics are peaked at 30ms because some batch requests contain 50 predictions in a single call)

Prediction Request Latency P99

What we did to achieve this?

  • The key idea is to make all computations in CPU and memory only (in other words, no extra I/O).
  • We do not fetch the rules from database for each prediction. Instead, we keep a record called dirty_key, which keeps the latest rule update timestamp. The rule engine would actively check this timestamp and trigger a rule reload only when the dirty_key timestamp in the DB is newer than the latest rule reload time.
  • Rule engine would not fetch any additional new data, instead, all data should be from Data Orchestrator.
  • So the whole prediction flow is only between CPU & memory (and if the data size is small, it could be on CPU cache only).
  • Python GIL essentially enforces a process to have up to one active thread running at a time, no matter how many cores a CPU has. We have Gunicorn to wrap our service, so on the Production machine, we have (2x$num_cores) + 1 processes (see http://docs.gunicorn.org/en/latest/design.html#how-many-workers). The formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.

The below screenshot is the process snapshot on C5.large machine with 2 vCPU. Note only green processes are active.

Process snapshot on C5.large machine

A lot of trial and error performance tuning:

  • We used to have python-jsonpath-rw for JSONPath query, but the performance was not strong enough. We switched to jmespath and observed about 10ms latency reduction.
  • We use sqlalchemy for DB Query and ORM. We enabled cache for some use cases, but turned out it was over-optimized with stale data. We ended up turning off some caching points to ensure the data consistency.
  • For new dict/list creation, we prefer native call (e.g. {}/[]) instead of function call (see the comparison below).
Native call and Function call
  • Use built-in functions https://docs.python.org/3/library/functions.html. It is written in C, no one can beat it.
  • Add randomness to rule reload so that not all machines run at the same time causing latency spikes.
  • Caching atomic feature units as they are used so that we don’t have to requery for them each time a checkpoint uses it.

How Griffin makes on-call engineers relax

One of the most popular aspects of Griffin is the WebUI. It opens a door for non-developers to make production changes in real time which significantly boosts organisation productivity. In the past a rule change needed 1 week for code change/test/deployment, now it is just 1 minute.

But this also introduces extra risks. Anyone can turn the whole checkpoint down, whether unintentionally or maliciously.

Hence we implemented Shadow Mode and Percentage-based rollout for each rule. Users can put a rule into Shadow Mode to verify the performance without any production impact, and if needed, rollout of a rule can be from 1% all the way to 100%.

We implemented version control for every rule change, and in case anything unexpected happened, we could rollback to the previous version quickly.

Version control
Rollback button

We also built RBAC-based permission system, along with Change Approval flow to make sure any prod change needs at least two people(and approver role has higher permission)

Closing thoughts

Griffin evolved from a fraud-based rule engine to generic rule engine. It can apply to any rule at Grab. For example, Grab just launched Appeal automation several days ago to reduce 50% of the  human effort it typically takes to review straightforward appeals from our passengers and drivers. It was an unplanned use case, but we are so excited about this.

This could happen because from the very beginning we designed Griffin with minimized business context, so that it can be generic enough.

After the launch of this, we observed an amazing adoption rate for various fraud/safety/identity use cases. More interestingly, people now treat Griffin as an automation point for various integration points.

Using Grab’s Trust Counter Service to Detect Fraud Successfully

Post Syndicated from Grab Tech original https://engineering.grab.com/using-grabs-trust-counter-service-to-detect-fraud-successfully

Background

Fraud is not a new phenomenon, but with the rise of the digital economy it has taken different and aggressive forms. Over the last decade, novel ways to exploit technology have appeared, and as a result, millions of people have been impacted and millions of dollars in revenue have been lost. According to ACFE survey, companies lost USD6.3 billion due to fraud. Organizations lose 5% of its revenue annually due to fraud.

In this blog, we take a closer look at how we developed an anti-fraud solution using the Counter service, which can be an indispensable tool in the highly complex world of fraud detection.

Anti-fraud solution using counters

At Grab, we detect fraud by deploying data science, analytics, and engineering tools to search for anomalous and suspicious transactions, or to identify high-risk individuals who are likely to commit fraud. Grab’s Trust Platform team provides a common anti-fraud solution across a variety of business verticals, such as transportation, payment, food, and safety. The team builds tools for managing data feeds, creates SDK for engineering integration, and builds rules engines and consoles for fraud detection.

One example of fraudulent behavior could be that of an individual who masquerades as both driver and passenger, and makes cashless payments to get promotions, for example, earn a one dollar rebate in the next transaction.In our system, we analyze real time booking and payment signals, compare it with the historical data of the driver and passenger pair, and create rules using the rule engine. We count the number of driver and passenger pairs at a given time frame. This counter is provided as an input to the rule.If the counter value exceeds a predefined threshold value, the rule evaluates it as a fraud transaction. We send this verdict back to the booking service.

The conventional method

Fraud detection is a job that requires cross-functional teams like data scientists, data analysts, data engineers, and backend engineers to work together. Usually data scientists or data analysts come up with an offline idea and apply it to real-time traffic. For example, a rule gets invented after brainstorming sessions by data scientists and data analysts. In the conventional method, the rule needs to be communicated to engineers.

Automated solution using the Counter service

To overcome the challenges in the conventional method, the Trust platform team decided to come out with the Counter service, a self-service platform, which provides management tools for users, and a computing engine for integrating with the backend services. This service provides an interface, such as a UI based rule editor and data feed, so that analysts can experiment and create rules without interacting with engineers. The platform team also decided to provide different data contracts, APIs, and SDKs to engineers so that the business verticals can use it quickly and easily.

The major engineering challenges faced in designing the Counter service

There are millions of transactions happening at Grab every day, which implies we needed to perform billions of fraud and safety detections. As seen from the example shared earlier, most predictions require a group of counters. In the above use case, we need to know how many counts of the cashless payment happened for a driver and passenger pair. Due to the scale of Grab’s business, the potential combinations of drivers and passengers could be exponential. However, this is only one use case. So imagine that there could be hundreds of counters for different use cases. Hence it’s important that we provide a platform for stakeholders to manage counters.

Some of the common challenges we faced were:

Scalability

As mentioned above, we could potentially have an exponential number of passengers and drivers in a single counter. So it’s a great challenge to store the counters in the database, read, and query them in real-time. When there are billions of counter keys across a long period of time, the Trust team had to find a scalable way to write and fetch keys effectively and meet the client’s SLA.

Self-serving

A counter is usually invented by data scientists or analysts and used by engineers. For example, every time a new type of counter is needed from data scientists, developers need to manually make code changes, such as adding a new stream, capturing related data sets for the counter, and storing it on the fraud service, then doing a deployment to make the counters ready. It usually takes two or more weeks for the whole iteration, and if there are any changes from the data analysts’ side, which happens often, the situation loops again. The team had to come up with a solution to prevent the long loop of manual tasks by coming out with a self-serving interface.

Manageable and extendable

Due to a lack of connection between real-time and offline data, data analysts and data scientists did not have a clear picture of what is written in the counters. That’s because the conventional counter data were stored in Redis database to satisfy the query SLA. They could not track the correctness of counter value, or its history. With the new solution, the stakeholders can get a real-time picture of what is stored in the counters using the data engineering tools.

The Machine Learning challenges solved by the Counter service

The Counter service plays an important role in our Machine Learning (ML) workflow.

Data Consistency Challenge/Issue

Most of the machine learning workflows need dedicated input data. However, when there is an anti-fraud model that is trained using offline data from the data lake, it is difficult to use the same model in real-time. This is because the model lacks the data contract and the consistency with the data source. In this case, the Counter service becomes a type of data source by providing the value of counters to file system.

ML featuring

Counters are important features for the ML models. Imagine there is a new invention of counters, which data scientists need to evaluate. We need to provide a historical data set for counters to work. The Counter service provides a counter replay feature, which allows data scientists to simulate the counters via historical payload.

In general, the Counter service is a bridge between online and offline datasets, data scientists, and engineers. There was technical debt with regards to data consistency and automation on the ML pipeline, and the Counter service closed this loop.

How we designed the Counter service

We followed the principle of asynchronized data ingestion, and synchronized transaction for designing the Counter service.

The diagram shows how the counters are generated and saved to database.

How the counters are generated and saved to the database

Counter creation workflow

  1. User opens the Counter Creation UI and creates a new key “fraud:counter:counter_name”.
  2. Configures required fields.
  3. The Counter service monitors the new counter-creation, puts a new counter into load script storage, and starts processing new counter events (see Counter Write below).

Counter write workflow

  1. The Counter service monitors multiple streams, assembles extra data from online data services (i.e. Common Data Service (CDS), passenger service, hydra service, etc), so that rich dataset would also be available for editors on each stream resource.
  2. The Counter Processor evaluates the user-configured expression and writes the evaluated values to the dedicated Grab-Stats stream using the GrabPlugin tool.

Counter read workflow

Counter read workflow

We use Grab-Stats as our storage service. Basically Grab-Stats runs above ScyllaDB, which is a distributed NoSQL data store. We use ScyllaDB because of its good performance on aggregation in memory to deal with the time series dataset. In comparison with in-memory storage like AWS elasticCache, it is 10 times cheaper and as reliable as AWS in terms of stability. The p99 of reading from ScyllaDB is less than 150ms which satisfies our SLA.

How we improved the Counter service performance

We used the multi-buckets strategy to improve the Counter service performance.

Background

There are different time windows when you perform a query. Some counters are time sensitive so that it needs to know what happened in the last 30 or 60 minutes. Some other counters focus on the long term and need to know the events in the last 30 or 90 days.

From a transactional database perspective, it’s not possible to serve small range as well as long term events at the same time. This is because the more the need for the accuracy of the data and the longer the time range, the more aggregations need to happen on database. Which means we would not be able to satisfy the SLA. Otherwise we will need to block other process which leads to the service downgrade.

Solution for improving the query

We resolved this problem by using different granularities of the tables. We pre-aggregated the signals into different time buckets, such as 15min, 1 hour, and 1 day.

When a request comes in, the time-range of the request will be divided by the buckets, and the results are conquered. For example, if there is a request for 9/10 23:15:20 to 9/12 17:20:18, the handler will query 15min buckets within the hour.  It will query for hourly buckets for the same day. And it will query the daily buckets for the rest of 2 days. This way, we avoid doing heavy aggregations, but still keep the accuracy in 15 minutes level in a scalable response time.

Counter service UI

We allowed data analysts and data scientists to onboard counters by themselves, from a dedicated web portal. After the counter is submitted, the Counter service takes care of the integration and parsing the logic at runtime.

Counter service UI

Backend integration

We provide SDK for quicker and better integration. The engineers only need to provide the counter identifier ID (which is shown in the UI) and the time duration in the query. Under the hood we provide a GRPC protocol to communicate across services. We divide the query time window to smaller granularities, fetching from different time series tables and then conquering the result. We are also providing a short TTL cache layer to take the uncommon traffic from client such as network retry or traffic throttle. Our QPS are designed to target 100K.

Monitoring the Counter service

The Counter service dashboard helps to track the human errors while editing the counters in real-time. The Counter service sends alerts to slack channel to notify users if there is any error.

Counter service dashboard

We setup Datadog for monitoring multiple system metrics. The figure below shows a portion of stream processing and counter writing. In the example below, the total stream QPS would reach 5k at peak hour, and the total counter saved to storage tier is about 4k. It will keep climbing without an upper limit, when more counters are onboarded.

Counter service dashboard with multiple metrics

The Counter service UI portal also helps users to fetch real-time counter results for verification purposes.

Counter service UI

Future plans

Here’s what we plan to do in the near future to improve the Counter service.

Close the ML workflow loop

As mentioned above, we plan to send the resource payload of the Counter service to the offline data lake, in order to complete the counter replay function for data scientists. We are working on the project called “time traveler”. As the name indicates, it is used not only for serving the online transactional data, but also supports historical data analytics, and provides more flexibility on counter inventions and experiments.

There are more automation steps we plan to do, such as adding a replay button on the web portal, and hooking up with the offline big data engine to trigger the analytics jobs. The performance metrics will be collected and displayed on the web portal. A single platform would be able to manage both the online and offline data.

Integration to Griffin

Griffin is our rule engine. Counters are sometimes an input to a particular rule, and one rule usually needs many counters to work together. We need to provide a better integration to Griffin on backend. We plan to minimize the current engineering effort when using counters on Griffin. A counter then becomes an automated input variable on Griffin, which can be configured on the web portal by any users.

Getting started with Git and GitHub is easier than ever with GitHub Desktop 2.2

Post Syndicated from Amanda Pinsker original https://github.blog/2019-10-02-get-started-easier-with-github-desktop-2-2/

Anyone who uses Git knows that it has a steep learning curve. We’ve learned from developers that most people tend to learn from a buddy, whether that’s a coworker, a professor, a friend, or even a YouTube video. In GitHub Desktop 2.2, we’re releasing the first version of an interactive Git and GitHub tutorial that can be your buddy and help you get started. If you’re new to Desktop, you can download and try out the tutorial at desktop.github.com.

Get set up

To get set up, we help you through two major pieces: creating a repository and connecting an editor. When you first open Desktop, a welcome page appears with a new option to “Create a Tutorial Repository”. Starting with this option creates a tutorial repository that guides you through the core concepts of working with Git using GitHub Desktop.

There are a lot of tools you need to get started with Git and GitHub. The most important of these is your code editor. In the first step of the tutorial, you’re prompted to install an editor if you don’t have one already.

Learn the GitHub flow

Next, we guide you through how to use GitHub Desktop to make changes to code locally and get your work on GitHub. You’ll create a new branch, make a change to a file, commit it, push it to GitHub, and open your first pull request.

We’ve also heard that new users initially experience confusion between Git, GitHub, and GitHub Desktop. We cover these differences in the tutorial and make sure to reinforce the explanations.

Keep going with your own project

In GitHub Desktop 1.6, we introduced suggested next steps based on the state of your repository. Now when you complete the tutorial, we similarly suggest next steps: exploring projects on GitHub that you might want to contribute to, creating a new project, or adding an existing project to Desktop. We always want GitHub Desktop to be the tool that makes your next steps clear, whether you’re in the flow of your work, or you’re a new developer just getting started.

What’s next?

With GitHub Desktop 2.2, we’re making the product our users love more approachable to newcomers. We’ll be iterating on the tutorial based on your feedback, and we’ll continue to build on the connection between GitHub and your local machine. If you want to start building something but don’t know how, think of GitHub Desktop as your buddy to help you get started.

Learn more about GitHub Desktop

The post Getting started with Git and GitHub is easier than ever with GitHub Desktop 2.2 appeared first on The GitHub Blog.

New workflow editor for GitHub Actions

Post Syndicated from Chris Patterson original https://github.blog/2019-10-01-new-workflow-editor-for-github-actions/

It’s now even easier to create and edit a GitHub Actions workflow with the updated editor. We’ve provided inline auto-complete and lint as you type so you can say goodbye to YAML indentation issues and explore the full workflow syntax without going to the docs.

Get through code faster with auto-complete

Auto-complete can be triggered with Ctrl+Space almost anywhere, and in some cases, automatically. It suggests keys or values depending on the current position of the cursor, and displays brief contextual documentation so you can discover and understand all the available options without losing the focus. Auto-complete works even inside expressions.

Explore context options without losing focus

Additionally, the editor will insert a new line with the right indentation when you press enter. It will also either suggest the next key to insert or insert code snippets that you can navigate through using the tab key.

Snippets also work when using functions inside expressions, allowing you to easily write and navigate through the required arguments.

Minimize mistakes

There are many updates to help you write workflow files, but if you make a mistake, we’ve got you covered too.

The editor now highlights structural errors in your file, unexpected values, or even conflicting values, such as an invalid shell value for the chosen operating system.

Understand cron expressions quickly

Finally, the editor will also help you edit your scheduled jobs by describing the cron expressions you have used with natural language:

Tell us what you think

These are just a few features we’ve made to help you edit workflows with fewer errors. We’d love to hear any feedback you have—share your questions and comments with us in the GitHub Actions Community Forum.

Learn more about GitHub Actions

The post New workflow editor for GitHub Actions appeared first on The GitHub Blog.

Introducing the CodeSearchNet challenge

Post Syndicated from Janey Jack original https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/

Searching for code to reuse, call into, or to see how others handle a problem is one of the most common tasks in a software developer’s day. However, search engines for code are often frustrating and never fully understand what we want, unlike regular web search engines. We started using modern machine learning techniques to improve code search but quickly realized that we were unable to measure our progress. Unlike natural language processing with GLUE benchmarks, there is no standard dataset suitable for code search evaluation.

With our partners from Weights & Biases, today we’re announcing the CodeSearchNet Challenge evaluation environment and leaderboard. We’re also releasing a large dataset to help data scientists build models for this task, as well as several baseline models showing the current state of the art. Our leaderboard uses an annotated dataset of queries to evaluate the quality of code search tools.

Learn more from our technical report 

The CodeSearchNet Corpus and models

We collected a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. We used our TreeSitter infrastructure for this effort, and we’re also releasing our data preprocessing pipeline for others to use as a starting point in applying machine learning to code. While this data is not directly related to code search, its pairing of code with related natural language description is suitable to train models for this task. Its substantial size also makes it possible to apply high-capacity models based on modern Transformer architectures.

Our fully preprocessed CodeSearchNet Corpus is available for download on Amazon S3, including:

  • Six million methods overall
  • Two million of which have associated documentation (docstrings, JavaDoc, and more)
  • Metadata that indicates the original location (repository or line number, for example) where the data was found

Building on our earlier efforts in semantic code search, we’re also releasing a collection of baseline models leveraging modern techniques in learning from sequences (including a BERT-like self-attentional model) to help data scientists get started on code search. 

The CodeSearchNet Challenge

To evaluate code search models, we collected an initial set of code search queries and had programmers annotate the relevance of potential results. We started by collecting common search queries from Bing that had high click-through rates to code and combined these with queries from StaQC, yielding 99 queries for concepts related to code (i.e., we removed everything that was just an API documentation lookup).

We then used a standard Elasticsearch installation and our baseline models to obtain 10 likely results per query from our CodeSearchNet Corpus. Finally, we asked programmers, data scientists, and machine learning researchers to annotate the proposed results for relevance to the query on a scale from zero (“totally irrelevant”) to three (“exact match”). See our technical report for an in-depth explanation of the annotation process and data.

We want to expand our evaluation dataset to include more languages, queries, and annotations in the future. As we continue adding more over the next few months, we aim to include an extended dataset for the next version of CodeSearchNet Challenge in the future.

Other use cases

We anticipate other use cases for this dataset beyond code search and are presenting code search as one possible task that leverages learned representations of natural language and code. We’re excited to see what the community builds next.

Special thanks

The CodeSearchNet Challenge would not be possible without the Microsoft Research Team and core contributors from GitHub, including Marc Brockschmidt, Miltos Allamanis, Ho-Hsiang Wu, Hamel Husain, and Tiferet Gazit.

We’re also thankful for all of the contributors from the community who helped put this project together:

@nbardy, @raubitsj, @staceysv, @cvphelps, @tejaskannan, @s-zanella, @AntonioND, @goutham7r, @campoy, @cal58, @febuiles, @letmaik, @sebastiandziadzio, @panthap2, @CoderPat.


Learn more about the CodeSearchNet Challenge

The post Introducing the CodeSearchNet challenge appeared first on The GitHub Blog.

About being a Principal Engineer at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/about-being-a-principal-engineer-at-grab

Over the past few years Grab has grown from a small startup to one of the largest technology companies in South-East Asia. Along with the company’s growth, the number of microservices, features and teams also grew substantially. At the time of writing this blog, we have around 350 microservices powering our super-app.

A great engineering team is a critical component of our success. As an engineer you have two career paths in front of you: an individual contributor role, or a management role. While a management role is generally better understood, this article clarifies what it means to be a principal engineer at Grab, which is one of the highest levels of our engineering career ladder.

Engineering Career Ladder

Improving the Quality

“You set the standard for engineering excellence in your technical family. Your architectures are exemplary in terms of efficiency, stability, extensibility, testability and the ability to evolve over time. Your software is robust in the presence of failures, scalable, and cost-effective. Your coding practices are exemplary in terms of code organization, clarity, simplicity, error handling, and documentation. You tackle intrinsically hard problems, acquiring expertise as needed. You decompose complex problems into straightforward solutions.” – Grab’s Engineering Career Ladder

 

So, what does a principal engineer do? As your career progresses from junior to senior to lead engineer we have more and more responsibilities; you manage larger and larger systems. For example, junior engineer might manage a specific component of a micro-service. A senior engineer would be tasked with designing and operating an entire micro-service or product. While a lead engineer would typically be concerned with the architecture at a team level.

Principal engineer level is akin to a senior manager where instead of indirectly managing people (manager of managers) you take care of the architecture of an entire sub-organisation, known as Tech Family/Platform. These Tech Families usually have more than 50 engineers spread across multiple teams and function as a tiny company with their own business owners, designers, product managers, etc.

Challenging Projects

“You take engineering ownership of projects that may require the work of several teams to implement; you divide responsibilities so that each team can work independently and have the system come together into an integrated whole. Your projects often cross team, tech family, platform, and even R&D center boundaries. You solicit differing views and keep an open mind. You are adept at building consensus.” – Grab’s Engineering Career Ladder

 

As a principal engineer, your job is to solve larger problems and translate somewhat vague problems into a set of actionable items. You might be faced with a large problem such as “improve efficiency and interoperability of Grab’s transportation system.” You will need to understand the problem, the business impact and see how can it be improved. It might require you to design new systems, change existing systems, understand the costs involved and get the right people together to make it happen.

Solving such a problem all by yourself is pretty much impossible. You have to work with other managers and other engineers together as a team to make it happen. Help your lead/senior engineers to design the right system by giving them a clear objective but let them take care of the system-level architecture.  

You will also need to work with managers, advise them to get things done, and get the right things prioritised by the team. While you don’t need to be well-versed in project management and agile methodologies, you do need to be able to plan ahead with your teams and have an understanding of how much time a project or migration will take.

A Tech Family can easily have 20 or more micro-services. You need to have a good understanding of their functional requirements and interactions. This is challenging as learning new things is always “uncomfortable” and takes time. You must reach out to engineers, product managers, and data scientists, ideally face-to-face to build empathy. Keep asking questions and try to understand how things work. You will also need to read the existing documentation and their code.

Technical Ownership

“You are the origin of significant technical contributions to our architecture and infrastructure.  You take technical ownership of the design and quality of the security, performance, availability, and operational aspects of the software built by one or more teams. You identify where your time is needed, transitioning between coding, design, and architecture based on project and team needs. You deliver software in ways that empower teams to self-service, providing clear adoption/migration paths.” – Grab’s Engineering Career Ladder

 

As a principal engineer you work together with the Head of Engineering and managers within the Tech Family and improve the quality of systems across the board. Typically, no-one tells you what needs to be done. You need to identify gaps, raise them and keep improving the systems.

You also need to learn how to manage your own time better so you can prioritise effectively. This boils down to knowing your strengths, your weaknesses. For example, if you are really good in building distributed systems but have no clue about the latest-and-greatest design in information security, get the right InfoSec engineers in this meeting and consider skipping it yourself. Avoid trying to do everything at once and be in every single meeting you get invited – you still have to review code, design and focus, so plan accordingly.

You will also need to understand the business impact of your decisions. For example, if you contribute to product features, know how impactful this feature is going to be to the organisation. If you don’t know it – ask the Product Manager responsible for it. If you work on a platform feature, for example improving the build system, know how it will help: saving 30 minutes of build time for every engineer each day is a huge achievement.

More often than not, you will have to drive migrations, this is akin to code refactoring but on a system-level and will involve a lot of collaboration with the people. Understand what a technical debt is and how it can be mitigated – a good architecture minimises technical debt and in turn accelerates time-to-market and helps business flourish.

Technical Leadership

“You amplify your impact by leading design reviews for complex software and/or critical features. You probe assumptions, illuminate pitfalls, and foster shared understanding. You align teams toward coherent architectural strategies.” – Grab’s Engineering Career Ladder

 

In Grab we have a process known as RFC (Request For Comments) which allows engineers to submit designs and ideas for a larger audience to debate. This is especially important given that our organisation is spread across several continents with research and development offices in Southeast Asia, the US, India and China. While any engineer is welcome to comment on these RFCs, it is a duty of lead and principal engineers’ to review them on a regular basis. This will help you to expand your knowledge of existing systems and help others with improving their designs.

Communication is a key skill that you need to keep improving and it is often the Achilles’ heel of many engineers who would rather be doing work in their corner without talking to anyone else. This is perfectly fine for a junior (or even some senior engineers) but it is critical for a principal engineer to communicate. Let’s break this down to a set of specific skills that you’d need to sharpen.

You need to be able to write effectively in order to convey your ideas to others. This includes knowing your audience and wording it in such a way that readers can understand. A technical design document whose audience are engineers is not written the same way as a design proposal whose audience are product and business managers.

You need to be able to publicly present and talk about various projects that you are working on. This includes creation of slide decks with good visuals and distilling down months of work to just a couple of slides. The best way of learning this is to get out there and keep presenting your work – you will get better over time.

You also need to be able to drive meetings and discussions without wasting anyone’s time. As a technical leader, one of your key responsibilities is to get people moving in the same direction and driving consensus during meetings.

Teaching and Learning

“You educate other engineers, both at an individual level and at scale: keeping the engineering community up to date on advanced technical issues, technologies, and trends. Examples include onboarding bootcamps for new hires, interns, specific skill-gap training development, and sharing specialized knowledge to raise the technical bar for other engineers/teams/dev centers.”

 

A principal engineer is a technical leader and as a leader you have the responsibility to mentor, coach fellow engineers, regardless of their level. In addition to code-reviews, you can organise office hours in your team and knowledge sharing sessions where everyone could present something. You could also help with bootcamps and help new hires in getting up-to-speed.

Most importantly, you will also need to keep learning whichever way works for you – reading journals and papers, blog posts, watching video-recorded talks, attending conferences and browsing through a variety of open-source projects. You will also learn from other Grabbers as even a junior engineer can teach you something, we all have our strengths and weaknesses. Keep improving and working on yourself!

Track your work easily with the latest changes to project boards

Post Syndicated from Lauren Brose original https://github.blog/2019-09-25-project-board-improvements/

We want to make it easy for you to track your work with GitHub Projects, but noticed a few issues get in the way of a seamless experience. Project automation helps keep your issues in sync with your development process but created a problem with your column prioritization by placing moved issues at the top of the list. We also updated the issue sidebar to help show you where an item was in your project board flow but required you to navigate away to make any changes.

Issue sidebar updates

The projects section of the issue sidebar was updated to better convey related project information. Closed projects are now collapsed in the sidebar by default so you can focus on the most relevant information. Additionally, you can change the issue’s project column directly from the issue page without needing to navigate away.

Automation placement

Now, any issues moved via automation or the inline sidebar will move to the bottom of the destination column. However, any “Done” automation triggers will continue to move issues to the top of the column so that your team can always see what was recently finished.

Let us know what you think

We hope this change makes your workflow smoother, but we can always make changes to improve. Let us know how the issue sidebar changes are working for you, and if there’s anything we can do to help.

Learn more about project boards

The post Track your work easily with the latest changes to project boards appeared first on The GitHub Blog.

Accelerating the GitHub Sponsors beta with Stripe Connect

Post Syndicated from Katie Delfin original https://github.blog/2019-09-10-accelerating-the-github-sponsors-beta/

Update: GitHub Sponsors is available in every country where GitHub does business, not just the 30 countries supported by Stripe Connect. We’ll continue to use our existing manual payout system for anyone outside of Connect’s list of currently supported countries.


Back in May, we announced GitHub Sponsors, a new way to support the developers who build and maintain the open source software you use every day. We launched in beta as early as possible so we could work closely with the community to address feedback and understand their needs before adding more developers to the program. Today, we’re taking the next step in accelerating the availability of GitHub Sponsors through a new streamlined onboarding and payment experience with Stripe Connect.

The early days of manual payments

Within hours of announcing GitHub Sponsors, thousands of people signed up for the waitlist from all over the world. It was awesome to see so much excitement for the program, but we knew this meant we had a lot of work ahead of us. 

We started the beta with a small group of sponsored developers, and every few weeks, we invited more into the program. Everything was a manual process—setting up account information, verifying identity, waiting for approval, running reports, and processing payouts across multiple departments. Due to the manual nature of the process, we were limited to a small number of people in the program until we could automate and streamline our operations.

Onboarding more developers, faster

At GitHub, we do business with companies of all sizes, from Fortune 50 companies to early stage startups from all over the world. Collecting revenue globally is something we’ve done for years, but with GitHub Sponsors, it’s not just about receiving money—it’s also about paying it out. This means providing a seamless and secure experience for sponsored developers to verify their identities, enter banking information, receive funds, and manage payouts. We also know that, for many maintainers, this will be the first time they’ve been financially rewarded for their contributions to open source. We want to ensure the process is easy for developers to complete, so they can spend more time doing what they love, like contributing to open source.

Stripe Connect Express offers a simple, low overhead onboarding experience that complies with web accessibility guidelines, coupled with the ability to localize the experience to simplify payouts in countries with unique tax laws and compliance regulations. And by not building our own onboarding solution, we’re able to save months of engineering and maintenance time, and start onboarding more sponsored developers today.

Localization

One of Stripe Connect’s features that’s helped us deliver a great experience to our users is their localization support. Stripe just released international support for 30 countries (with more to come) for Connect Express. With this expanded support, Stripe takes care of adjusting for localized rules and regulations for supported countries—no small feat. We couldn’t be more thrilled to be one of the first to offer support across all 30 countries to serve our global community.

Our maintainer onboarding process can now start to keep up with the growing demand of the program. With Stripe, users can onboard using Connect Express, reducing onboarding time from a week-long process to under five minutes.  

And there’s much more to come. While we’re just under four months into the program, we’re focused on making GitHub Sponsors available to even more developers around the world.

Sign up to join the beta—you’ll hear from us soon.

Learn more about GitHub Sponsors

The post Accelerating the GitHub Sponsors beta with Stripe Connect appeared first on The GitHub Blog.

Running GitHub on Rails 6.0

Post Syndicated from Eileen M. Uchitelle original https://github.blog/2019-09-09-running-github-on-rails-6-0/

On August 26, 2019, the GitHub application was deployed to production with 100 percent of traffic on the newest Rails version: 6.0. This change came just 1.5 weeks after the final release of Rails 6.0. Rails upgrades aren’t always something companies announce, but looking back at GitHub’s history of being on a custom fork of Rails 3.2, this upgrade is a big deal. It represents how far we’ve come in the last few years, and the hard work and dedication from our upgrade team made it smoother, easier, and faster than any of our previous upgrades.

At GitHub, we have a lot to celebrate with the release of Rails 6.0 and the subsequent production deploy. First, we were more involved in this release than we have been in any previous release of Rails. GitHub engineers sent over 100 pull requests to Rails 6.0 to improve documentation, fix bugs, add features, and speed up performance. For many GitHub contributors, this was the first time sending changes to the Rails framework, demonstrating that upgrading Rails not only helps GitHub internally, but also improves our developer community as well.

Second, we deployed Rails 6.0 to production without any negative impact to customers—we had only one Rails 6.0 exception occur during testing, and it was hit by a bot! We were able to achieve this level of stability for the upgrade because we were heavily involved with its development. As soon as we finished the Rails 5.2 upgrade last year, we started upgrading our application to Rails 6.0.

Instead of waiting for the final release, we’d upgrade every week by pulling in the latest changes from Rails master and run all of our tests against that new version. This allowed us to find regressions quickly and early—often finding regressions in Rails master just hours after they were introduced. Upgrading weekly made it easy to find where these regressions were introduced since we were bisecting Rails with only a week’s worth of commits instead of more than a year of commits. Once our build for Rails 6.0 was green, we’d merge the pull request to master, and all new code that went into GitHub would need to pass in Rails 5.2 and the newest master build of Rails. Upgrading every week worked so well that we’ll continue using this process for upgrading from 6.0 to 6.1.

In addition to ensuring that Rails 6.0 was stable, we also contributed to the new features of the framework like parallel testing and multiple databases. The code for these tools is used in our production application every day—it’s well-tested, GitHub-approved code in a public, open source framework. By upstreaming this tooling, we’re able to reduce complexity in our code base and set a standard where so many companies once had to implement this functionality on their own.

There are so many wins to staying upgraded that go beyond more security, faster performance, and new features. By staying current with Rails master, we’re influencing the future of the framework to meet our needs and giving back to the open source community in big ways. This process means that the GitHub code base evolves alongside Rails instead of in response to Rails. Investing in our application by staying up to date with the Rails framework has had a tremendous positive effect on our code base and engineering teams. Staying current allows us to invest in our community, invest in our tools for the long term, and improve the experience of working with the GitHub code base for our engineers.

Keep a look out for all the great stuff we’ll be contributing to Rails 6.1 and beyond—this is just the beginning.

The post Running GitHub on Rails 6.0 appeared first on The GitHub Blog.

Data first, SLA always

Post Syndicated from Grab Tech original https://engineering.grab.com/data-first-sla-always

Introducing Trailblazer, the Data Engineering team’s solution to implementing change data capture of all upstream databases. In this article, we introduce the reason why we needed to move away from periodic batch ingestion towards a real time solution and show how we achieved this through an end to end streaming pipeline.

Context

Our mission as Grab’s Data Engineering team is to fulfill 100% of SLAs for data availability to our downstream users. Our 40 person team is responsible for providing accurate and reliable data to data analysts and data scientists so that they can produce actionable reports that will help Grab’s leadership team make data-driven decisions. We maintain data for a variety of business intelligence tools such as Tableau, Presto and Holistics as well as predictive algorithms for all of Grab.

We ingest data from multiple upstream sources, such as relational databases, Kafka or third party applications such as Salesforce or Zendesk. The majority of these source data exists in MySQL and we run ETL pipelines to mirror any updates into our data lake. These pipelines are triggered on an hourly or daily basis and are powered by an in-house Loader application which performs Spark batch ingestion and loading of data from source to sink.

Problems with the Loader application started to surface when Grab’s data exceeded the petabyte threshold. As such for larger tables, the most practical method to ingest data was to perform ETL only on rows that were updated within a specified timeframe. This is akin to issuing the query

SELECT * FROM table WHERE updated >= [start_time] AND updated < [end_time]

Now imagine two situations. One, firing this query to a huge table without an updated field. Two, firing the same query to the huge table, this time without indexes on the updated field. In the first scenario, the query will never work and we can never perform incremental ingestion on the table based on a timed window. The second scenario carries the dangers of creating high CPU load to replicate the database that we are querying from. Neither has an ideal outcome.

One other problem that we identified was the unpredictability of growth in data volume. Tables smaller than one gigabyte were ingested by fully scanning the table and overwriting the data in the data lake. This worked out well for us until the table size increased exponentially, at which point our Spark jobs failed due to JDBC timeouts. If we were only dealing with a handful of tables, this issue could have been addressed by switching our data ingestion strategy from full scan to a timed window.

When assessing the issue, we discovered that there were hundreds of tables running under the full scan strategy, all of them potentially crashing our data system, all time bombs silently waiting to explode.

The team urgently needed a new approach to ETL. Our Loader application was highly coupled to upstream table characteristics. We needed to find solutions that were truly scalable, which meant decoupling our pipelines from the upstream.

Change data capture (CDC)

Much like event sourcing, any log change to the database is captured and streamed out for downstream applications to consume. This process is lightweight since any row level update to the table is instantly captured by a real time processor, avoiding the need for large chunked queries on the table. In addition, CDC works regardless of upstream table definition, so we do not need to worry about missing updated columns impacting our data migration process.

Binary Logs (binlogs) are the CDC agents of MySQL. All updates, insertions or deletions performed on the table are captured as a series of logged events containing the past state of the row and it’s newly modified state. Check out the binlogs reference to find out more.

In order to persist all binlogs generated upstream, our team created a Spark Structured Streaming application called Trailblazer. Trailblazer streams all MySQL binlogs to our data lake. These binlogs serve as a foundation for us to build Presto tables for data auditing and help to remove the direct dependency of our batch ETL jobs to the source MySQL.

Trailblazer is an amalgamation of various data streaming stacks. Binlogs are captured by Debezium which runs on Kafka connect clusters. All binlogs are sent to our Kafka cluster, which is managed by the Data Engineering Infrastructure team and are streamed out to a real time bucket via a Spark structured streaming application. Hourly or daily ETL compaction jobs ingests the change logs from the real time bucket to materialize tables for downstream users to consume.

CDC in action where binlogs are streamed to Kafka via Debezium before being consumed by Trailblazer streaming & compaction services
CDC in action where binlogs are streamed to Kafka via Debezium before being consumed by Trailblazer streaming & compaction services

 

Some statistics

To date, we are streaming hundreds oftables across 60 Spark streaming jobs and with the constant increase in Grab’s database instances, the numbers are expected to keep growing.

Designing Trailblazer streams

We built our streaming application using Spark structured streaming 2.3. Structured streaming was designed to remove the technical aspects of provisioning streams. Developers can focus on perfecting business logic without worrying about fundamentals such as checkpoint management or reading and writing to data sources.

Key architecture for Trailblazer streaming
Key architecture for Trailblazer streaming

 

In the design phase, we made sure to follow several key principles that helped in managing our streams.

Checkpoints have to be externally managed

Structured streaming manages checkpoints both in a local directory and in a ‘_metadata’ directory on S3 buckets, such that the state of the stream can be restored in the event of failure and restart.

This is all well and good, with two exceptions. First, changing the starting point of data ingestion meant ssh-ing into the machine and manipulating metadata, which could be extremely dangerous. Second, we could not assume cluster prevalence since clusters can die and be recreated with data erased from its local disk or the distributed file system.

Our solution was to do a work around at the application level. All checkpoints will be stored in temporary directories with the existing timestamp appended as path (eg /tmp/checkpoint/job_A/1560697200/… ). A linearly progressive timestamp guarantees that the same directory will never be reused by new instances of the stream. This explains why we never restore its state from local disk but instead, store all checkpoints in a highly available Redis cluster, with key as the Kafka topic and value as a JSON of partition : offset.

Key

debz-schema-A.schema_A.table_B

Value

{"11":19183566,"12":19295602,"13":18992606[[a]](#cmnt1)[[b]](#cmnt2)[[c]](#cmnt3)[[d]](#cmnt4)[[e]](#cmnt5)[[f]](#cmnt6),"14":19269499,"15":19197199,"16":19060873,"17":19237853,"18":19107959,"19":19188181,"0":19193976,"1":19072585,"2":19205764,"3":19122454,"4":19231068,"5":19301523,"6":19287447,"7":19418871,"8":19152003,"9":19112431,"10":19151479}
Example of how offsets are stored in Redis as Key : Value pairs

 

Fortunately, structured streaming provides the StreamQueryListener class which we can use to register checkpoints after the completion of each microbatch.

Streams must handle 0, 1 or 1 million data

Scalability is at the heart of all well-designed applications. Spark streaming jobs are built for scalability in the face of varying data volumes.

In general, the rate of messages input to Kafka is cyclical across 24 hrs. Streaming jobs should be robust enough to handle data loads during peak hours of the day without breaching microbatch timing
In general, the rate of messages input to Kafka is cyclical across 24 hrs. Streaming jobs should be robust enough to handle data loads during peak hours of the day without breaching microbatch timing

 

There are a few settings that we can configure to influence the degree of scalability for a streaming app

  • spark.dynamicAllocation.enabled=true gives spark autonomy to provision / revoke executors to suit the workload
  • spark.dynamicAllocation.maxExecutors controls the maximum job parallelism
  • maxOffsetsPerTrigger controls the maximum number of messages ingested from Kafka per microbatch
  • trigger controls the duration between microbatchs and is a property of the DataStreamWriter class

Data as key health indicator

Scaling the number of streaming jobs without prior collection of performance metrics is a bad idea. There is a high chance that you will discover a dead stream when checking your stream hours after initialization. I’ll cite Murphy’s law as proof.

Thus we vigilantly monitored our data streams. We used tools such as Datadog for metric monitoring, Slack for oncall issue reporting, PagerDuty for urgent cases and our inhouse data auditor as a service (DASH) for counts discrepancy reporting between streamed and source data. More details on monitoring will be discussed in the later part.

Streams are ephemeral

Streams may die due to a hundred and one reasons so don’t blame yourself or your programming insecurities. Issues with upstream dependencies, such as a node within your Kafka cluster running out of disk space, could lead to partition unavailability which would crash the application. On one occasion, our streaming application was unable to resolve DNS when writing to AWS S3 storage. This amounted to multiple failures within our Spark job that eventually culminated in the termination of the stream.

In this case, allow the stream to  shutdown gracefully, send out your alerts and have a mechanism in place to retry the failed stream. We run all streaming jobs on Airflow and any failure to the stream will automatically be retried through a new task issued by the scheduler.

If you have had experience with large scale management of streams, please leave a comment so we can continue this discussion!

Monitoring data streams

Here are some key features that were set up to monitor our streams.

Running : Active jobs ratio

The number of streaming jobs could increase in the future, thus becoming a challenge for the oncall team to track all jobs that are supposed to be up and running.

One proposal  is  to track the number of jobs in production against the number of jobs that are actually running. By querying MySQL tables, we can filter out all the jobs that are meant to be active. Since Trailblazer streams are spark-submit jobs managed by YARN, we can query YARN’s resource manager REST API to retrieve  all the jobs that are running. We then construct a ratio of running : active jobs and report them to Datadog. If the ratio is not 1 for an extended duration, an alert will be issued for the oncall to take action.

If the ratio of running : active jobs falls below 1 for a period of time, we will immediately trigger an alert
If the ratio of running : active jobs falls below 1 for a period of time, we will immediately trigger an alert

 

Microbatch runtime

We define a 30 second window for each microbatch and track the actual runtime using metrics reported by the query listener. A runtime that exceeds the designated window is a potential indicator that the streaming job is deprived of resources and needs to be scaled up.

Job liveliness

Each job reports its health by emitting a count of 1 heartbeat. This heartbeat is created at the end of every microbatch via a query listener. This process is useful in detecting stale jobs (jobs that are registered as RUNNING in YARN but are actually hung).

Kafka offset divergence

In order to ensure that the message output rate to the consumer exceeds the message input rate from the producer, we sum up all presently ingested topic-partition offsets and compare that value to the sum of all topic-partition end offsets in Kafka. We then add an alerting logic on top of these metrics to inform the oncall team if the difference between the two values grows too big.

It is important to track the offset divergence parameter as streams can be lagging. Should the rate of consumption fall below the rate of message production, we would run the risk of falling short of Kafka’s retention window, leading to data losses.

Hourly data checks

DASH runs hourly and serves as our first line of defence to detect any data quality issues within the streams. We issue queries to the source database and our streaming layer to confirm that the ID counts of data created within the last hour match.

DASH helps in the early detection of upstream issues. We have noticed cases where our Debezium connectors failed and our checker reported fewer data than expected since there were no incoming messages to Kafka.

DASH matches and mismatches reported to Slack
DASH matches and mismatches reported to Slack
DASH matches and mismatches reported to Slack

 

Materializing tables through compaction

Having CDC data in our data lake does not conclude our responsibilities. Batched compaction allows us to apply all captured CDC, to be available as Presto tables for downstream consumption. The job is set to trigger hourly and process all changes to the database within the past hour.  For example, changes to a record are visible in real-time, but the latest state of the record will not be reflected until the next time a batch job runs. We addressed several issues with streaming during this phase.

Deduplication of data

Trailblazer was not built to deliver exactly once guarantees. We ensure that the issues regarding duplicated CDCs are addressed during compaction.

Availability of all data until certain hour

We want to make sure that downstream pipelines use output data of the hourly batch job only when the pipeline has all records for that hour. In case there is an event that is processed late by streaming, the current pipeline will wait until the data is completed. In this case, we are consciously choosing consistency over availability for our downstream users. For example, missing a few insert booking records in peak hours due to consumer processing delay can generate the wrong downstream results leading to miscalculation in revenue. We want to start  downstream processes only when the data for the hour or day is complete.

Need for latest state of each event

Our compaction job performs upserts on the data to ensure that our downstream users can consume  records in their latest state.  

Future applications

Trailblazer is a milestone for the Data Engineering team as it represents our commitment to achieve large scale data streams to reduce latencies for our end users. Moving ahead, our team will be exploring how we can further optimize streaming jobs by analysing data trends over time and to build applications such as snapshot tables on top of the CDCs being streamed in our data lake.

How We Built A Logging Stack at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/how-built-logging-stack

And Solved Our Inhouse Logging Problem

Problem:

Let me take you back a year ago at Grab. When we lacked any visualizations or metrics for our service logs. When performing a query for a string from the last three days was something only run before you went for a beverage.

When a service stops responding, Grab’s core problems were and are:

  • We need to know it happened before the customer does.
  • We need to know why it happened.
  • We need to solve our customers’ problems fast.

We had a hodgepodge of log-based solutions for developers when they needed to figure out the above, or why a driver never showed up, or a customer wasn’t receiving our promised promotions. These included logs in a cloud based storage service (which could take hours to retrieve). Or a SAS provider constantly timing out on our queries. Or even asking our SREs to fetch logs from the potential machines for the service engineer, a rather laborious process.

Here’s what we did with our logs to solve these problems.

Issues:

Our current size and growth rate ruled out several available logging systems. By size, we mean a LOT of data and a LOT of users who search through hundreds of billions of logs to generate reports. Or who track down that one user who managed to find that pesky corner case in our code.

When we started this project, we generated 25TB of logging data a day. Our first thought was “Do we really need all of these logs?”. To this day our feeling is “probably not”.

However, we can’t always define what another developer can and cannot do. Besides, this gave us an amazing opportunity to build something to allow for all that data!

Some of our SREs had used the ELK Stack (Elasticsearch / Logstash / Kibana). They thought it could handle our data and access loads, so it was our starting point.

How We Built a Multi-Petabyte Cluster:

Information Gathering:

It started with gathering numbers. How much data did we produce each day? How many days were retained? What’s a reasonable response time to wait for?

Before starting a project, understand your parameters. This helps you spec out your cluster, get buy-in from higher ups, and increase your success rate when rolling out a product used by the entire engineering organization. Remember, if it’s not better than what they have now, why will they switch?

A good starting point was opening the floor to our users. What features did they want? If we offered a visualization suite so they can see ERROR event spikes, would they use it? How about alerting them about SEGFAULTs? Hands down the most requested feature was speed; “I want an easy webUI that shows me the user ID when I search for it, and get all the results in <5 seconds!”

Getting Our Feet Wet:

New concerns always pop up during a project. We’re sure someone has correlated the time spent in R&D to the number of problems. We had an always moving target, since as our proof of concept began, our daily logger volume kept increasing.

Thankfully, using Elasticsearch as our data store meant we could fully utilize horizontal scaling. This let us start with a simple 5 node cluster as we built out our proof-of-concept (POC). Once we were ready to onboard more services, we could move into a larger footprint.

The specs at the time called for about 80 nodes to handle all our data. But if we designed our system correctly, we’d only need to increase the number of Elasticsearch nodes as we enrolled more customers. Our key operating metrics were CPU utilization, heap memory needed for the JVM, and total disk space.

Initial Design:

First, we set up tooling to use Ansible both to launch a machine and to install and configure Elasticsearch. Then we were ready to scale.

Our initial goal was to keep the design as simple as possible. Opting to allow each node in our cluster to perform all responsibilities. In this setup each node would behave as all of the four available types:

  • Ingest: Used for transforming and enriching documents before sending them to data nodes for indexing.
  • Coordinator: Proxy node for directing search and indexing requests.
  • Master: Used to control cluster operations and determine a quorum on indexed documents.
  • Data: Nodes that hold the indexed data.

These were all design decisions made to move our proof of concept along, but in hindsight they might have created more headaches down the road with troubleshooting, indexing speed, and general stability. Remember to do your homework when spec’ing out your cluster.

It’s challenging to figure out why you are losing master nodes because someone filled up the field data cache performing a search. Separating your nodes can be a huge help in tracking down your problem.

We also decided to further reduce complexity by going with ingest nodes over Logstash. But at the time, the documentation wasn’t great so we had a lot of trial and error in figuring out how they work. Particularly as compared to something more battle tested like Logstash.

If you’re unfamiliar with ingest node design, they are lightweight proxies to your data nodes that accept a bulk payload, perform post-processing on documents,and then send the documents to be indexed by your data nodes. In theory, this helps keep your entire pipeline simple. And in Elasticsearch’s defense, ingest nodes have made massive improvements since we began.

But adding more ingest nodes means ADDING MORE NODES! This can create a lot of chatter in your cluster and cause more complexity when  troubleshooting problems. We’ve seen when an ingest node failing in an odd way caused larger cluster concerns than just a failed bulk send request.

Monitoring:

This isn’t anything new, but we can’t overstate the usefulness of monitoring. Thankfully, we already had a robust tool called Datadog with an additional integration for Elasticsearch. Seeing your heap utilization over time, then breaking it into smaller graphs to display the field data cache or segment memory, has been a lifesaver. There’s nothing worse than a node falling over due to an OOM with no explanation and just hoping it doesn’t happen again.

At this point, we’ve built out several dashboards which visualize a wide range of metrics from query rates to index latency. They tell us if we sharply drop on log ingestion or if circuit breakers are tripping. And yes, Kibana has some nice monitoring pages for some cluster stats. But to know each node’s JVM memory utilization on a 400+ node cluster, you need a robust metric system.

Pitfalls:

Common Problems:

There are many blogs about the common problems encountered when creating an Elasticsearch cluster and Elastic does a good job of keeping blog posts up to date. We strongly encourage you to read them. Of course, we ran into classic problems like ensuring our Java objects were compressed (Hints: Don’t exceed 31GB of heap for your JVM and always confirm you’ve enabled compression).

But we also ran into some interesting problems that were less common. Let’s look at some major concerns you have to deal with at this scale.

Grab’s Problems:

Field Data Cache:

So, things are going well, all your logs are indexing smoothly, and suddenly you’re getting Out Of Memory (OOMs) events on your data nodes. You rush to find out what’s happening, as more nodes crash.

A visual representation of your JVM heap’s memory usage is very helpful here. You can always hit the Elasticsearch API, but after adding more then 5 nodes to your cluster this kind of breaks down. Also, you don’t want to know what’s going on while a node is down, but what happened before it died.

Using our graphs, we determined the field data cache went from virtually zero memory used in the heap to 20GB! This forced us to read up on how this value is set, and, as of this writing, the default value is still 100% of the parent heap memory. Basically, this breaks down to allowing 70% of your total heap being allocated to a single search in the form of field data.

Now, this should be a rare case and it’s very helpful to keep the field names and values in memory for quick lookup. But, if, like us, you have several trillion documents, you might want to watch out.

From our logs, we tracked down a user who was sorting by the _id field. We believe this is a design decision in how Kibana interacts with Elasticsearch. A good counter argument would be a user wants a quick memory lookup if they search for a document using the _id. But for us, this meant a user could load into memory every ID in the indices over a 14 day period.

The consequences? 20+GB of data loaded into the heap before the circuit breaker tripped. It then only took 2 queries at a time to knock a node over.

You can’t disable indexing that field, and you probably don’t want to. But you can prevent users from stumbling into this and disable the _id field in the Kibana advanced settings. And make sure you re-evaluate your circuit breakers. We drastically lowered the available field cache and removed any further issues.

Translog Compression:

At first glance, compression seems an obvious choice for shipping shards between nodes. Especially if you have the free clock cycles, why not minimize the bandwidth between nodes?

However, we found compression between nodes can drastically slow down shard transfers. By disabling compression, shipping time for a 50GB shard went from 1h to 20m. This was because Lucenesegments are already compressed, a new issue we ran into full force and are actively working with the community to fix. But it’s also a configuration to watch out for in your setup, especially if you want a fast recovery of a shard.

Segment Memory:

Most of our issues involved the heap memory being exhausted. We can’t stress enough the importance of having visualizations around how the JVM is used. We learned this lesson the hard way around segment memory.

This is a prime example of why you need to understand your data when building a cluster. We were hitting a lot of OOMs and couldn’t figure out why. We had fixed the field cache issue, but what was using all our RAM?

There is a reason why having a 16TB data node might be a poorly spec’d machine. Digging into it, we realized we simply allocated too many shards to our nodes. Looking up the total segment memory used per index should give a good idea of how many shards you can put on a node before you start running out of heap space. We calculated on average our 2TB indices used about 5GB of segment memory spread over 30 nodes.

The numbers have since changed and our layout was tweaked, but we came up with calculations showing we could allocate about 8TB of shards to a node with 32GB heap memory before we running into issues. That’s if you really want to push it, but it’s also a metric used to keep your segment memory per node around 50%. This allows enough memory to run queries without knocking out your data nodes. Naturally this led us to ask “What is using all this segment memory per node?”

Index Mapping and Field Types:

Could we lower how much segment memory our indices used to cut our cluster operation costs? Using the segments data found in the ES cluster and some simple Python loops, we tracked down the total memory used per field in our index.

We used a lot of segment memory for the _id field (but can’t do much about that). It also gave us a good breakdown of our other fields. And we realized we indexed fields in completely unnecessary ways. A few fields should have been integers but were keyword fields. We had fields no one would ever search against and which could be dropped from index memory.

Most importantly, this began our learning process of how tokens and analyzers work in Elasticsearch/Lucene.

Picking the Wrong Analyzer:

By default, we use Elasticsearch’s Standard Analyzer on all analyzed fields. It’s great, offering a very close approximation to how users search and it doesn’t explode your index memory like an N-gram tokenizer would.

But it does a few things we thought unnecessary, so we thought we could save a significant amount of heap memory. For starters, it keeps the original tokens: the Standard Analyzer would break IDXVB56KLM into tokens IDXVB, 56,  and KLM. This usually works well, but it really hurts you if you have a lot of alphanumeric strings.

We never have a user search for a user ID as a partial value. It would be more useful to only return the entire match of an alphanumeric string. This has the added benefit of only storing the single token in our index memory. This modification alone stripped a whole 1GB off our index memory, or at our scale meant we could eliminate 8 nodes.

We can’t stress enough how cautious you need to be when changing analyzers on a production system. Throughout this process, end users were confused why search results were no longer returning or returning weird results. There is a nice kibana pluginthat gives you a representation of how your tokens look with a different analyzer, or use the build in ES tools to get the same understanding.

Be Careful with Cloud Maintainers:

We realized that running a cluster at this scale is expensive. The hardware alone sets you back a lot, but our hidden bigger cost was cross traffic between availability zones.

Most cloud providers offer different “zones” for your machines to entice you to achieve a High-Availability environment. That’s a very useful thing to have, but you need to do a cost/risk analysis. If you migrate shards from HOT to WARM to COLD nodes constantly, you can really rack up a bill. This alone was about 30% of our total cluster cost, which wasn’t cheap at our scale.

We re-worked how our indices sat in the cluster. This let us create a different index for each zone and pin logging data so it never left the zone it was generated in. One small tweak to how we stored data cut our costs dramatically. Plus, it was a smaller scope for troubleshooting. We’d know a zone was misbehaving and could focus there vs. looking at everything.

Conclusion:

Running our own logging stack started as a challenge. We roughly knew the scale we were aiming for; it wasn’t going to be trivial or easy. A year later, we’ve gone from pipe-dream to production and immensely grown the team’s ELK stack knowledge.

We could probably fill 30 more pages with odd things we ran into, hacks we implemented, or times we wanted to pull our hair out. But we made it through and provide a superior logging platform to our engineers at a significant price reduction while maintaining a stable platform.

There are many different ways we could have started knowing what we do now. For example, using Logstash over Ingest nodes, changing default circuit breakers, and properly using heap space to prevent node failures. But hindsight is 20/20 and it’s rare for projects to not change.

We suggest anyone wanting to revamp their centralized logging system look at the ELK solutions. There is a learning curve, but the scalability is outstanding and having subsecond lookup time for assisting a customer is phenomenal. But, before you begin, do your homework to save yourself weeks of troubleshooting down the road. In the end though, we’ve received nothing but praise from Grab engineers about their experiences with our new logging system.

Catwalk: Serving Machine Learning Models at Scale

Post Syndicated from Grab Tech original https://engineering.grab.com/catwalk-serving-machine-learning-models-at-scale

Introduction

Grab’s unwavering ambition is to be the best Super App in Southeast Asia that adds value to the everyday for our customers. In order to achieve that, the customer experience must be flawless for each and every Grab service. Let’s take our frequently used ride-hailing service as an example. We want fair pricing for both drivers and passengers, accurate estimation of ETAs, effective detection of fraudulent activities, and ensured ride safety for our customers. The key to perfecting these customer journeys is artificial intelligence (AI).

Grab has a tremendous amount of data that we can leverage to solve complex problems such as fraudulent user activity, and to provide our customers personalized experiences on our products. One of the tools we are using to make sense of this data is machine learning (ML).

As Grab made giant strides towards increasingly using machine learning across the organization, more and more teams were organically building model serving solutions for their own use cases. Unfortunately, these model serving solutions required data scientists to understand the infrastructure underlying them. Moreover, there was a lot of overlap in the effort it took to build these model serving solutions.

That’s why we came up with Catwalk: an easy-to-use, self-serve, machine learning model serving platform for everyone at Grab.

Goals

To determine what we wanted Catwalk to do, we first looked at the typical workflow of our target audience – data scientists at Grab:

  • Build a trained model to solve a problem.
  • Deploy the model to their project’s particular serving solution. If this involves writing to a database, then the data scientists need to programmatically obtain the outputs, and write them to the database. If this involves running the model on a server, the data scientists require a deep understanding of how the server scales and works internally to ensure that the model behaves as expected.
  • Use the deployed model to serve users, and obtain feedback such as user interaction data. Retrain the model using this data to make it more accurate.
  • Deploy the retrained model as a new version.
  • Use monitoring and logging to check the performance of the new version. If the new version is misbehaving, revert back to the old version so that production traffic is not affected. Otherwise run an AB test between the new version and the previous one.

We discovered an obvious pain point – the process of deploying models requires additional effort and attention, which results in data scientists being distracted from their problem at hand. Apart from that, having many data scientists build and maintain their own serving solutions meant there was a lot of duplicated effort. With Grab increasingly adopting machine learning, this was a state of affairs that could not be allowed to continue.

To address the problems, we came up with Catwalk with goals to:

  1. Abstract away the complexities and expose a minimal interface for data scientists
  2. Prevent duplication of effort by creating an ML model serving platform for everyone in Grab
  3. Create a highly performant, highly available, model versioning supported ML model serving platform and integrate it with existing monitoring systems at Grab
  4. Shorten time to market by making model deployment self-service

What is Catwalk?

In a nutshell, Catwalk is a platform where we run Tensorflow Serving containers on a Kubernetes cluster integrated with the observability stack used at Grab.

In the next sections, we are going to explain the two main components in Catwalk – Tensorflow Serving and Kubernetes, and how they help us obtain our outlined goals.

What is Tensorflow Serving?

Tensorflow Serving is an open-source ML model serving project by Google. In Google’s own words, “Tensorflow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. It makes it easy to deploy new algorithms and experiments, while keeping the same server architecture and APIs. Tensorflow Serving provides out-of-the-box integration with Tensorflow models, but can be easily extended to serve other types of models and data.”

Why Tensorflow Serving?

There are a number of ML model serving platforms in the market right now. We chose Tensorflow Serving because of these three reasons, ordered by priority:

  1. Highly performant. It has proven performance handling tens of millions of inferences per second at Google according to their website.
  2. Highly available. It has a model versioning system to make sure there is always a healthy version being served while loading a new version into its memory
  3. Actively maintained by the developer community and backed by Google

Even though, by default, Tensorflow Serving only supports models built with Tensorflow, this is not a constraint, though, because Grab is actively moving toward using Tensorflow.

How are we using Tensorflow Serving?

In this section, we will explain how we are using Tensorflow Serving and how it helps abstract away complexities for data scientists.

Here are the steps showing how we are using Tensorflow Serving to serve a trained model:

  1. Data scientists export the model using tf.saved_model API and drop it to an S3 models bucket. The exported model is a folder containing model files that can be loaded to Tensorflow Serving.
  2. Data scientists are granted permission to manage their folder.
  3. We run Tensorflow Serving and point it to load the model files directly from the S3 models bucket. Tensorflow Serving supports loading models directly from S3 out of the box. The model is served!
  4. Data scientists come up with a retrained model. They export and upload it to their model folder.
  5. As Tensorflow Serving keeps watching the S3 models bucket for new models, it automatically loads the retrained model and serves. Depending on the model configuration, it can either gracefully replace the running model version with a newer version or serve multiple versions at the same time.
Tensorflow Serving Diagram

The only interface to data scientists is a path to their model folder in the S3 models bucket. To update their model, they upload exported models to their folder and the models will automatically be served. The complexities are gone. We’ve achieved one of the goals!

Well, not really…

Imagine youare going to run Tensorflow Serving to serve one model in a cloud provider, which means you  need a compute resource from a cloud provider to run it. Running it on one box doesn’t provide high availability, so you need another box running the same model. Auto scaling is also needed in order to scale out based on the traffic. On top of these many boxes lies a load balancer. The load balancer evenly spreads incoming traffic to all the boxes, thus ensuring that there is a single point of entry for any clients, which can be abstracted away from the horizontal scaling. The load balancer also exposes an HTTP endpoint to external users. As a result, we form a Tensorflow Serving cluster that is ready to serve.

Next, imagine you have more models to deploy. You have three options

  1. Load the models into the existing cluster – having one cluster serve all models.
  2. Spin up a new cluster to serve each model – having multiple clusters, one cluster serves one model.
  3. Combination of 1 and 2 – having multiple clusters, one cluster serves a few models.

The first option would not scale, because it’s just not possible to load all models into one cluster as the cluster has limited resources.

The second option will definitely work but it doesn’t sound like an effective process, as you need to create a set of resources every time you have a new model to deploy. Additionally, how do you optimize the usage of resources, e.g., there might be unutilized resources in your clusters that could potentially be shared by the rest.

The third option looks promising, you can manually choose the cluster to deploy each of your new models into so that all the clusters’ resource utilization is optimal. The problem is you have to manuallymanage it. Managing 100 models using 25 clusters can be a challenging task. Furthermore, running multiple models in a cluster can also cause a problem as different models usually have different resource utilization patterns and can interfere with each other. For example, one model might use up all the CPU and the other model won’t be able to serve anymore.

Wouldn’t it be better if we had a system that automatically orchestrates model deployments based on resource utilization patterns and prevents them from interfering with each other? Fortunately, that  is exactly what Kubernetes is meant to do!

So what is Kubernetes?

Kubernetes abstracts a cluster of physical/virtual hosts (such as EC2) into a cluster of logical hosts (pods in Kubernetes terms). It provides a container-centric management environment. It orchestrates computing, networking, and storage infrastructure on behalf of user workloads.

Let’s look at some of the definitions of Kubernetes resources

Tensorflow Serving Diagram
  • Cluster – a cluster of nodes running Kubernetes.
  • Node – a node inside a cluster.
  • Deployment – a configuration to instruct Kubernetes the desired state of an application. It also takes care of rolling out an update (canary, percentage rollout, etc), rolling back and horizontal scaling.
  • Pod – a single processing unit. In our case, Tensorflow Serving will be running as a container in a pod. Pod can have CPU/memory limits defined.
  • Service – an abstraction layer that abstracts out a group of pods and exposes the application to clients.
  • Ingress – a collection of routing rules that govern how external users access services running in a cluster.
  • Ingress Controller – a controller responsible for reading the ingress information and processing that data accordingly such as creating a cloud-provider load balancer or spinning up a new pod as a load balancer using the rules defined in the ingress resource.

Essentially, we deploy resources to instruct Kubernetes the desired state of our application and Kubernetes will make sure that it is always the case.

How are we using Kubernetes?

In this section, we will walk you through how we deploy Tensorflow Serving in Kubernetes cluster and how it makes managing model deployments very convenient.

We used a managed Kubernetes service, to create a Kubernetes cluster and manually provisioned compute resources as nodes. As a result, we have a Kubernetes cluster with nodes that are ready to run applications.

An application to serve one model consists of

  1. Two or more Tensorflow Serving pods that serves a model with an autoscaler to scale pods based on resource consumption
  2. A load balancer to evenly spread incoming traffic to pods
  3. An exposed HTTP endpoint to external users

In order to deploy the application, we need to

  1. Deploy a deployment resource specifying
  2. Number of pods of Tensorflow Serving
  3. An S3 url for Tensorflow Serving to load model files
  4. Deploy a service resource to expose it
  5. Deploy an ingress resource to define an HTTP endpoint url

Kubernetes then allocates Tensorflow Serving pods to the cluster with the number of pods according to the value defined in deployment resource. Pods can be allocated to any node inside the cluster, Kubernetes makes sure that the node it allocates a pod into has sufficient resources that the pod needs. In case there is no node that has sufficient resources, we can easily scale out the cluster by adding new nodes into it.

In order for the rules defined inthe ingressresource to work, the cluster must have an ingress controller running, which is what guided our choice of the load balancer. What an ingress controller does is simple: it keeps checking the ingressresource, creates a load balancer and defines rules based on rules in the ingressresource. Once the load balancer is configured, it will be able to redirect incoming requests to the Tensorflow Serving pods.

That’s it! We have a scalable Tensorflow Serving application that serves a model through a load balancer! In order to serve another model, all we need to do is to deploy the same set of resources but with the model’s S3 url and HTTP endpoint.

To illustrate what is running inside the cluster, let’s see how it looks like when we deploy two applications: one for serving pricing model another one for serving fraud-check model. Each application is configured to have two Tensorflow Serving pods and exposed at /v1/models/model

Tensorflow Serving Diagram

There are two Tensorflow Serving pods that serve fraud-check model and exposed through a load balancer. Same for the pricing model, the only differences are the model it is serving and the exposed HTTP endpoint url. The load balancer rules for pricing and fraud-check model look like this

IfThen forward to
Path is /v1/models/pricingpricing pod ip-1
pricing pod ip-2
Path is /v1/models/fraud-checkfraud-check pod ip-1
fraud-check pod ip-2

Stats and Logs

The last piece is how stats and logs work. Before getting to that, we need to introduce DaemonSet. According to the document, DaemonSet ensures that all (or some) nodes run a copy of a pod. As nodes are added to the cluster, pods are added to them. As nodes are removed from the cluster, those pods are garbage collected. Deleting a DaemonSet will clean up the pods it created.

We deployed datadog-agent and filebeat as a DaemonSet. As a result, we always have one datadog-agent pod and one filebeat pod in all nodes and they are accessible from Tensorflow Serving pods in the same node. Tensorflow Serving pods emit a stats event for every request to datadog-agent pod in the node it is running in.

Here is a sample of DataDog stats:

DataDog stats

And logs that we put in place:

Logs

Benefits Gained from Catwalk

Catwalk has become the go-to, centralized system to serve machine learning models. Data scientists are not required to take care of the serving infrastructure hence they can focus on what matters the most: come up with models to solve customer problems. They are only required to provide exported model files and estimation of expected traffic in order to prepare sufficient resources to run their model. In return, they are presented with an endpoint to make inference calls to their model, along with all necessary tools for monitoring and debugging. Updating the model version is self-service, and the model improvement cycle is much shorter than before. We used to count in days, we now count in minutes.

Future Plans

Improvement on Automation

Currently, the first deployment of any model will still need some manual task from the platform team. We aim to automate this processentirely. We’ll work with our awesome CI/CD team who is making the best use of Spinnaker.

Model serving on mobile devices

As a platform, we are looking at setting standards for model serving across Grab. This includes model serving on mobile devices as well. Tensorflow Serving also provides a Lite version to be used on mobile devices. It is a whole new paradigm with vastly different tradeoffs for machine learning practitioners. We are quite excited to set some best practices in this area.

gRPC support

Catwalk currently supports HTTP/1.1. We’ll hook Grab’s service discovery mechanism to open gRPC traffic, which TFS already supports.

If you are interested in building pipelines for machine learning related topics, and you share our vision of driving South East Asia forward, come join us!

C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages

Post Syndicated from Kavita Ganesan original https://github.blog/2019-07-02-c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages/

GitHub hosts over 300 programming languages—from commonly used languages such as Python, Java, and Javascript to esoteric languages such as Befunge, only known to very small communities.

JavaScript is the top programming language on GitHub, followed by Java and HTML
Figure 1: Top 10 programming languages hosted by GitHub by repository count

One of the necessary challenges that GitHub faces is to be able to recognize these different languages. When some code is pushed to a repository, it’s important to recognize the type of code that was added for the purposes of search, security vulnerability alerting, and syntax highlighting—and to show the repository’s content distribution to users.

Despite the appearance, language recognition isn’t a trivial task. File names and extensions, while providing a good indication of what the coding language is likely to be, do not offer the full picture. In fact, many extensions are associated with the same language (e.g., “.pl”, “.pm”, “.t”, “.pod” are all associated with Perl), while others are ambiguous and used almost interchangeably across languages (e.g., “.h” is commonly used to indicate many languages of the “C” family, including C, C++, and Objective-C). In other cases, files are simply provided with no extension (especially for executable scripts) or with the incorrect extension (either on purpose or accidentally).

Linguist is the tool we currently use to detect coding languages at GitHub. Linguist a Ruby-based application that uses various strategies for language detection, leveraging naming conventions and file extensions and also taking into account Vim or Emacs modelines, as well as the content at the top of the file (shebang). Linguist handles language disambiguation via heuristics and, failing that, via a Naive Bayes classifier trained on a small sample of data. 

Although Linguist does a good job making file-level language predictions (84% accuracy), its performance declines considerably when files use unexpected naming conventions and, crucially, when a file extension is not provided. This renders Linguist unsuitable for content such as GitHub Gists or code snippets within README’s, issues, and pull requests.

In order to make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua based on an Artificial Neural Network (ANN) architecture which can handle language predictions in tricky scenarios. The current version of the model is able to make predictions for the top 50 languages hosted by GitHub and surpasses Linguist in accuracy and performance. 

The Nuts and Bolts Behind OctoLingua

OctoLingua was built from scratch using Python, Keras with TensorFlow backend—and is built to be accurate, robust, and easy to maintain. In this section, we describe our data sources, model architecture, and performance benchmark for OctoLingua. We also describe what it takes to add support for a new language. 

Data sources

The current version of OctoLingua was trained on files retrieved from Rosetta Code and from a set of quality repositories internally crowdsourced. We limited our language set to the top 50 languages hosted on GitHub.

Rosetta Code was an excellent starter dataset as it contained source code for the same task expressed in different programming languages. For example, the task of generating a Fibonacci sequence is expressed in C, C++, CoffeeScript, D, Java, Julia, and more. However, the coverage across languages was not uniform where some languages only have a handful of files and some files were just too sparsely populated. Augmenting our training set with some additional sources was therefore necessary and substantially improved language coverage and performance.

Our process for adding a new language is now fully automated. We programmatically collect source code from public repositories on GitHub. We choose repositories that meet a minimum qualifying criteria such as having a minimum number of forks, covering the target language and covering specific file extensions. For this stage of data collection, we determine the primary language of a repository using the classification from Linguist. 

Features: leveraging prior knowledge

Traditionally, for text classification problems with Neural Networks, memory-based architectures such as Recurrent Neural Networks (RNN) and Long Short Term Memory Networks (LSTM) are often employed. However, given that programming languages have differences in vocabulary, commenting style, file extensions, structure, libraries import style and other minor differences, we opted for a simpler approach that leverages all this information by extracting some relevant features in tabular form to be fed to our classifier. The features currently extracted are as follows:

  1. Top five special characters per file
  2. Top 20 tokens per file
  3. File extension
  4. Presence of certain special characters commonly used in source code files such as colons, curly braces, and semicolons

The Artificial Neural Network (ANN) model

We use the above features as input to a two-layer Artificial Neural Network built using Keras with Tensorflow backend. 

The diagram below shows that the feature extraction step produces an n-dimensional tabular input for our classifier. As the information moves along the layers of our network, it is regularized by dropout and ultimately produces a 51-dimensional output which represents the predicted probability that the given code is written in each of the top 50 GitHub languages plus the probability that it is not written in any of those.

image
Figure 2: The ANN Structure of our initial model (50 languages + 1 for “other”)

We used 90% of our dataset for training over approximately eight epochs. Additionally, we removed a percentage of file extensions from our training data at the training step, to encourage the model to learn from the vocabulary of the files, and not overfit on the file extension feature, which is highly predictive.

Performance benchmark

OctoLingua vs. Linguist

In Figure 3, we show the F1 Score (harmonic mean between precision and recall) of OctoLingua and Linguist calculated on the same test set (10% from our initial data source). 

Here we show three tests. The first test is with the test set untouched in any way. The second test uses the same set of test files with file extension information removed and the third test also uses the same set of files but this time with file extensions scrambled so as to confuse the classifiers (e.g., a Java file may have a “.txt” extension and a Python file may have a “.java”) extension. 

The intuition behind scrambling or removing the file extensions in our test set is to assess the robustness of OctoLingua in classifying files when a key feature is removed or is misleading. A classifier that does not rely heavily on extension would be extremely useful to classify gists and snippets, since in those cases it is common for people not to provide accurate extension information (e.g., many code-related gists have a .txt extension).

The table below shows how OctoLingua maintains a good performance under various conditions, suggesting that the model learns primarily from the vocabulary of the code, rather than from meta information (i.e. file extension), whereas Linguist fails as soon as the information on file extensions is altered.

image
Figure 3: Performance of OctoLingua vs. Linguist on the same test set

 

Effect of removing file extension during training time

As mentioned earlier, during training time we removed a percentage of file extensions from our training data to encourage the model to learn from the vocabulary of the files. The table below shows the performance of our model with different fractions of file extensions removed during training time. 

image
Figure 4: Performance of OctoLingua with different percentage of file extensions removed on our three test variations

Notice that with no file extension removed during training time, the performance of OctoLingua on test files with no extensions and randomized extensions decreases considerably from that on the regular test data. On the other hand, when the model is trained on a dataset where some file extensions are removed, the model performance does not decline much on the modified test set. This confirms that removing the file extension from a fraction of files at training time induces our classifier to learn more from the vocabulary. It also shows that the file extension feature, while highly predictive, had a tendency to dominate and prevented more weights from being assigned to the content features. 

Supporting a new language

Adding a new language in OctoLingua is fairly straightforward. It starts with obtaining a bulk of files in the new language (we can do this programmatically as described in data sources). These files are split into a training and a test set and then run through our preprocessor and feature extractor. This new train and test set is added to our existing pool of training and testing data. The new testing set allows us to verify that the accuracy of our model remains acceptable.

image
Figure 5: Adding a new language with OctoLingua

Our plans

As of now, OctoLingua is at the “advanced prototyping stage”. Our language classification engine is already robust and reliable, but does not yet support all coding languages on our platform. Aside from broadening language support—which would be rather straightforward—we aim to enable language detection at various levels of granularity. Our current implementation already allows us, with a small modification to our machine learning engine, to classify code snippets. It wouldn’t be too far fetched to take the model to the stage where it can reliably detect and classify embedded languages. 

We are also contemplating the possibility of open sourcing our model and would love to hear from the community if you’re interested.

Summary

With OctoLingua, our goal is to provide a service that enables robust and reliable source code language detection at multiple levels of granularity, from file level or snippet level to potentially line-level language detection and classification. Eventually, this service can support, among others, code searchability, code sharing, language highlighting, and diff rendering—all of this aimed at supporting developers in their day to day development work in addition to helping them write quality code.  If you are interested in leveraging or contributing to our work, please feel free to get in touch on Twitter @github!

Authors

The post C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages appeared first on The GitHub Blog.

Atom editor is now faster

Post Syndicated from Rafael Oleza original https://github.blog/2019-06-12-atom-editor-is-now-faster/

The Atom Team improved a few of Atom’s most common features—they’re now dramatically faster and ready to help you be even more productive.

Fuzzy Finder

The fuzzy finder is one of the most popular features in Atom, and it’s used by nearly every user to quickly open files by name. We wanted to make it even better by speeding up the process. Atom version 1.38 introduces an experimental fast mode that brings drastic speed improvements to any project. With this mode, indexing a medium or large project is roughly six times faster than when using the standard mode.

Medium to large projects are crawled six times quicker using fast mode.

This change also made the process to display filtered results in Atom about 11 times faster. You’ll notice the difference in speed as soon as you start searching for a query.

Medium to large projects are filtered 11 times quicker using fast mode.

What’s next

Our goal is to make the new experimental mode the default option on Atom version 1.39 when it’s released next month. Until then, switch to this mode by clicking the try experimental fast mode option in the fuzzy finder.

The next Atom version 1.39 will also introduce performance improvements to the find and replace package. We’re adding a new mode to make searching for files between 10 and 20 times faster than the current mode.

Thank you

We’d like to thank the open source community since it has allowed us to build and improve Atom over the years. More specifically, we’ve made these performance improvements by building on the shoulders of two big giants:

Try out Atom.io

The post Atom editor is now faster appeared first on The GitHub Blog.

Direct instruction marking in Ruby 2.6

Post Syndicated from Aaron Patterson original https://github.blog/2019-06-04-direct-instruction-marking-in-ruby-2-6/

We recently upgraded GitHub to use the latest version of Ruby 2.6. Ruby 2.6 contains an optimization for reducing memory usage. We’ve found it to reduce the “post-boot live heap” by about 3 percent. The “post-boot live heap” are the objects still referenced and not garbage collected after booting our Rails application, but before accepting any requests.

Ruby’s virtual machine

MRI (Matz’s Ruby Implementation) uses a stack-based virtual machine.  Each instruction manipulates a stack, and that stack can be thought of as a sort of a “scratch space”. For example, the program 3 + 5 could be represented with the instruction sequences:

push 3
push 5
add

As the virtual machine executes, each instruction manipulates the stack:

Ruby VM Stack Manipulation

Abstract Syntax Trees in Ruby

Before the virtual machine has something to execute, the code must go through a few different processing phases. The code is tokenized, parsed, turned in to an AST (or Abstract Syntax Tree), and finally the AST is converted to byte code. The byte code is what the virtual machine will eventually execute.

Let’s look at an example Ruby program:

"hello" + "world"

This code is turned into an AST, which is a tree data structure. Each node in the tree is represented internally by an object called a T_NODE object. The tree for this code will look like this:

Abstract Synatax Tree of hello + world

Some of the T_NODE objects reference literals. In this case some T_NODE objects reference the string literals “hello” and “world”. Those string literals are allocated via Ruby’s Garbage Collector. They are just like any other string in Ruby, except that they happen to be allocated as the code is being parsed.

Translating the AST to instructions

Instructions and their operands are represented as integers and can only be represented as integers. Instruction sequences for a program are just a list of integers, and it’s up to the virtual machine to interpret the meaning of those integers.

The compilation process produces the list of integers that the virtual machine will interpret. To compile a program, we simply walk the tree translating nodes and their operands to integers.  The program "hello" + "world" will result in three instructions: two push operations and one add operation. The push instructions have "hello" and "world" as operands.

There are a fixed number of instructions, and we can represent each instruction with an integer. In this case, let’s use the number 7 for push and the number 9 for add. But how can we convert the strings “hello” and “world” to integers?

Half translation of AST to instructions

In order to represent these strings in the instruction sequences, the compiler will add the address of the Ruby object that represents each string. The virtual machine knows that the operand to the push instruction is actually the address of a Ruby object, and will act appropriately. This means that the final instructions will look like this:

Full translation of AST to instructions

After the AST is fully processed, it is thrown away and only the instruction sequences remain:

Instructions after AST is gone

 

Literal liveness in Ruby 2.5

Instruction Sequences are Ruby objects and are managed via Ruby’s garbage collector. As mentioned earlier, string literals are also Ruby objects and are managed by the garbage collector. If the string literals are not marked, they could be collected, and the instruction sequences would point to an invalid address.

To prevent these literal objects from being collected, Ruby 2.5 would maintain a “mark array”. The mark array is simply a Ruby array that contains references to all literals referenced for that set of instruction sequences:

InstructionSequences with Mark Array

Both the mark array and the instructions contain references to the string literals found in the code. But the instruction sequences depend on the mark array to keep the literals from being collected.

Literal liveness in Ruby 2.6+

Ruby 2.6 introduced a patch that eliminates this mark array. When instruction sequences are marked, rather than marking an array, it disassembles the instructions and marks instruction operands that were allocated via the garbage collector. This disassembly process means that the mark array can be completely eliminated:

Instruction Sequences without Mark Array

We found that this reduced the number of live objects in our heap by three percent after the application starts.

Performance

Of course, disassembling instructions is more expensive than iterating an array. We found that only 30 percent of instruction sequences actually contain references to objects that need marking. In order to prevent needless disassembly, instructions that contain objects allocated from Ruby’s garbage collector are flagged at compile time, and only those instruction sequences are disassembled during mark time. On top of this, instruction sequence objects typically become “old”. This means that thanks to Ruby’s generational garbage collector, they are examined very infrequently. As a result, we observed memory reduction with zero cost to throughput.

Have a good day!

The post Direct instruction marking in Ruby 2.6 appeared first on The GitHub Blog.

React Native in GrabPay

Post Syndicated from Grab Tech original https://engineering.grab.com/react-native-in-grabpay

Overview

It wasn’t too long ago that Grab formed a new team, GrabPay, to improve the cashless experience in Southeast Asia and to venture into the promising mobile payments arena. To support the work, Grab also decided to open a new R&D center in Bangalore.

It was an exciting journey for the team from the very beginning, as it gave us the opportunity to experiment with new cutting edge technologies. Our first release was the GrabPay Merchant App, the first all React Native Grab App. Its success gave us the confidence to use React Native to optimize the Grab PAX app.

React Native is an open source mobile application framework. It lets developers use React (a JavaScript library for building user interfaces) with native platform capabilities. Its two big advantages are:

  • We could make cross-platform mobile apps and components completely in JavaScript.
  • Its hot reloading feature significantly reduced development time.

This post describes our work on developing React Native components for Grab apps, the challenges faced during implementation, what we learned from other internal React Native projects, and our future roadmap.

Before embarking on our work with React Native, these were the goals we set out. We wanted to:

  • Have a reusable code between Android and iOS as well as across various Grab apps (Driver app, Merchant app, etc.).
  • Have a single codebase to minimize the effort needed to modify and maintain our code long term.
  • Match the performance and standards of existing Grab apps.
  • Use as few Engineering resources as possible.

Challenges

Many Grab teams located across Southeast Asia and in the United States support the App platform. It was hard to convince all of them to add React Native as a project dependency and write new feature code with React Native. In particular, having React Native dependency significantly increases a project’s binary’s size,

But the initial cost was worth it. We now have only a few modules, all written in React Native:

  • Express
  • Transaction History
  • Postpaid Billpay

As there is only one codebase instead of two, the modules take half the maintenance resources. Debugging is faster with React Native’s hot reloading. And it’s much easier and faster to implement one of our modules in another app, such as DAX.

Another challenge was creating a universally acceptable format for a bridging library to communicate between existing code and React Native modules. We had to define fixed guidelines to create new bridges and define communication protocols between React Native modules and existing code.

Invoking a module written in React Native from a Native Module (written in a standard computer language such as Swift or Kotlin) should follow certain guidelines. Once all Grab’s tech families reached consensus on solutions to these problems, we started making our bridges and doing the groundwork to use React Native.

Foundation

On the native side, we used the Grablet architecture to add our React Native modules. Grablet gave us a wonderful opportunity to scale our Grab platform so it could be used by any tech family to plug and play their module. And the module could be in any of  Native, React Native, Flutter, or Web.

We also created a framework encapsulating all the project’s React Native Binaries. This simplified the React Native Upgrade process. Dependencies for the framework are react, react-native, and react-native-event-bridge.

We had some internal proof of concept projects for determining React Native’s performance on different devices, as discussed here. Many teams helped us make an extensive set of JS bridges for React Native in Android and iOS. Oleksandr Prokofiev wrote this bridge creation example:

publicfinalclassDeviceKitModule: NSObject, RCTBridgeModule {
 privateletdeviceKit: DeviceKitService

 publicinit(deviceKit: DeviceKitService) {
   self.deviceKit = deviceKit
   super.init()
 }
 publicstaticfuncmoduleName() -> String {
   return"DeviceKitModule"
 }
 publicfuncmethodsToExport() -> [RCTBridgeMethod] {
   let methods: [RCTBridgeMethod?] = [
     buildGetDeviceID()
     ]
   return methods.compactMap { $0 }
 }

 privatefuncbuildGetDeviceID() -> BridgeMethodWrapper? {
   returnBridgeMethodWrapper("getDeviceID", { [weakself] (_: [Any], _, resolve) in
     letvalue = self?.deviceKit.getDeviceID()
     resolve(value)
   })
 }
}

GrabPay Components and React Native

The GrabPay Merchant App gave us a good foundation for React Native in terms of

  • Component libraries
  • Networking layer and api middleware
  • Real world data for internal assessment of performance and stability

We used this knowledge to build theTransaction History and GrabPay Digital Marketplace components inside the Grab Pax App with React Native.

Component Library

We selected particularly useful components from the Merchant App codebase such as GPText, GPTextInput, GPErrorView, and GPActivityIndicator. We expanded that selection to a common (internal) component library of approximately 20 stateless and stateful components.

API Calls

We used to make api calls using axios (now deprecated). We now make calls from the Native side using bridges that return a promise and make api calls using an existing framework. This helped us remove the dependency for getting an access token from  Native-Android or Native-iOS to make the calls. Also it helped us optimize the api requests, as suggested by Parashuram from Facebook’s React Native team.

Locale

We use React Localize Redux for all our translations and moment for our date time conversion as per the device’s current Locale. We currently support translation in five languages; English, Chinese Simplified, Bahasa Indonesia, Malay, and Vietnamese. This Swift code shows how we get the device’s current Locale from the native-react Native Bridge.

public func methodsToExport() -> [RCTBridgeMethod] {
   let methods: [RCTBridgeMethod?] =  [
     BridgeMethodWrapper("getLocaleIdentifier", { (_, _, resolver) in
     letlocaleIdentifier = self.locale.getLocaleIdentifier()
     resolver(localeIdentifier)
   })]
   return methods.compactMap { $0 }
 }

Redux

Redux is an extremely lightweight predictable state container that behaves consistently in every environment. We use Redux with React Native to manage its state.

For in-app navigation we use react-navigation. It is very flexible in adapting to both the Android and iOS navigation and gesture recognition styles.

End Product

After setting up our foundation bridges and porting skeleton boilerplate code from the GrabPay Merchant App, we wrote two payments modules using GrabPay Digital Marketplace (also known as BillPay), React Native, and Transaction History.

Grab app - Selecting a company

The ios Version is on the left and the Android version is on the right. Not only do their UIs look identical, but also their code is identical. A single codebase lets us debug faster, deliver quicker, and maintain smaller (codebase; apologies but it was necessary for the joke).

Grab app - Company selected

We launched BillPay first in Indonesia, then in Vietnam and Malaysia. So far, it’s been a very stable product with little to no downtime.

Transaction History started in Singapore and is now rolling out in other countries.

Flow For BillPay

BillPay Flow

The above shows BillPay’s flow.

  1. We start with the first screen, called Biller List. It shows all the postpaid billers available for the current region. For now, we show Billers based on which country the user is in. The user selects a biller.
  2. We then asks for your customerID (or prefills that value if you have paid your bill before). The amount is either fetched from the backend or filled in by the user, depending on the region and biller type.
  3. Next, the user confirms all the entered details before they pay the dues.
  4. Finally, the user sees their bill payment receipt. It comes directly from the biller, and so it’s a valid proof of payment.

Our React Native version has kept the same experience as our Native developed App and help users pay their bills seamlessly and hassle free.

Future

We are moving code to Typescript to reduce compile-time bugs and clean up our code. In addition to reducing native dependencies, we will refactor modules as needed. We will also have 100% unit test code coverage. But most importantly, we plan to open source our component library as soon as we feel it is stable.

Preventing Pipeline Calls from Crashing Redis Clusters

Post Syndicated from Grab Tech original https://engineering.grab.com/preventing-pipeline-calls-from-crashing-redis-clusters

Introduction

On Feb 15th, 2019, a slave node in Redis, an in-memory data structure storage, failed requiring a replacement. During this period, roughly only 1 in 21 calls to Apollo, a primary transport booking service, succeeded. This brought Grab rides down significantly for the one minute it took the Redis Cluster to self-recover. This behavior was totally unexpected and completely breached our intention of having multiple replicas.

This blog post describes Grab’s outage post-mortem findings.

Understanding the infrastructure

With Grab’s continuous growth, our services must handle large amounts of data traffic involving high processing power for reading and writing operations. To address this significant growth, reduce handler latency, and improve overall performance, many of our services use Redis – a common in-memory data structure storage – as a cache, database, or message broker. Furthermore, we use a Redis Cluster, a distributed implementation of Redis, for shorter latency and higher availability.

Apollo is our driver-side state machine. It is on almost all requests’ critical path and is a primary component for booking transport and providing great service for customer bookings. It stores individual driver availability in an AWS ElastiCache Redis Cluster, letting our booking service efficiently assign jobs to drivers. It’s critical to keep Apollo running and available 24/7.

Apollo's infrastructure

Because of Apollo’s significance, its Redis Cluster has 3 shards each with 2 slaves. It hashes all keys and, according to the hash value, divides them into three partitions. Each partition has two replications to increase reliability.

We use the Go-Redis client, a popular Redis library, to direct all written queries to the master nodes (which then write to their slaves) to ensure consistency with the database.

Master and slave nodes in the Redis Cluster

For reading related queries, engineers usually turn on the ReadOnly flag and turn off the RouteByLatency flag. These effectively turn on ReadOnlyFromSlaves in the Grab gredis3 library, so the client directs all reading queries to the slave nodes instead of the master nodes. This load distribution frees up master node CPU usage.

Client reading and writing from/to the Redis Cluster

When designing a system, we consider potential hardware outages and network issues. We also think of ways to ensure our Redis Cluster is highly efficient and available; setting the above-mentioned flags help us achieve these goals.

Ideally, this Redis Cluster configuration would not cause issues even if a master or slave node breaks. Apollo should still function smoothly. So, why did that February Apollo outage happen? Why did a single down slave node cause a 95+% call failure rate to the Redis Cluster during the dim-out time?

Let’s start by discussing how to construct a local Redis Cluster step by step, then try and replicate the outage. We’ll look at the reasons behind the outage and provide suggestions on how to use a Redis Cluster client in Go.

How to set up a local Redis Cluster

1. Download and install Redis from here.

2. Set up configuration files for each node. For example, in Apollo, we have 9 nodes, so we need to create 9 files like this with different port numbers(x).

// file_name: node_x.conf (do not include this line in file)

port 600x

cluster-enabled yes

cluster-config-file cluster-node-x.conf

cluster-node-timeout 5000

appendonly yes

appendfilename node-x.aof

dbfilename dump-x.rdb

3. Initiate each node in an individual terminal tab with:

$PATH/redis-4.0.9/src/redis-server node_1.conf

4. Use this Ruby script to create a Redis Cluster. (Each master has two slaves.)

$PATH/redis-4.0.9/src/redis-trib.rb create --replicas 2127.0.0.1:6001..... 127.0.0.1:6009

>>> Performing Cluster Check (using node 127.0.0.1:6001)

M: 7b4a5d9a421d45714e533618e4a2b3becc5f8913 127.0.0.1:6001

   slots:0-5460 (5461 slots) master

   2 additional replica(s)

S: 07272db642467a07d515367c677e3e3428b7b998 127.0.0.1:6007

   slots: (0 slots) slave

   replicates 05363c0ad70a2993db893434b9f61983a6fc0bf8

S: 65a9b839cd18dcae9b5c4f310b05af7627f2185b 127.0.0.1:6004

   slots: (0 slots) slave

   replicates 7b4a5d9a421d45714e533618e4a2b3becc5f8913

M: 05363c0ad70a2993db893434b9f61983a6fc0bf8 127.0.0.1:6003

   slots:10923-16383 (5461 slots) master

   2 additional replica(s)

S: a78586a7343be88393fe40498609734b787d3b01 127.0.0.1:6006

   slots: (0 slots) slave

   replicates 72306f44d3ffa773810c810cfdd53c856cfda893

S: e94c150d910997e90ea6f1100034af7e8b3e0cdf 127.0.0.1:6005

   slots: (0 slots) slave

   replicates 05363c0ad70a2993db893434b9f61983a6fc0bf8

M: 72306f44d3ffa773810c810cfdd53c856cfda893 127.0.0.1:6002

   slots:5461-10922 (5462 slots) master

   2 additional replica(s)

S: ac6ffbf25f48b1726fe8d5c4ac7597d07987bcd7 127.0.0.1:6009

   slots: (0 slots) slave

   replicates 7b4a5d9a421d45714e533618e4a2b3becc5f8913

S: bc56b2960018032d0707307725766ec81e7d43d9 127.0.0.1:6008

   slots: (0 slots) slave

   replicates 72306f44d3ffa773810c810cfdd53c856cfda893

[OK] All nodes agree about slots configuration.

5. Finally, we try to send queries to our Redis Cluster, e.g.

$PATH/redis-4.0.9/src/redis-cli -c -p 6001 hset driverID 100 state available updated_at 11111

What happens when nodes become unreachable?

Redis Cluster Server

As long as the majority of a Redis Cluster’s masters and at least one slave node for each unreachable master are reachable, the cluster is accessible. It can survive even if a few nodes fail.

Let’s say we have N masters, each with K slaves, and random T nodes become unreachable. This algorithm calculates the Redis Cluster failure rate percentage:

if T <= K:
        availability = 100%
else:
        availability = 100% - (1/(N*K - T))

If you successfully built your own Redis Cluster locally, try to kill any node with a simple command-c. The Redis Cluster broadcasts to all nodes that the killed node is now unreachable, so other nodes no longer direct traffic to that port.

If you bring this node back up, all nodes know it’s reachable again. If you kill a master node, the Redis Cluster promotes a slave node to a temp master for writing queries.

$PATH/redis-4.0.9/src/redis-server node_x.conf

With this information, we can’t answer the big question of why a single slave node failure caused an over 95% failure rate in the Apollo outage. Per the above theory, the Redis Cluster should still be 100% available. So, the Redis Cluster server could properly handle an outage, and we concluded it wasn’t the failure rate’s cause. So we looked at the client side and Apollo’s queries.

Golang Redis Cluster Client & Apollo Queries

Apollo’s client side is based on the Go-Redis Library.

During the Apollo outage, we found some code returned many errors during certain pipeline GET calls. When Apollo tried to send a pipeline of HMGET calls to its Redis Cluster, the pipeline returned errors.

First, we looked at the pipeline implementation code in the Go-Redis library. In the function defaultProcessPipeline, the code assigns each command to a Redis node in this line err:=c.mapCmdsByNode(cmds, cmdsMap).

func (c *ClusterClient) mapCmdsByNode(cmds []Cmder, cmdsMap *cmdsMap) error {
state, err := c.state.Get()
        if err != nil {
                setCmdsErr(cmds, err)
                returnerr
        }

        cmdsAreReadOnly := c.cmdsAreReadOnly(cmds)
        for_, cmd := range cmds {
                var node *clusterNode
                var err error
                if cmdsAreReadOnly {
                        _, node, err = c.cmdSlotAndNode(cmd)
                } else {
                        slot := c.cmdSlot(cmd)
                        node, err = state.slotMasterNode(slot)
                }
                if err != nil {
                        returnerr
                }
                cmdsMap.mu.Lock()
                cmdsMap.m[node] = append(cmdsMap.m[node], cmd)
                cmdsMap.mu.Unlock()
        }
        return nil
}

Next, since the readOnly flag is on, we look at the cmdSlotAndNode function. As mentioned earlier, you can get better performance by setting readOnlyFromSlaves to true, which sets RouteByLatency to false. By doing this, RouteByLatency will not take priority and the master does not receive the read commands.

func (c *ClusterClient) cmdSlotAndNode(cmd Cmder) (int, *clusterNode, error) {
        state, err := c.state.Get()
        if err != nil {
                return 0, nil, err
        }

        cmdInfo := c.cmdInfo(cmd.Name())
        slot := cmdSlot(cmd, cmdFirstKeyPos(cmd, cmdInfo))

        if c.opt.ReadOnly && cmdInfo != nil && cmdInfo.ReadOnly {
                if c.opt.RouteByLatency {
                        node, err:= state.slotClosestNode(slot)
                        return slot, node, err
                }

                if c.opt.RouteRandomly {
                        node:= state.slotRandomNode(slot)
                        return slot, node, nil
                }

                node, err:= state.slotSlaveNode(slot)
                return slot, node, err
        }

        node, err:= state.slotMasterNode(slot)
        return slot, node, err
}

Now, let’s try and better understand the outage.

  1. When a slave becomes unreachable, all commands assigned to that slave node fail.
  2. We found in Grab’s Redis library code that a single error in all cmds could cause the entire pipeline to fail.
  3. In addition, engineers return a failure in their code if err != nil. This explains the high failure rate during the outage.
func (w *goRedisWrapperImpl) getResultFromCommands(cmds []goredis.Cmder) ([]gredisapi.ReplyPair, error) {
        results := make([]gredisapi.ReplyPair, len(cmds))
        var err error
        for idx, cmd := range cmds {
                results[idx].Value, results[idx].Err = cmd.(*goredis.Cmd).Result()
                if results[idx].Err == goredis.Nil {
                        results[idx].Err = nil
                        continue
                }
                if err == nil && results[idx].Err != nil {
                        err = results[idx].Err
                }
        }

        return results, err
}

Our next question was, “Why did it take almost one minute for Apollo to recover?”.  The Redis Cluster broadcasts instantly to its other nodes when one node is unreachable. So we looked at how the client assigns jobs.

When the Redis Cluster client loads the node states, it only refreshes the state once a minute. So there’s a maximum one minute delay of state changes between the client and server. Within that minute, the Redis client kept sending queries to that unreachable slave node.

func (c *clusterStateHolder) Get() (*clusterState, error) {
        v := c.state.Load()
        if v != nil {
                state := v.(*clusterState)
                if time.Since(state.createdAt) > time.Minute {
                        c.LazyReload()
                }
                return state, nil
        }
        return c.Reload()
}

What happened to the write queries? Did we lose new data during that one min gap? That’s a very good question! The answer is no since all write queries only went to the master nodes and the Redis Cluster client with a watcher for the master nodes. So, whenever any master node becomes unreachable, the client is not oblivious to the change in state and is well aware of the current state. See the Watcher code.

How to use Go Redis safely?

Redis Cluster Client

One way to avoid a potential outage like our Apollo outage is to create another Redis Cluster client for pipelining only and with a true RouteByLatency value. The Redis Cluster determines the latency according to ping calls to its server.

In this case, all pipelining queries would read through the master nodesif the latency is less than 1ms (code), and as long as the majority side of partitions are alive, the client will get the expected results. More load would go to master with this setting, so be careful about CPU usage in the master nodes when you make the change.

Pipeline Usage

In some cases, the master nodes might not handle so much traffic. Another way to mitigate the impact of an outage is to check for  errors on individual queries when errors happen in a pipeline call.

In Grab’s Redis Cluster library, the function Pipeline(PipelineReadOnly) returns a response with an error for individual reply.

func (c *clientImpl) Pipeline(ctx context.Context, argsList [][]interface{}) ([]gredisapi.ReplyPair, error) {
        defer c.stats.Duration(statsPkgName, metricElapsed, time.Now(), c.getTags(tagFunctionPipeline)...)
        pipe := c.wrappedClient.Pipeline()
        cmds := make([]goredis.Cmder, len(argsList))
        for i, args := range argsList {
                cmd := goredis.NewCmd(args...)
                cmds[i] = cmd
                _ = pipe.Process(cmd)
        }
        _, _ = pipe.Exec()
        return c.wrappedClient.getResultFromCommands(cmds)
}

func (w *goRedisWrapperImpl) getResultFromCommands(cmds []goredis.Cmder) ([]gredisapi.ReplyPair, error) {
        results := make([]gredisapi.ReplyPair, len(cmds))
        var err error
        for idx, cmd := range cmds {
                results[idx].Value, results[idx].Err = cmd.(*goredis.Cmd).Result()
                if results[idx].Err == goredis.Nil {
                        results[idx].Err = nil
                        continue
                }
                if err == nil && results[idx].Err != nil {
                        err = results[idx].Err
                }
        }

        return results, err
}

type ReplyPair struct {
        Value interface{}
        Err   error
}

Instead of returning nil or an error message when err != nil, we could check for errors for each result so successful queries are not affected. This might have minimized the outage’s business impact.

Go Redis Cluster Library

One way to fix the Redis Cluster library is to reload nodes’ status when an error happens.In the go-redis library, defaultProcessor has this logic, which can be applied to defaultProcessPipeline.

In Conclusion

We’ve shown how to build a local Redis Cluster server, explained how Redis Clusters work, and identified its potential risks and solutions. Redis Cluster is a great tool to optimize service performance, but there are potential risks when using it. Please carefully consider our points about how to best use it. If you have any questions, please ask them in the comments section.

Loki, a dynamic mock server for HTTP/TCP testing

Post Syndicated from Grab Tech original https://engineering.grab.com/loki-dynamic-mock-server-http-tcp-testing

Background

In a previous article we introduced Mockers – an innovative tool for local box testing at Grab. Mockers used a Shift Left testing strategy, making testing more effective and cheaper for development teams. Mockers’ popularity and success motivated us to create Loki – a one-stop dynamic mock server for local box testing of mobile apps.

There are some unique challenges in mobile apps testing at Grab. End-to-end testing of an app is difficult due to high dependency on backend services and other apps. Staging environment, which hosts a plethora of backend services, is tough to manage and maintain. Issues such as staging downtime, configuration mismatches, and data corruption can affect staging adding to the testing woes. Moreover, our apps are fairly complex, utilizing multiple transport protocols such as HTTP, HTTPS, TCP for various business flows.

The business flows are also complex, requiring exhaustive set up such as credit card payments set up, location spoofing, etc resulting in high maintenance costs for automated testing. Loki simulates these flows and developers can easily test use cases that take longer to set up in a real backend staging.

Loki is our attempt to address challenges in mobile app testing by turning every developer local box into a full fledged pseudo backend environment where all mobile workflows can be tested without any external dependencies. It mocks backend services on developer local boxes, decoupling the mobile apps from real backend services, which provides several advantages such as:

No need to deploy frequently to staging

Testing is blocked if the app receives a bad response from staging. In these cases, code changes have to be deployed on staging to fix issues before resuming tests. In contrast, using Loki lets developers continue testing without any immediate need to deploy code changes to staging.

Allows parallel frontend and backend development

Loki acts as a mock backend service when the real backend is still evolving. It lets the frontend development run in parallel with backend development.

Overcome time limitations

In a one week regression-and-release scenario, testing time is limited. However, the application UI rendering and functionality still needs reasonable testing. Loki lets developers concentrate on testing in the available time instead of fixing dependencies on backend services.

Loki – Grab’s solution to simplify mobile apps testing

At Grab, we have multiple mobile apps that are dependent on each other. For example, our Passenger and Driver apps are two sides of a coin; the driver gets a job card only when a passenger requests a booking. These apps are developed by different teams, each with its own release cycle. This can make it tricky to confidently and repeatedly test the whole business flow across apps. Apps also depend on multiple backend services to execute a booking or food order and communicate over different protocols.

Here’s a look at how our mobile apps interact with backend services over different protocols:

Mobile app interaction with backend services

Loki is a dynamic mock server, written in Golang, running in a Docker container on the local box or in CI. It is easy to set up and run through standard Docker commands. In the context of mobile app testing, it plays the role of backend services, so you no longer need to set up an extensive staging environment.

The Loki architecture looks like this:

Loki architecture

The technical challenges we had to overcome

We wanted a comprehensive mocking solution so that teams don’t need to integrate multiple tools to achieve independent testing. It turned out that mocking TCP was most challenging because:

  • It is a long running client-server connection, and it doesn’t follow an HTTP-like request/response pattern.
  • Messages can be sent to the app without an incoming request as well, hence we had to expose a way via Loki to set a mock expectation which can send messages to the app without any request triggering it.
  • As TCP is a long running connection, we needed a way to delimit incoming requests so we know when we can truncate and deserialize the incoming request into JSON.

We engineered the Loki backend to support both HTTP and TCP protocols on different ports. Yet, the mock expectations are set up using RESTful APIs over HTTP for both protocols. A single point of entry for setting expectations made it more intuitive for our developers.

An in-memory cron implementation pushes scheduled messages to the app over a TCP connection. This enabled testing of complex use cases such as drivers getting new job cards, driver and passenger chat workflows, etc. The delimiter for TCP protocol is configurable at start up, so each team can decide when to truncate the request.

To enable Loki on our CI, we had to reduce its memory footprint. Hence, we built Loki with pluggable storages. MySQL is used when running on local and on CI we switch seamlessly to in-memory cache or Redis.

For testing apps locally, developers must validate complex use cases such as:

  • Payment related flows, which require the response to include the same payment ID as sent in the request. This is a case of simple mapping of request fields in the response JSON.

  • Flows requiring runtime logic execution. For example, a job card sent to a driver must have a valid timestamp, requiring runtime computation on Loki.

To support these cases and many more, we added JavaScript injection capability to Loki. So, when we set an expectation for an HTTP request/response pair or for TCP events, we can specify JavaScript for computing the dynamic response. This is executed in a sandbox by an in-house JS execution library.

Grab follows a transactional workflow for bookings. Over the life of a ride, bookings go through different statuses. So, Loki had to address multiple HTTP requests to the same endpoint returning different responses. This feature is required for successfully mocking a whole ride end-to-end.

Loki uses  an HTTP API “httpTimesAndOrder” for this feature. For example, using “httpTimesAndOrder”, you can configure the same status endpoint (/ride/status) to return different ride statuses such as “PICKING” for the first five requests, “IN_RIDE” for the next three requests, and so on.

Now, let’s look at how to use Loki to mock HTTP requests and TCP events.

Mocking HTTP requests

To mock HTTP requests, developers first point their app to send requests to the Loki mock server. Then, they set up expectations for all requests sent to the Loki mock server.

Loki mock server

For example, the Passenger app calls an HTTP dependency GET /closeby/drivers/ to get nearby drivers. To mock it with Loki, you set an expected response on the Loki mock server. When the GET /closeby/drivers/ request is actually made from the Passenger app, Loki returns the set response.

This snippet shows how to set an expected response for the GET /closeby/drivers/request:

Loki API: POST `/api/v1/expectations`

Request Body :

{
  "uriToMock": "/closeby/drivers",
  "method": "GET",
  "response": {
    "drivers": [
      1001,
      1002,
      1010
    ]
  }
}

Workflow for setting expectations and receiving responses

Workflow for setting expectations and receiving responses

Mocking TCP events

Developers point their app to Loki over a TCP connection and set up the TCP expectations. Loki then generates scheduled events such as sending push messages (job cards, notifications, etc) to the apps pointing at Loki.

For example, if the Driver app, after it starts, wants to get a job card, you can set an expectation in Loki to push a job card over the TCP connection to the Driver app after a scheduled time interval.

This snippet shows how to set the TCP expectation and schedule a push message:

Loki API: POST `/api/v1/tcp/expectations/pushmessage`

Request Body :

{
  "name": "samplePushMsg",
  "msgSequence": [
    {
      "messages": {
        "body": {
          "jobCardID": 1001
        }
      }
    },
    {
      "messages": {
        "body": {
          "jobCardID": 1002
        }
      }
    }
  ],
  "schedule": "@every 1m"
}

Workflow for scheduling a push message over TCP

Workflow for scheduling a push message over TCP

Some example use cases

Now that you know about Loki, let’s look at some example use cases.

Generating a custom response at runtime

Our first example is customizing a runtime response for both HTTP and TCP requests. This is helpful when developers need dynamic responses to requests. For example, you can add parameters from the request URL or request body to the runtime response.

It’s simple to implement this with a JavaScript function. Assume you want to embed a message parameter in the request URL to the response. To do this, you first use a POST method to set up the expectation (in JSON format) for the request on Loki:

Loki API: POST `/api/v1/feature/expectations`

Request Body :

{
  "expectations": [{
    "name": "Sample call",
    "desc": "v1/test/{name}",
    "tags": "v1/test/{name}",
    "resource": "/v1/test?name=user1",
    "verb": "POST",
    "response": {
      "body": "{ \"msg\": \"Hi \"}",
      "status": 200
    },
    "clientOptions": {
"javascript": "function main(req, resp) { var url = req.RequestURI; var captured = /name=([^&]+)/.exec(url)[1]; resp.msg =  captured ? resp.msg + captured : resp.msg + 'myDefaultValue'; return resp }"
    },
    "isActive": 1
  }]
}

When Loki receives the request, the JavaScript function used in the clientOptionskey, adds name to the response at runtime. For example, this is the request’s fixed response:

{
    "msg": "Hi "
}

But, after using the JavaScript function to add the URL parameter, the dynamic response is:

{
    "msg": "Hi user1"
}

Similarly, you can use JavaScript to add other dynamic responses such as modifying the response’s JSON array, adding parameters to push messages, etc.

Defining a response sequence for mocked API endpoints

Here’s another interesting example – defining the response sequence for API endpoints.

A response sequence is useful when you need different responses from the same API endpoint. For example, a status endpoint should return different ride statuses such as ‘allocating’, ‘allocated’, ‘picking’, etc. depending on the stage of a ride.

To do this, developers set up their HTTP expectations on Loki. Then, they easily define the response sequence for an API endpoint using a Loki POST method.

In this example:

  • times – specifies the number of times the same response is returned.
  • after – specifies one or more expectations that must match before a specified expectation is matched.

Here, the expectations are matched in this sequence when a request is made to an endpoint – Allocating > Allocated > Pickuser > Completed. Further, Completed is set to two times, so Loki returns this response two times.

Loki API: POST `/api/v1/feature/sequence`

Request Body :
  "httpTimesAndOrder": [
      {
          "name": "Allocating",
          "times": 1
      },
      {
          "name": "Allocated",
          "times": 1,
          "after": ["Allocating"]
      },
      {
          "name": "Pickuser",
          "times": 1,
          "after": ["Allocated"]
      },
      {
          "name": "Completed",
          "times": 2,
          "after": ["Pickuser"]
      }
  ]
}

In conclusion

Since Loki’s inception, we have set up a full range CI with proper end-to-end app UI tests and, to a great extent, decoupled our app releases from the staging backend. This improved delivery cycles, and we did faster bug catching and more exhaustive testing. Moreover, both developers and QAs can easily play with apps to perform exploratory testing as well as manual functional validations. Teams are also using Loki to run automated scripts (Espresso and XCUItests) for validating the mobile app pages.

Loki’s adoption is growing steadily at Grab. With our frequent release of new mobile app features, Loki helps teams meet our high quality bar and achieve huge productivity gains.

If you have any feedback or questions on Loki, please leave a comment.