Our Journey to Continuous Delivery at Grab (Part 2)

Post Syndicated from Grab Tech original https://engineering.grab.com/our-journey-to-continuous-delivery-at-grab-part2

In the first part of this blog post, you’ve read about the improvements made to our build and staging deployment process, and how plenty of manual tasks routinely taken by engineers have been automated with Conveyor: an in-house continuous delivery solution.

This new post begins with the introduction of the hermeticity principle for our deployments, and how it improves the confidence with promoting changes to production. Changes sent to production via Conveyor’s deployment pipelines are then described in detail.

Overview of Grab delivery process
Overview of Grab delivery process

Finally, looking back at the engineering efficiency improvements around velocity and reliability over the last 2 years, we answer the big question – was the investment on a custom continuous delivery solution like Conveyor the right decision for Grab?

Improving Confidence in our Production Deployments with Hermeticity

The term deployment hermeticity is borrowed from build systems. A build system is called hermetic if builds always produce the same artefacts regardless of changes in the environment they run on. Similarly, we call our deployments hermetic if they always result in the same deployed artefacts regardless of the environment’s change or the number of times they are executed.

The behaviour of a service is rarely controlled by a single variable. The application that makes up your service is an important driver of its behaviour, but its configuration is an important contributor, for example. The behaviour for traditional microservices at Grab is dictated mainly by 3 versioned artefacts: application code, static and dynamic configuration.

Conveyor has been integrated with the systems that operate changes in each of these parameters. By tracking all 3 parameters at every deployment, Conveyor can reproducibly deploy microservices with similar behaviour: its deployments are hermetic.

Building upon this property, Conveyor can ensure that all deployments made to production have been tested before with the same combination of parameters. This is valuable to us:

  • An outcome of staging deployments for a specific set of parameters is a good predictor of outcomes in production deployments for the same set of parameters and thus it makes testing in staging more relevant.
  • Rollbacks are hermetic; we never rollback to a combination of parameters that has not been used previously.

In the past, incidents had resulted from an application rollback not compatible with the current dynamic configuration version; this was aggravating since rollbacks are expected to be a safe recovery mechanism. The introduction of hermetic deployments has largely eliminated this category of problems.

Hermeticity is maintained by registering the deployment parameters as artefacts after each successfully completed pipeline. Users must then select one of the registered deployment metadata to promote to production.

At this point, you might be wondering: why not use a single pipeline that includes both staging and production deployments? This was indeed how it started, with a single pipeline spanning multiple environments. However, engineers soon complained about it.

The most obvious reason for the complaint was that less than 20% of changes deployed in staging will make their way to production. This meant that engineers would have toil associated with each completed staging deployment since the pipeline must be manually cancelled rather than continued to production.

The other reason is that this multi-environment pipeline approach reduced flexibility when promoting changes to production. There are different ways to apply changes to a cluster. For example, lengthy pipelines that refresh instances can be used to deploy any combination of changes, while there are quicker pipelines restricted to dynamic configuration changes (such as feature flags rollouts). Regardless of the order in which the changes are made and how they are applied, Conveyor tracks the change.

Eventually, engineers promote a deployment artefact to production. However they do not need to apply changes in the same sequence with which were applied to staging. Furthermore, to prevent erroneous actions, Conveyor presents only changes that can be applied with the requested pipeline (and sometimes, no changes are available). Not being forced into a specific method of deploying changes is one of added benefits of hermetic deployments.

Returning to Our Journey Towards Engineering Efficiency

If you can recall, the first part of this blog post series ended with a description of staging deployment. Our deployment to production starts with a verification that we uphold our hermeticity principle, as explained above.

Our production deployment pipelines can run for several hours for large clusters with rolling releases (few run for days), so we start by acquiring locks to ensure there are no concurrent deployments for any given cluster.

Before making any changes to the environment, we automatically generate release notes, giving engineers a chance to abort if the wrong set of changes are sent to production.

The pipeline next waits for a deployment slot. Early on, engineers adopted deployment windows that coincide with office hours, such that if anything goes wrong, there is always someone on hand to help. Prior to the introduction of Conveyor, however, engineers would manually ask a Slack bot for approval. This interaction is now automated, and the only remaining action left is for the engineer to approve that the deployment can proceed via a single click, in line with our hands-off deployment principle.

When the canary is in production, Conveyor automates monitoring it. This process is similar to the one already discussed in the first part of this blog post: Engineers can configure a set of alerts that Conveyor will keep track of. As soon as any one of the alerts is triggered, Conveyor automatically rolls back the service.

If no alert is raised for the duration of the monitoring period, Conveyor waits again for a deployment slot. It then publishes the release notes for that deployment and completes the deployments for the cluster. After the lock is released and the deployment registered, the pipeline finally comes to its successful completion.

Benefits of Our Journey Towards Engineering Efficiency

All these improvements made over the last 2 years have reduced the effort spent by engineers on deployment while also reducing the failure rate of our deployments.

If you are an engineer working on DevOps in your organisation, you know how hard it can be to measure the impact you made on your organisation. To estimate the time saved by our pipelines, we can model the activities that were previously done manually with a rudimentary weighted graph. In this graph, each edge carries a probability of the activity being performed (100% when unspecified), while each vertex carries the time taken for that activity.

Focusing on our regular staging deployments only, such a graph would look like this:

The overall amount of effort automated by the staging pipelines () is represented in the graph above. It can be converted into the equation below:

This equation shows that for each staging deployment, around 16 minutes of work have been saved. Similarly, for regular production deployments, we find that 67 minutes of work were saved for each deployment:

Moreover, efficiency was not the only benefit brought by the use of deployment pipelines for our traditional microservices. Surprisingly perhaps, the rate of failures related to production changes is progressively reducing while the amount of production changes that were made with Conveyor increased across the organisation (starting at 1.5% of failures per deployments, and finishing at 0.3% on average over the last 3 months for the period of data collected):

Keep Calm and Automate

Since the first draft for this post was written, we’ve made many more improvements to our pipelines. We’ve begun automating Database Migrations; we’ve extended our set of hermetic variables to Amazon Machine Image (AMI) updates; and we’re working towards supporting container deployments.

Through automation, all of Conveyor’s deployment pipelines have contributed to save more than 5,000 man-days of efforts in 2020 alone, across all supported teams. That’s around 20 man-years worth of effort, which is around 3 times the capacity of the team working on the project! Investments in our automation pipelines have more than paid for themselves, and the gains go up every year as more workflows are automated and more teams are onboarded.

If Conveyor has saved efforts for engineering teams, has it then helped to improve velocity? I had opened the first part of this blog post with figures on the deployment funnel for microservice teams at Grab, towards the end of 2018. So where do the figures stand today for these teams?

In the span of 2 years, the average number of build and staging deployment performed each day has not varied much. However, in the last 3 months of 2020, engineers have sent twice more changes to production than they did for the same period in 2018.

Perhaps the biggest recognition received by the team working on the project, was from Grab’s engineers themselves. In the 2020 internal NPS survey for engineering experience at Grab, Conveyor received the highest score of any tools (built in-house or not).


All these improvements in efficiency for our engineers would never have been possible without the hard work of all team members involved in the project, past and present: Tanun Chalermsinsuwan, Aufar Gilbran, Deepak Ramakrishnaiah, Repon Kumar Roy (Kowshik), Su Han, Voislav Dimitrijevikj, Stanley Goh, Htet Aung Shine, Evan Sebastian, Qijia Wang, Oscar Ng, Jacob Sunny, Subhodip Mandal and many others who have contributed and collaborated with them.


Join Us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

The Syslog Hell

Post Syndicated from Bozho original https://techblog.bozho.net/the-syslog-hell/

Syslog. You’ve probably heard about that, especially if you are into monitoring or security. Syslog is perceived to be the common, unified way that systems can send logs to other systems. Linux supports syslog, many network and security appliances support syslog as a way to share their logs. On the other side, a syslog server is receiving all syslog messages. It sounds great in theory – having a simple, common way to represent logs messages and send them across systems.

Reality can’t be further from that. Syslog is not one thing – there are multiple “standards”, and each of those is implemented incorrectly more often than not. Many vendors have their own way of representing data, and it’s all a big mess.

First, the RFCs. There are two RFCs – RFC3164 (“old” or “BSD” syslog) and RFC5424 (the new variant that obsoletes 3164). RFC3164 is not a standard, while RFC5424 is (mostly).

Those RFCs concern the contents of a syslog message. Then there’s RFC6587 which is about transmitting a syslog message over TCP. It’s also not a standard, but rather “an observation”. Syslog is usually transmitted over UDP, so fitting it into TCP requires some extra considerations. Now add TLS on top of that as well.

Then there are content formats. RFC5424 defines a key-value structure, but RFC 3164 does not – everything after the syslog header is just a non-structured message string. So many custom formats exist. For example firewall vendors tend to define their own message formats. At least they are often documented (e.g. check WatchGuard and SonicWall), but parsing them requires a lot of custom knowledge about that vendor’s choices. Sometimes the documentation doesn’t fully reflect the reality, though.

Instead of vendor-specific formats, there are also de-facto standards like CEF and the less popular LEEF. They define a structure of the message and are actually syslog-independent (you can write CEF/LEEF to a file). But when syslog is used for transmitting CEF/LEEF, the message should respect RFC3164.

And now comes the “fun” part – incorrect implementations. Many vendors don’t really respect those documents. They come up with their own variations of even the simplest things like a syslog header. Date formats are all over the place, hosts are sometimes missing, priority is sometimes missing, non-host identifiers are used in place of hosts, colons are placed frivolously.

Parsing all of that mess is extremely “hacky”, with tons of regexes trying to account for all vendor quirks. I’m working on a SIEM, and our collector is open source – you can check our syslog package. Some vendor-specific parsers are yet missing, but we are adding new ones constantly. The date formats in the CEF parser tell a good story.

If it were just two RFCs with one de-facto message format standard for one of them and a few option for TCP/UDP transmission, that would be fine. But what makes things hell is the fact that too many vendors decided not to care about what is in the RFCs, they decided that “hey, putting a year there is just fine” even though the RFC says “no”, that they don’t really need to set a host in the header, and that they didn’t really need to implement anything new after their initial legacy stuff was created.

Too many vendors (of various security and non-security software) came up with their own way of essentially representing key-value pairs, too many vendors thought their date format is the right one, too many vendors didn’t take the time to upgrade their logging facility in the past 12 years.

Unfortunately that’s representative of our industry (yes, xkcd). Someone somewhere stitches something together and then decades later we have an incomprehensible patchwork of stringly-typed, randomly formatted stuff flying around whatever socket it finds suitable. And it’s never the right time and the right priority to clean things up, to get up to date, to align with others in the field. We, as an industry (both security and IT in general) are creating a mess out of everything. Yes, the world is complex, and technology is complex as well. Our job is to make it all palpable, abstracted away, simplified and standardized. And we are doing the opposite.

The post The Syslog Hell appeared first on Bozho's tech blog.

Ексклузивни снимки и видео Илчовски пред „Биволъ“: Караха и Цеко Минев да продаде ПИБ

Post Syndicated from Николай Марченко original https://bivol.bg/%D0%B8%D0%BB%D1%87%D0%BE%D0%B2%D1%81%D0%BA%D0%B8-%D0%BF%D1%80%D0%B5%D0%B4-%D0%B1%D0%B8%D0%B2%D0%BE%D0%BB%D1%8A-%D0%BA%D0%B0%D1%80%D0%B0%D1%85%D0%B0-%D0%B8-%D1%86%D0%B5%D0%BA%D0%BE.html

събота 8 май 2021


Банкерът Цеко Минев е бил принуждаван да продаде „Първа Инвестиционна Банка“ АД на Оманския фонд, инвестирал по-рано в Корпоративна търговска банка АД. Това твърдение застъпи пред „Биволъ“ земеделският производител Светослав…

LibInjection – Detect SQL Injection (SQLi) and Cross-Site Scripting (XSS)

Post Syndicated from original https://www.darknet.org.uk/2021/05/libinjection-detect-sql-injection-sqli-and-cross-site-scripting-xss/?utm_source=rss&utm_medium=social&utm_campaign=darknetfeed

LibInjection – Detect SQL Injection (SQLi) and Cross-Site Scripting (XSS)

LibInjection is a C library to Detect SQL Injection (SQLi) and Cross-Site Scripting (XSS) through lexical analysis of real-world Attacks.

SQLi and other injection attacks remain the top OWASP and CERT vulnerability. Current detection attempts frequently involve a myriad of regular expressions which are not only brittle and error-prone but also proven by Hanson and Patterson at Black Hat 2005 to never be a complete solution. LibInjection is a new open-source C library that detects SQLi using lexical analysis.

Read the rest of LibInjection – Detect SQL Injection (SQLi) and Cross-Site Scripting (XSS) now! Only available at Darknet.

More doorbell adventures

Post Syndicated from original https://mjg59.dreamwidth.org/56917.html

Back in my last post on this topic, I’d got shell on my doorbell but hadn’t figured out why the HTTP callbacks weren’t always firing. I still haven’t, but I have learned some more things.

Doorbird sell a chime, a network connected device that is signalled by the doorbell when someone pushes a button. It costs about $150, which seems excessive, but would solve my problem (ie, that if someone pushes the doorbell and I’m not paying attention to my phone, I miss it entirely). But given a shell on the doorbell, how hard could it be to figure out how to mimic the behaviour of one?

Configuration for the doorbell is all stored under /mnt/flash, and there’s a bunch of files prefixed 1000eyes that contain config (1000eyes is the German company that seems to be behind Doorbird). One of these was called 1000eyes.peripherals, which seemed like a good starting point. The initial contents were {"Peripherals":[]}, so it seemed likely that it was intended to be JSON. Unfortunately, since I had no access to any of the peripherals, I had no idea what the format was. I threw the main application into Ghidra and found a function that had debug statements referencing “initPeripherals and read a bunch of JSON keys out of the file, so I could simply look at the keys it referenced and write out a file based on that. I did so, and it didn’t work – the app stubbornly refused to believe that there were any defined peripherals. The check that was failing was pcVar4 = strstr(local_50[0],PTR_s_"type":"_0007c980);, which made no sense, since I very definitely had a type key in there. And then I read it more closely. strstr() wasn’t being asked to look for "type":, it was being asked to look for "type":". I’d left a space between the : and the opening ” in the value, which meant it wasn’t matching. The rest of the function seems to call an actual JSON parser, so I have no idea why it doesn’t just use that for this part as well, but deleting the space and restarting the service meant it now believed I had a peripheral attached.

The mobile app that’s used for configuring the doorbell now showed a device in the peripherals tab, but it had a weird corrupted name. Tapping it resulted in an error telling me that the device was unavailable, and on the doorbell itself generated a log message showing it was trying to reach a device with the hostname bha-04f0212c5cca and (unsurprisingly) failing. The hostname was being generated from the MAC address field in the peripherals file and was presumably supposed to be resolved using mDNS, but for now I just threw a static entry in /etc/hosts pointing at my Home Assistant device. That was enough to show that when I opened the app the doorbell was trying to call a CGI script called peripherals.cgi on my fake chime. When that failed, it called out to the cloud API to ask it to ask the chime[1] instead. Since the cloud was completely unaware of my fake device, this didn’t work either. I hacked together a simple server using Python’s HTTPServer and was able to return data (another block of JSON). This got me to the point where the app would now let me get to the chime config, but would then immediately exit. adb logcat showed a traceback in the app caused by a failed assertion due to a missing key in the JSON, so I ran the app through jadx, found the assertion and from there figured out what keys I needed. Once that was done, the app opened the config page just fine.

Unfortunately, though, I couldn’t edit the config. Whenever I hit “save” the app would tell me that the peripheral wasn’t responding. This was strange, since the doorbell wasn’t even trying to hit my fake chime. It turned out that the app was making a CGI call to the doorbell, and the thread handling that call was segfaulting just after reading the peripheral config file. This suggested that the format of my JSON was probably wrong and that the doorbell was not handling that gracefully, but trying to figure out what the format should actually be didn’t seem easy and none of my attempts improved things.

So, new approach. Rather than writing the config myself, why not let the doorbell do it? I should be able to use the genuine pairing process if I could mimic the chime sufficiently well. Hitting the “add” button in the app asked me for the username and password for the chime, so I typed in something random in the expected format (six characters followed by four zeroes) and a sufficiently long password and hit ok. A few seconds later it told me it couldn’t find the device, which wasn’t unexpected. What was a little more unexpected was that the log on the doorbell was showing it trying to hit another bha-prefixed hostname (and, obviously, failing). The hostname contains the MAC address, but I hadn’t told the doorbell the MAC address of the chime, just its username. Some more digging showed that the doorbell was calling out to the cloud API, giving it the 6 character prefix from the username and getting a MAC address back. Doing the same myself revealed that there was a straightforward mapping from the prefix to the mac address – changing the final character from “a” to “b” incremented the MAC by one. It’s actually just a base 26 encoding of the MAC, with aaaaaa translating to 00408C000000.

That explained how the hostname was being generated, and in return I was able to work backwards to figure out which username I should use to generate the hostname I was already using. Attempting to add it now resulted in the doorbell making another CGI call to my fake chime in order to query its feature set, and by mocking that up as well I was able to send back a file containing X-Intercom-Type, X-Intercom-TypeId and X-Intercom-Class fields that made the doorbell happy. I now had a valid JSON file, which cleared up a couple of mysteries. The corrupt name was because the name field isn’t supposed to be ASCII – it’s base64 encoded UTF16-BE. And the reason I hadn’t been able to figure out the JSON format correctly was because it looked something like this:

{"Peripherals":[]{"prefix":{"type":"DoorChime","name":"AEQAbwBvAHIAYwBoAGkAbQBlACAAVABlAHMAdA==","mac":"04f0212c5cca","user":"username","password":"password"}}]}

Note that there’s a total of one [ in this file, but two ]s? Awesome. Anyway, I could now modify the config in the app and hit save, and the doorbell would then call out to my fake chime to push config to it. Weirdly, the association between the chime and a specific button on the doorbell is only stored on the chime, not on the doorbell. Further, hitting the doorbell didn’t result in any more HTTP traffic to my fake chime. However, it did result in some broadcast UDP traffic being generated. Searching for the port number led me to the Doorbird LAN API and a complete description of the format and encryption mechanism in use. Argon2I is used to turn the first five characters of the chime’s password (which is also stored on the doorbell itself) into a 256-bit key, and this is used with ChaCha20 to decrypt the payload. The payload then contains a six character field describing the device sending the event, and then another field describing the event itself. Some more scrappy Python and I could pick up these packets and decrypt them, which showed that they were being sent whenever any event occurred on the doorbell. This explained why there was no storage of the button/chime association on the doorbell itself – the doorbell sends packets for all events, and the chime is responsible for deciding whether to act on them or not.

On closer examination, it turns out that these packets aren’t just sent if there’s a configured chime. One is sent for each configured user, avoiding the need for a cloud round trip if your phone is on the same network as the doorbell at the time. There was literally no need for me to mimic the chime at all, suitable events were already being sent.

Still. There’s a fair amount of WTFery here, ranging from the strstr() based JSON parsing, the invalid JSON, the symmetric encryption that uses device passwords as the key (requiring the doorbell to be aware of the chime’s password) and the use of only the first five characters of the password as input to the KDF. It doesn’t give me a great deal of confidence in the rest of the device’s security, so I’m going to keep playing.

[1] This seems to be to handle the case where the chime isn’t on the same network as the doorbell

comment count unavailable comments

GitHub Availability Report: April 2021

Post Syndicated from Keith Ballinger original https://github.blog/2021-05-05-github-availability-report-april-2021/

In April, we experienced two incidents resulting in significant impact and degraded state of availability for API requests and the GitHub Packages service, specifically the GitHub Packages Container registry service.

April 1 21:30 UTC (lasting one hour and 34 minutes)

This incident was caused by failures in our DNS resolution, resulting in a degraded state of availability for the GitHub Packages Container registry service. During this incident, some of our internal services that support the Container registry experienced intermittent failures when trying to connect to dependent services. The inability to resolve requests to these services resulted in users being unable to push new container images to the Container registry as well as pull existing images. The Container registry is currently in a public beta, and only beta users were impacted during this incident. The broader GitHub Packages service remained unaffected.

As a next step, we are looking at increasing the cache times of our DNS resolutions to decrease the impact of intermittent DNS resolution failures in the future.

April 22 17:16 UTC (lasting 53 minutes)

Our service monitors detected an elevated error rate when using API requests, which resulted in a degraded state of availability for repository creation. Upon further investigation of this incident, we identified the issue was caused by a bug from a recent data migration. In a data migration to isolate our secret scanning tables into their own cluster, a bug was discovered that broke the ability of the application to successfully write to the secret scanning database. The incident revealed a hitherto unknown dependency that repository creation had upon secret scanning, which makes a call for every repository created. Due to this dependency, repository creation was blocked until we were able to roll back the data migration.

As next steps, we are actively working with our vendor to update the data migration tool and have amended our migration process to include revised steps for remediation, in case similar incidents occur. Furthermore, our application code has been updated to remove the dependency on secret scanning for the creation of repositories.

In summary

From scaling the GitHub API to improving large monorepo performance, we will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To learn more about what we’re working on, check out the GitHub engineering blog.

Netflix Drive

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/netflix-drive-a607538c3055

A file and folder interface for Netflix Cloud Services

Written by Vikram Krishnamurthy, Kishore Kasi, Abhishek Kapatkar, Tejas Chopra, Prudhviraj Karumanchi, Kelsey Francis, Shailesh Birari

In this post, we are introducing Netflix Drive, a Cloud drive for media assets and providing a high level overview of some of its features and interfaces. We intend this to be a first post in a series of posts covering Netflix Drive. In the future posts, we will do an architectural deep dive into the several components of Netflix Drive.

Netflix, and particularly Studio applications (and Studio in the Cloud) produce petabytes of data backed by billions of media assets. Several artists and workflows that may be globally distributed, work on different projects, and each of these projects produce content that forms a part of the large corpus of assets.

Here is an example of globally distributed production where several artists and workflows work in conjunction to create and share assets for one or many projects.

Fig 1: Globally distributed production with artists working on different assets from different parts of the world

There are workflows in which these artists may want to view a subset of these assets from this large dataset, for example, pertaining to a specific project. These artists may want to create personal workspaces and work on generating intermediate assets. To support such use cases, access control at the user workspace and project workspace granularity is extremely important for presenting a globally consistent view of pertinent data to these artists.

Netflix Drive aims to solve this problem of exposing different namespaces and attaching appropriate access control to help build a scalable, performant, globally distributed platform for storing and retrieving pertinent assets.

Netflix Drive is envisioned to be a Cloud Drive for Studio and Media applications and lends itself to be a generic paved path solution for all content in Netflix.

It exposes a file/folder interface for applications to save their data and an API interface for control operations. Netflix Drive relies on a data store that will be the persistent storage layer for assets, and a metadata store which will provide a relevant mapping from the file system hierarchy to the data store entities. The major pieces, as shown in Fig. 2, are the file system interface, the API interface, and the metadata and data stores. We will delve into these in the following sections.

Fig 2: Netflix Drive components

File interface for Netflix Drive

Creative applications such as Nuke, Maya, Adobe Photoshop store and retrieve content using files and folders. Netflix Drive relies on FUSE (File System In User Space) to provide POSIX files and folders interface to such applications. A FUSE based POSIX interface provides feature customization elasticity, deployment configuration flexibility as well as a standard and seamless file/folder interface. A similar user space abstraction is available for Windows (WinFSP) and MacOS (MacFUSE)

The operations that originate from user, application and system actions on files and folders translate to a well defined set of function and system calls which are forwarded by the Linux Virtual File System Layer (or a pass-through/filter driver in Windows) to the FUSE layer in user space. The resulting metadata and data operations will be implemented by appropriate metadata and data adapters in Netflix Drive.

Fig 3: POSIX interface of Netflix Drive

The POSIX files and folders interface for Netflix Drive is designed as a layered system with the FUSE implementation hooks forming the top layer. This layer will provide entry points for all of the relevant VFS calls that will be implemented. Netflix Drive contains an abstraction layer below FUSE which allows different metadata and data stores to be plugged into the architecture by having their corresponding adapters implement the interface. We will discuss more about the layered architecture in the section below.

API Interface for Netflix Drive

Along with exposing a file interface which will be a hub of all abstractions, Netflix Drive also exposes API and Polled Task interfaces to allow applications and workflow tools to trigger control operations in Netflix Drive.

For example, applications can explicitly use REST endpoints to publish files stored in Netflix Drive to cloud, and later use a REST endpoint to retrieve a subset of the published files from cloud. The API interface can also be used to track the transfers of large files and allows other applications to be built on top of Netflix Drive.

Fig 4: Control interface of Netflix Drive

The Polled Task interface allows studio and media workflow orchestrators to post or dispatch tasks to Netflix Drive instances on disparate workstations or containers. This allows Netflix Drive to be bootstrapped with an empty namespace when the workstation comes up and dynamically project a specific set of assets relevant to the artists’ work sessions or workflow stages. Further these assets can be projected into a namespace of the artist’s or application’s choosing.

Alternatively, workstations/containers can be launched with the assets of interest prefetched at startup. These allow artists and applications to obtain a workstation which already contains relevant files and optionally add and delete asset trees during the work session. For example, artists perform transformative work on files, and use Netflix Drive to store/fetch intermediate results as well as the final copy which can be transformed back into a media asset.

Bootstrapping Netflix Drive

Given the two different modes in which applications can interact with Netflix Drive, now let us discuss how Netflix Drive is bootstrapped.

On startup, Netflix Drive expects a manifest that contains information about the data store, metadata store, and credentials (tied to a user login) to form an instance of namespace hierarchy. A Netflix Drive mount point may contain multiple Netflix Drive namespaces.

A dynamic instance allows Netflix Drive to show a user-selected and user-accessible subset of data from a large corpus of assets. A user instance allows it to act like a Cloud Drive, where users can work on content which is automatically synced in the background periodically to Cloud. On restart on a new machine, the same files and folders will be prefetched from the cloud. We will cover the different namespaces of Netflix Drive in more detail in a subsequent blog post.

Here is an example of a typical bootstrap manifest file.

This image shows a bootstrap manifest json which highlights how Netflix Drive can work with different metadata stores (such as Redis, CockroachDB), and data stores (such as Ceph, S3) and tie them together to provide persistence layer for assets
A sample manifest file.

The manifest is a persistent artifact which renders a user workstation its Netflix Drive personality. It survives instance failures and is able to recreate the same stateful interface on any newly deployed instance.

Metadata and Data Store Abstractions

In order to allow a variety of different metadata stores and data stores to be easily plugged into the architecture, Netflix Drive exposes abstract interfaces for both metadata and data stores. Here is a high level diagram explaining the different layers of abstractions in Netflix Drive

Fig 5: Layered architecture of Netflix Drive

Metadata Store Characteristics

Each file in Netflix Drive would have one or many corresponding metadata nodes, corresponding to different versions of the file. The file system hierarchy would be modeled as a tree in the metadata store where the root node is the top level folder for the application.

Each metadata node will contain several attributes, such as checksum of the file, location of the data, user permissions to access data, file metadata such as size, modification time, etc. A metadata node may also provide support for extended attributes which can be used to model ACLs, symbolic links, or other expressive file system constructs.

Metadata Store may also expose the concept of workspaces, where each user/application can have several workspaces, and can share workspaces with other users/applications. These are higher level constructs that are very useful to Studio applications.

Data Store Characteristics

Netflix Drive relies on a data store that allows streaming bytes into files/objects persisted on the storage media. The data store should expose APIs that allow Netflix Drive to perform I/O operations. The transfer mechanism for transport of bytes is a function of the data store.

In the first manifestation, Netflix Drive is using an object store (such as Amazon S3) as a data store. In order to expose file store-like properties, there were some changes needed in the object store. Each file can be stored as one or more objects. For Studio applications, file sizes may exceed the maximum object size for Cloud Storage, and so, the data store service should have the ability to store multiple parts of a file as separate objects. It is the responsibility of the data store service to tie these objects to a single file and inform the metadata store of the single unique Id for these several object parts. This Data store internally implements the chunking of file into several parts, encrypting of the content, and life cycle management of the data.

Multi-tiered architecture

Netflix Drive allows multiple data stores to be a part of the same installation via its bootstrap manifest.

Fig 6: Multiple data stores of Netflix Drive

Some studio applications such as encoding and transcoding have different I/O characteristics than a typical cloud drive.

Most of the data produced by these applications is ephemeral in nature, and is read often initially. The final encoded copy needs to be persisted and the ephemeral data can be deleted. To serve such applications, Netflix Drive can persist the ephemeral data in storage tiers which are closer to the application that allow lower read latencies and better economies for read request, since cloud storage reads incur an egress cost. Finally, once the encoded copy is prepared, this copy can be persisted by Netflix Drive to a persistent storage tier in the cloud. A single data store may also choose to archive some subset of content stored in cheaper alternatives.

Security

Studio applications require strict adherence to security models where only users or applications with specific permissions should be allowed to access specific assets. Security is one of the cornerstones of Netflix Drive design. Netflix Drive dynamic namespace design allows an artist or workflow to access only a small subset of the assets based on the workspace information and access control and is one of the benefits of using Netflix Drive in Studio workflows. Netflix Drive encapsulates the authentication and authorization models in its metadata store. These are translated into POSIX ACLs in Netflix Drive. In the future, Netflix Drive can allow more expressive ACLs by leveraging extended attributes associated with Metadata nodes corresponding to an asset.

Netflix Drive is currently being used by several Studio teams as the paved path solution for working with assets and is integrated with several media suite applications. As of today, Netflix Drive can be installed on CentOS, MacOS and Windows. In the future blog posts, we will cover implementation details, learnings, performance analysis of Netflix Drive, and some of the applications and workflows built on top of Netflix Drive.

If you are passionate about building Storage and Infrastructure solutions for Netflix Data Platform, we are always looking for talented engineers and managers. Please check out our job listings


Netflix Drive was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Какво ще се промени с нови избори?

Post Syndicated from Bozho original https://blog.bozho.net/blog/3733

Аргументът „защо пак нови избори, нищо няма да се промени“ се среща често. Срещаше се и 2013г, когато протестирахме срещу кабинета на БСП и ДПС. И в двата случая, това е аргумент на тези, които се притесняват от избори. И е неверен по няколко причини.

  • Вече е ясно, че ГЕРБ няма да управлява – на предишните избори не беше ясно – много хора мислеха, че ще има пак някой трик, някое тайно съглашение по нощите и че щяхме да осъмнем с кабинет Борисов 4 или в най-добрия случай ГЕРБ 4 с Борисов, раздаващ задачи от „зайчарника“. Партиите в парламента удържаха обещанието си и без да формират управленска коалиция помежду си, не дадоха власт на ГЕРБ. Това е много значително, защото хората, които гласуват по принцип за победителя и/или за това да има управление, вече нямат причина да гласуват за ГЕРБ, защото те са в изолация и е ясно, че няма да управляват. Това води със себе си освен директно загубените гласове, и косвено загубените от корпоративните мрежи, които са чакали държавни пари по линия на управлението и затова са насочвали ресурс към ГЕРБ – ако те нямат такива гаранции, няма причина да наливат контролираните от тях гласове в празната каца на ГЕРБ. Не казвам, че ще мигрират към други партии, но може да се снишат.
  • Опозиционните партии показаха принципност и конструктивност – Демократична България, Има такъв народ и Изправи се! показаха, че макар да имат фундаментални различия по много теми, могат да бъдат конструктивни по общи цели (т.нар. „тематични мнозинства“) и че не ги влече управлението на всяка цена. При ДБ и ИТН (слагам ДБ на първо място, защото съм част от тази коалиция) това според мен ще донесе допълнителни гласове. ИСМВ имат проблеми с кохерентността на групата си, които могат да нулират позитивите от тези няколко седмици.
  • Промените в Избирателния кодекс ще донесат макар и малка промяна на баланса – към по-честни избори. Повече хора ще имат достъп до това да упражняват правото си на глас (дали в чужбина или след улеснена регистрация по настоящ адрес), а новата ЦИК няма да саботира изборния процес като предходната. Ще отпаднат и изкривяванията, получени заради недействителни бюлетини (дали нарочно направени невалидни или заради леко излизане на химикала от квадратчето). Ще отпаднат грешките при броенето. Няма да могат да бъдат снимани бюлетини и разнасяни бюлетини в „българска нишка“, така че тези, с купен и контролиран вот ще имат теоретичната възможност да го упражнят по-свободно. Всичко това на куп не е фундаментална разлика, защото изборните правила и механики сами по себе си не променят значително нещата, но разлика ще има и тя ще е към по-добро. Към изборния процес добавяме и факта, че вече новите партии ще имат представители в секционните комисии навсякъде и ще могат не просто като наблюдатели и застъпници да гарантират за честността на изборите.
  • Служебният кабинет ще бъде фактор. Не бих възлагал големи надежди който и да е служебен кабинет да свърши твърде много неща, но е ясно настроението на Радев за възпиране на инструментите на ГЕРБ – дори ако МВР си свърши работата и реално противодейства на купуването на гласове, ще е достатъчна промяна. А има хипотетична възможност да направи и повече от това.
  • Летният сезон – избори през лятото и след дълга пандемична зима няма начин да не се отрази на активността. Как ще се разпредели този ефект по различните партии ще видим, но със сигурност за немалко хора финалът на европейското първенство по футбол, гледан с бира на морето, ще е по-интересен от изборна нощ със скучни анализатори. Но от партиите зависи да покажат, че гласуването на тези избори ще бъде дори по-важно от на предходните.

Какви ще бъдат резултатите – не мога да предскажа. Бих допуснал спад при ГЕРБ, покачване при Демократична България и Има такъв народ. Разбира се, това зависи от работата в кампанията – няма нищо дадено и гарантирано. Обединение на патриотичните останки, което обаче далеч не им гарантира прескачане на бариерата. Дали ще може да се сформира управление и какво ще бъде то е твърде рано да се каже, но в тази кампания партиите ще трябва да покажат готовността си да управляват и яснота на управленската си визия. И от това ще зависи колко точно ръст ще имат и колко точно ще отслабне ГЕРБ.

Но резултатът от следващите избори ще бъде различен, ще доведе до различна конфигурация и различни възможности за управление. Какви ще бъдат те – ще кажат избирателите.

Материалът Какво ще се промени с нови избори? е публикуван за пръв път на БЛОГодаря.

The collective thoughts of the interwebz

By continuing to use the site, you agree to the use of cookies. more information

The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.

Close