Tag Archives: open source

Open-Sourcing Metaflow, a Human-Centric Framework for Data Science

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/open-sourcing-metaflow-a-human-centric-framework-for-data-science-fa72e04a5d9?source=rss----2615bd06b42e---4

by David Berg, Ravi Kiran Chirravuri, Romain Cledat, Savin Goyal, Ferras Hamad, Ville Tuulos

tl;dr Metaflow is now open-source! Get started at metaflow.org.

Netflix applies data science to hundreds of use cases across the company, including optimizing content delivery and video encoding. Data scientists at Netflix relish our culture that empowers them to work autonomously and use their judgment to solve problems independently. We want our data scientists to be curious and take smart risks that have the potential for high business impact.

About two years ago, we, at our newly formed Machine Learning Infrastructure team started asking our data scientists a question: “What is the hardest thing for you as a data scientist at Netflix?” We were expecting to hear answers related to large-scale data and models, and maybe issues related to modern GPUs. Instead, we heard stories about projects where getting the first version to production took surprisingly long — mainly because of mundane reasons related to software engineering. We heard many stories about difficulties related to data access and basic data processing. We sat in meetings where data scientists discussed with their stakeholders how to best version different versions of their models without impacting production. We saw how excited data scientists were about modern off-the-shelf machine learning libraries, but we also witnessed various issues caused by these libraries when they were casually included as dependencies in production workflows.

We realized that nearly everything that data scientists wanted to do was already doable technically, but nothing was easy enough. Our job as a Machine Learning Infrastructure team would therefore not be mainly about enabling new technical feats. Instead, we should make common operations so easy that data scientists would not even realize that they were difficult before. We would focus our energy solely on improving data scientist productivity by being fanatically human-centric.

How could we improve the quality of life for data scientists? The following picture started emerging:

Our data scientists love the freedom of being able to choose the best modeling approach for their project. They know that feature engineering is critical for many models, so they want to stay in control of model inputs and feature engineering logic. In many cases, data scientists are quite eager to own their own models in production, since it allows them to troubleshoot and iterate the models faster.

On the other hand, very few data scientists feel strongly about the nature of the data warehouse, the compute platform that trains and scores their models, or the workflow scheduler. Preferably, from their point of view, these foundational components should “just work”. If, and when they fail, the error messages should be clear and understandable in the context of their work.

A key observation was that most of our data scientists had nothing against writing Python code. In fact, plain-and-simple Python is quickly becoming the lingua franca of data science, so using Python is preferable to domain specific languages. Data scientists want to retain their freedom to use arbitrary, idiomatic Python code to express their business logic — like they would do in a Jupyter notebook. However, they don’t want to spend too much time thinking about object hierarchies, packaging issues, or dealing with obscure APIs unrelated to their work. The infrastructure should allow them to exercise their freedom as data scientists but it should provide enough guardrails and scaffolding, so they don’t have to worry about software architecture too much.

Introducing Metaflow

These observations motivated Metaflow, our human-centric framework for data science. Over the past two years, Metaflow has been used internally at Netflix to build and manage hundreds of data-science projects from natural language processing to operations research.

By design, Metaflow is a deceptively simple Python library:

Data scientists can structure their workflow as a Directed Acyclic Graph of steps, as depicted above. The steps can be arbitrary Python code. In this hypothetical example, the flow trains two versions of a model in parallel and chooses the one with the highest score.

On the surface, this doesn’t seem like much. There are many existing frameworks, such as Apache Airflow or Luigi, which allow execution of DAGs consisting of arbitrary Python code. The devil is in the many carefully designed details of Metaflow: for instance, note how in the above example data and models are stored as normal Python instance variables. They work even if the code is executed on a distributed compute platform, which Metaflow supports by default, thanks to Metaflow’s built-in content-addressed artifact store. In many other frameworks, loading and storing of artifacts is left as an exercise for the user, which forces them to decide what should and should not be persisted. Metaflow removes this cognitive overhead.

Metaflow is packed with human-centric details like this, all of which aim at boosting data scientist productivity. For a comprehensive overview of all features of Metaflow, take a look at our documentation at docs.metaflow.org.

Metaflow on Amazon Web Services

Netflix’s data warehouse contains hundreds of petabytes of data. While a typical machine learning workflow running on Metaflow touches only a small shard of this warehouse, it can still process terabytes of data.

Metaflow is a cloud-native framework. It leverages elasticity of the cloud by design — both for compute and storage. Netflix has been one of the largest users of Amazon Web Services (AWS) for many years and we have accumulated plenty of operational experience and expertise in dealing with the cloud, AWS in particular. For the open-source release, we partnered with AWS to provide a seamless integration between Metaflow and various AWS services.

Metaflow comes with built-in capability to snapshot all code and data in Amazon S3 automatically, which is a key value proposition of our internal Metaflow setup. This provides us with a comprehensive solution for versioning and experiment tracking without any user intervention, which is core to any production-grade machine learning infrastructure.

In addition, Metaflow comes bundled with a high-performance S3 client, which can load data up to 10Gbps. This client has been massively popular amongst our users, who can now load data into their workflows an order of magnitude faster than before, enabling faster iteration cycles.

For general purpose data processing, Metaflow integrates with AWS Batch, which is a managed, container-based compute platform provided by AWS. The user can benefit from infinitely scalable compute clusters by adding a single line in their code: @batch. For training machine learning models, besides writing their own functions, the user has the choice to use AWS Sagemaker, which provides high-performance implementations of various models, many of which support distributed training.

Metaflow supports all common off-the-shelf machine learning frameworks through our @conda decorator, which allows the user to specify external dependencies for their steps safely. The @conda decorator freezes the execution environment, providing good guarantees of reproducibility, both when executed locally as well as in the cloud.

For more details, read this page about Metaflow’s integration with AWS.

From Prototype To Production

Out of the box, Metaflow provides a first-class local development experience. It allows data scientists to develop and test code quickly on your laptop, similar to any Python script. If your workflow supports parallelism, Metaflow takes advantage of all CPU cores available on your development machine.

We encourage our users to deploy their workflows to production as soon as possible. In our case, “production” means a highly available, centralized DAG scheduler, Meson, where users can export their Metaflow runs for execution with a single command. This allows them to start testing their workflow with regularly updating data quickly, which is a highly effective way to surface bugs and issues in the model. Since Meson is not available in open-source, we are working on providing a similar integration to AWS Step Functions, which is a highly available workflow scheduler.

In a complex business environment like Netflix’s, there are many ways to consume the results of a data science workflow. Often, the final results are written to a table, to be consumed by a dashboard. Sometimes, the resulting model is deployed as a microservice to support real-time inferencing. It is also common to chain workflows so that the results of a workflow are consumed by another. Metaflow supports all these modalities, although some of these features are not yet available in the open-source version.

When it comes to inspecting the results, Metaflow comes with a notebook-friendly client API. Most of our data scientists are heavy users of Jupyter notebooks, so we decided to focus our UI efforts on a seamless integration with notebooks, instead of providing a one-size-fits-all Metaflow UI. Our data scientists can build custom model UIs in notebooks, fetching artifacts from Metaflow, which provide just the right information about each model. A similar experience is available with AWS Sagemaker notebooks with open-source Metaflow.

Get Started With Metaflow

Metaflow has been eagerly adopted inside of Netflix, and today, we are making Metaflow available as an open-source project.

We hope that our vision of data scientist autonomy and productivity resonates outside Netflix as well. We welcome you to try Metaflow, start using it in your organization, and participate in its development.

You can find the project home page at metaflow.org and the code at github.com/Netflix/metaflow. Metaflow is comprehensively documented at docs.metaflow.org. The quickest way to get started is to follow our tutorial. If you want to learn more before getting your hands dirty, you can watch presentations about Metaflow at the high level or dig deeper into the internals of Metaflow.

If you have any questions, thoughts, or comments about Metaflow, you can find us at Metaflow chat room or you can reach us by email at [email protected]. We are eager to hear from you!


Open-Sourcing Metaflow, a Human-Centric Framework for Data Science was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

New – Amazon Managed Apache Cassandra Service (MCS)

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-managed-apache-cassandra-service-mcs/

Managing databases at scale is never easy. One of the options to store, retrieve, and manage large amounts of structured data, including key-value and tabular formats, is Apache Cassandra. With Cassandra, you can use the expressive Cassandra Query Language (CQL) to build applications quickly.

However, managing large Cassandra clusters can be difficult and takes a lot of time. You need specialized expertise to set up, configure, and maintain the underlying infrastructure, and have a deep understanding of the entire application stack, including the Apache Cassandra open source software. You need to add or remove nodes manually, rebalancing partitions, and doing so while keeping your application available with the required performance. Talking with customers, we found out that they often keep their clusters scaled up for peak load because scaling down is complex. To keep your Cassandra cluster updated, you have to do it node by node. It’s hard to backup and restore a cluster if something goes wrong during an update, and you may end up skipping patches or running an outdated version.

Introducing Amazon Managed Cassandra Service
Today, we are launching in open preview Amazon Managed Apache Cassandra Service (MCS), a scalable, highly available, and managed Apache Cassandra-compatible database service. Amazon MCS is serverless, so you pay for only the resources you use and the service automatically scales tables up and down in response to application traffic. You can build applications that serve thousands of requests per second with virtually unlimited throughput and storage.

With Amazon MCS, you can run your Cassandra workloads on AWS using the same Cassandra application code and developer tools that you use today. Amazon MCS implements the Apache Cassandra version 3.11 CQL API, allowing you to use the code and drivers that you already have in your applications. Updating your application is as easy as changing the endpoint to the one in the Amazon MCS service table.

Amazon MCS provides consistent single-digit-millisecond read and write performance at any scale, so you can build applications with low latency to provide a smooth user experience. You have visibility into how your application is performing using Amazon CloudWatch.

There is no limit on the size of a table or the number of items, and you do not need to provision storage. Data storage is fully managed and highly available. Your table data is replicated automatically three times across multiple AWS Availability Zones for durability.

All customer data is encrypted at rest by default. You can use encryption keys stored in AWS Key Management Service (KMS)Amazon MCS is also integrated with AWS Identity and Access Management (IAM) to help you manage access to your tables and data.

Using Amazon Managed Cassandra Service
You can use Amazon MCS with the console, CQL, or existing Apache 2.0 licensed Cassandra drivers. In the console there is a CQL editor, or you can connect using cqlsh. 

To connect using cqlsh, I need to generate service-specific credentials for an existing IAM user. This is just a command using the AWS Command Line Interface (CLI):

aws iam create-service-specific-credential --user-name USERNAME --service-name mcs.amazonaws.com

{
    "ServiceSpecificCredential": {
        "CreateDate": "2019-11-27T14:36:16Z",
        "ServiceName": "mcs.amazonaws.com",
        "ServiceUserName": "USERNAME-at-123412341234",
        "ServicePassword": "...",
        "ServiceSpecificCredentialId": "...",
        "UserName": "USERNAME",
        "Status": "Active"
    }
}

Amazon MCS only accepts secure connections using TLS.  I download the Amazon root certificate and edit the cqlshrc configuration file to use it. Now, I can connect with:

cqlsh {endpoint} {port} -u {ServiceUserName} -p {ServicePassword} --ssl

First, I create a keyspace. A keyspace contains one or more tables and defines the replication strategy for all the tables it contains. With Amazon MCS the default replication strategy for all keyspaces is the Single-region strategy. It replicates data 3 times across multiple Availability Zones in a single AWS Region.

To create a keyspace I can use the console or CQL. In the Amazon MCS console, I provide the name for the keyspace.

Similarly, I can use CQL to create the bookstore keyspace:

CREATE KEYSPACE IF NOT EXISTS bookstore WITH REPLICATION={'class': 'SingleRegionStrategy'};

Now I create a table. A table is where your data is organized and stored. Again, I can use the console or CQL. From the console, I select the bookstore keyspace and give the table a name.

Below that, I add the columns for my books table. Each row in a table is referenced by a primary key, that can be composed of one or more columns, the values of which determine which partition the data is stored in. In my case the primary key is the ISBN. Optionally, I can add clustering columns, which determine the sort order of records within a partition. I am not using clustering columns for this table.

Alternatively, using CQL, I can create the table with the following commands:

USE bookstore;

CREATE TABLE IF NOT EXISTS books
(isbn text PRIMARY KEY,
title text,
author text,
pages int,
year_of_publication int);

I now use CQL to insert a record in the books table:

INSERT INTO books (isbn, title, author, pages, year_of_publication)
VALUES ('978-0201896831',
'The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)',
'Donald E. Knuth', 672, 1997);

Let’s run a quick query. In the console, I select the books table and then Query table.

In the CQL Editor, I use the default query and select Run command.

By default, I see the result of the query in table view:

If I prefer, I can see the result in JSON format, similar to what an application using the Cassandra API would see:

To insert more records, I use csqlsh again and upload some data from a local CSV file:

COPY books (isbn, title, author, pages, year_of_publication)
FROM './books.csv' WITH delimiter=',' AND header=TRUE;

Now I look again at the content of the books table:

SELECT * FROM books;

I can select a row using a primary key, or use filtering for additional conditions. For example:

SELECT title FROM books WHERE isbn='978-1942788713';

SELECT title FROM books WHERE author='Scott Page' ALLOW FILTERING;

With Amazon MCS you can use existing Apache Cassandra 2.0–licensed drivers and developer tools. Open-source Cassandra drivers are available for Java, Python, Ruby, .NET, Node.js, PHP, C++, Perl, and Go.

You can learn more in the Amazon MCS documentation.

Available in Open Preview
Amazon MCS is available today in open preview in US East (N. Virginia), US East (Ohio), Europe (Stockholm), Asia Pacific (Singapore), Asia Pacific (Tokyo).

As we work with the Cassandra API libraries, we are contributing bug fixes to the open source Apache Cassandra project. We are also contributing back improvements such as built-in support for AWS authentication (SigV4), which simplifies managing credentials for customers running Cassandra on Amazon Elastic Compute Cloud (EC2), since EC2 and IAM can handle distribution and management of credentials using instance roles automatically. We are also announcing the funding of AWS promotional service credits for testing Cassandra-related open-source projects. To learn more about these contributions, visit the Open Source blog.

During the preview, you can use Amazon MCS with on-demand capacity. At general availability, we will also offer the option to use provisioned throughput for more predictable workloads. With on-demand capacity mode, Amazon MCS charges you based on the amount of data your applications read and write from your tables. You do not need to specify how much read and write throughput capacity to provision to your tables because Amazon MCS accommodates your workloads instantly as they scale up or down.

As part of the AWS Free Tier, you can get started with Amazon MCS for free. For the first three months, you are offered a monthly free tier of 30 million write request units, 30 million read request units, and 1 GB of storage. Your free tier starts when you create your first Amazon MCS resource.

Next year we are making it easier to migrate your data Amazon MCS, adding support to use AWS Database Migration Service.

Amazon MCS makes it easy to use Cassandra workloads at any scale, providing a simple programming interface to build new applications, or migrate existing ones. I can’t wait to see what are you going to use it for!

Growth Monitor pi: an open monitoring system for plant science

Post Syndicated from Helen Lynn original https://www.raspberrypi.org/blog/growth-monitor-pi-an-open-monitoring-system-for-plant-science/

Plant scientists and agronomists use growth chambers to provide consistent growing conditions for the plants they study. This reduces confounding variables – inconsistent temperature or light levels, for example – that could render the results of their experiments less meaningful. To make sure that conditions really are consistent both within and between growth chambers, which minimises experimental bias and ensures that experiments are reproducible, it’s helpful to monitor and record environmental variables in the chambers.

A neat grid of small leafy plants on a black plastic tray. Metal housing and tubing is visible to the sides.

Arabidopsis thaliana in a growth chamber on the International Space Station. Many experimental plants are less well monitored than these ones.
(“Arabidopsis thaliana plants […]” by Rawpixel Ltd (original by NASA) / CC BY 2.0)

In a recent paper in Applications in Plant Sciences, Brandin Grindstaff and colleagues at the universities of Missouri and Arizona describe how they developed Growth Monitor pi, or GMpi: an affordable growth chamber monitor that provides wider functionality than other devices. As well as sensing growth conditions, it sends the gathered data to cloud storage, captures images, and generates alerts to inform scientists when conditions drift outside of an acceptable range.

The authors emphasise – and we heartily agree – that you don’t need expertise with software and computing to build, use, and adapt a system like this. They’ve written a detailed protocol and made available all the necessary software for any researcher to build GMpi, and they note that commercial solutions with similar functionality range in price from $10,000 to $1,000,000 – something of an incentive to give the DIY approach a go.

GMpi uses a Raspberry Pi Model 3B+, to which are connected temperature-humidity and light sensors from our friends at Adafruit, as well as a Raspberry Pi Camera Module.

The team used open-source app Rclone to upload sensor data to a cloud service, choosing Google Drive since it’s available for free. To alert users when growing conditions fall outside of a set range, they use the incoming webhooks app to generate notifications in a Slack channel. Sensor operation, data gathering, and remote monitoring are supported by a combination of software that’s available for free from the open-source community and software the authors developed themselves. Their package GMPi_Pack is available on GitHub.

With a bill of materials amounting to something in the region of $200, GMpi is another excellent example of affordable, accessible, customisable open labware that’s available to researchers and students. If you want to find out how to build GMpi for your lab, or just for your greenhouse, Affordable remote monitoring of plant growth in facilities using Raspberry Pi computers by Brandin et al. is available on PubMed Central, and it includes appendices with clear and detailed set-up instructions for the whole system.

The post Growth Monitor pi: an open monitoring system for plant science appeared first on Raspberry Pi.

Getting started with Git and GitHub is easier than ever with GitHub Desktop 2.2

Post Syndicated from Amanda Pinsker original https://github.blog/2019-10-02-get-started-easier-with-github-desktop-2-2/

Anyone who uses Git knows that it has a steep learning curve. We’ve learned from developers that most people tend to learn from a buddy, whether that’s a coworker, a professor, a friend, or even a YouTube video. In GitHub Desktop 2.2, we’re releasing the first version of an interactive Git and GitHub tutorial that can be your buddy and help you get started. If you’re new to Desktop, you can download and try out the tutorial at desktop.github.com.

Get set up

To get set up, we help you through two major pieces: creating a repository and connecting an editor. When you first open Desktop, a welcome page appears with a new option to “Create a Tutorial Repository”. Starting with this option creates a tutorial repository that guides you through the core concepts of working with Git using GitHub Desktop.

There are a lot of tools you need to get started with Git and GitHub. The most important of these is your code editor. In the first step of the tutorial, you’re prompted to install an editor if you don’t have one already.

Learn the GitHub flow

Next, we guide you through how to use GitHub Desktop to make changes to code locally and get your work on GitHub. You’ll create a new branch, make a change to a file, commit it, push it to GitHub, and open your first pull request.

We’ve also heard that new users initially experience confusion between Git, GitHub, and GitHub Desktop. We cover these differences in the tutorial and make sure to reinforce the explanations.

Keep going with your own project

In GitHub Desktop 1.6, we introduced suggested next steps based on the state of your repository. Now when you complete the tutorial, we similarly suggest next steps: exploring projects on GitHub that you might want to contribute to, creating a new project, or adding an existing project to Desktop. We always want GitHub Desktop to be the tool that makes your next steps clear, whether you’re in the flow of your work, or you’re a new developer just getting started.

What’s next?

With GitHub Desktop 2.2, we’re making the product our users love more approachable to newcomers. We’ll be iterating on the tutorial based on your feedback, and we’ll continue to build on the connection between GitHub and your local machine. If you want to start building something but don’t know how, think of GitHub Desktop as your buddy to help you get started.

Learn more about GitHub Desktop

The post Getting started with Git and GitHub is easier than ever with GitHub Desktop 2.2 appeared first on The GitHub Blog.

Accelerating the GitHub Sponsors beta with Stripe Connect

Post Syndicated from Katie Delfin original https://github.blog/2019-09-10-accelerating-the-github-sponsors-beta/

Update: GitHub Sponsors is available in every country where GitHub does business, not just the 30 countries supported by Stripe Connect. We’ll continue to use our existing manual payout system for anyone outside of Connect’s list of currently supported countries.


Back in May, we announced GitHub Sponsors, a new way to support the developers who build and maintain the open source software you use every day. We launched in beta as early as possible so we could work closely with the community to address feedback and understand their needs before adding more developers to the program. Today, we’re taking the next step in accelerating the availability of GitHub Sponsors through a new streamlined onboarding and payment experience with Stripe Connect.

The early days of manual payments

Within hours of announcing GitHub Sponsors, thousands of people signed up for the waitlist from all over the world. It was awesome to see so much excitement for the program, but we knew this meant we had a lot of work ahead of us. 

We started the beta with a small group of sponsored developers, and every few weeks, we invited more into the program. Everything was a manual process—setting up account information, verifying identity, waiting for approval, running reports, and processing payouts across multiple departments. Due to the manual nature of the process, we were limited to a small number of people in the program until we could automate and streamline our operations.

Onboarding more developers, faster

At GitHub, we do business with companies of all sizes, from Fortune 50 companies to early stage startups from all over the world. Collecting revenue globally is something we’ve done for years, but with GitHub Sponsors, it’s not just about receiving money—it’s also about paying it out. This means providing a seamless and secure experience for sponsored developers to verify their identities, enter banking information, receive funds, and manage payouts. We also know that, for many maintainers, this will be the first time they’ve been financially rewarded for their contributions to open source. We want to ensure the process is easy for developers to complete, so they can spend more time doing what they love, like contributing to open source.

Stripe Connect Express offers a simple, low overhead onboarding experience that complies with web accessibility guidelines, coupled with the ability to localize the experience to simplify payouts in countries with unique tax laws and compliance regulations. And by not building our own onboarding solution, we’re able to save months of engineering and maintenance time, and start onboarding more sponsored developers today.

Localization

One of Stripe Connect’s features that’s helped us deliver a great experience to our users is their localization support. Stripe just released international support for 30 countries (with more to come) for Connect Express. With this expanded support, Stripe takes care of adjusting for localized rules and regulations for supported countries—no small feat. We couldn’t be more thrilled to be one of the first to offer support across all 30 countries to serve our global community.

Our maintainer onboarding process can now start to keep up with the growing demand of the program. With Stripe, users can onboard using Connect Express, reducing onboarding time from a week-long process to under five minutes.  

And there’s much more to come. While we’re just under four months into the program, we’re focused on making GitHub Sponsors available to even more developers around the world.

Sign up to join the beta—you’ll hear from us soon.

Learn more about GitHub Sponsors

The post Accelerating the GitHub Sponsors beta with Stripe Connect appeared first on The GitHub Blog.

Running GitHub on Rails 6.0

Post Syndicated from Eileen M. Uchitelle original https://github.blog/2019-09-09-running-github-on-rails-6-0/

On August 26, 2019, the GitHub application was deployed to production with 100 percent of traffic on the newest Rails version: 6.0. This change came just 1.5 weeks after the final release of Rails 6.0. Rails upgrades aren’t always something companies announce, but looking back at GitHub’s history of being on a custom fork of Rails 3.2, this upgrade is a big deal. It represents how far we’ve come in the last few years, and the hard work and dedication from our upgrade team made it smoother, easier, and faster than any of our previous upgrades.

At GitHub, we have a lot to celebrate with the release of Rails 6.0 and the subsequent production deploy. First, we were more involved in this release than we have been in any previous release of Rails. GitHub engineers sent over 100 pull requests to Rails 6.0 to improve documentation, fix bugs, add features, and speed up performance. For many GitHub contributors, this was the first time sending changes to the Rails framework, demonstrating that upgrading Rails not only helps GitHub internally, but also improves our developer community as well.

Second, we deployed Rails 6.0 to production without any negative impact to customers—we had only one Rails 6.0 exception occur during testing, and it was hit by a bot! We were able to achieve this level of stability for the upgrade because we were heavily involved with its development. As soon as we finished the Rails 5.2 upgrade last year, we started upgrading our application to Rails 6.0.

Instead of waiting for the final release, we’d upgrade every week by pulling in the latest changes from Rails master and run all of our tests against that new version. This allowed us to find regressions quickly and early—often finding regressions in Rails master just hours after they were introduced. Upgrading weekly made it easy to find where these regressions were introduced since we were bisecting Rails with only a week’s worth of commits instead of more than a year of commits. Once our build for Rails 6.0 was green, we’d merge the pull request to master, and all new code that went into GitHub would need to pass in Rails 5.2 and the newest master build of Rails. Upgrading every week worked so well that we’ll continue using this process for upgrading from 6.0 to 6.1.

In addition to ensuring that Rails 6.0 was stable, we also contributed to the new features of the framework like parallel testing and multiple databases. The code for these tools is used in our production application every day—it’s well-tested, GitHub-approved code in a public, open source framework. By upstreaming this tooling, we’re able to reduce complexity in our code base and set a standard where so many companies once had to implement this functionality on their own.

There are so many wins to staying upgraded that go beyond more security, faster performance, and new features. By staying current with Rails master, we’re influencing the future of the framework to meet our needs and giving back to the open source community in big ways. This process means that the GitHub code base evolves alongside Rails instead of in response to Rails. Investing in our application by staying up to date with the Rails framework has had a tremendous positive effect on our code base and engineering teams. Staying current allows us to invest in our community, invest in our tools for the long term, and improve the experience of working with the GitHub code base for our engineers.

Keep a look out for all the great stuff we’ll be contributing to Rails 6.1 and beyond—this is just the beginning.

The post Running GitHub on Rails 6.0 appeared first on The GitHub Blog.

Re-Architecting the Video Gatekeeper

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/re-architecting-the-video-gatekeeper-f7b0ac2f6b00?source=rss----2615bd06b42e---4

By Drew Koszewnik

This is the story about how the Content Setup Engineering team used Hollow, a Netflix OSS technology, to re-architect and simplify an essential component in our content pipeline — delivering a large amount of business value in the process.

The Context

Each movie and show on the Netflix service is carefully curated to ensure an optimal viewing experience. The team responsible for this curation is Title Operations. Title Operations will confirm, among other things:

  • We are in compliance with the contracts — date ranges and places where we can show a video are set up correctly for each title
  • Video with captions, subtitles, and secondary audio “dub” assets are sourced, translated, and made available to the right populations around the world
  • Title name and synopsis are available and translated
  • The appropriate maturity ratings are available for each country

When a title meets all of the minimum above requirements, then it is allowed to go live on the service. Gatekeeper is the system at Netflix responsible for evaluating the “liveness” of videos and assets on the site. A title doesn’t become visible to members until Gatekeeper approves it — and if it can’t validate the setup, then it will assist Title Operations by pointing out what’s missing from the baseline customer experience.

Gatekeeper accomplishes its prescribed task by aggregating data from multiple upstream systems, applying some business logic, then producing an output detailing the status of each video in each country.

The Tech

Hollow, an OSS technology we released a few years ago, has been best described as a total high-density near cache:

  • Total: The entire dataset is cached on each node — there is no eviction policy, and there are no cache misses.
  • High-Density: encoding, bit-packing, and deduplication techniques are employed to optimize the memory footprint of the dataset.
  • Near: the cache exists in RAM on any instance which requires access to the dataset.

One exciting thing about the total nature of this technology — because we don’t have to worry about swapping records in-and-out of memory, we can make assumptions and do some precomputation of the in-memory representation of the dataset which would not otherwise be possible. The net result is, for many datasets, vastly more efficient use of RAM. Whereas with a traditional partial-cache solution you may wonder whether you can get away with caching only 5% of the dataset, or if you need to reserve enough space for 10% in order to get an acceptable hit/miss ratio — with the same amount of memory Hollow may be able to cache 100% of your dataset and achieve a 100% hit rate.

And obviously, if you get a 100% hit rate, you eliminate all I/O required to access your data — and can achieve orders of magnitude more efficient data access, which opens up many possibilities.

The Status-Quo

Until very recently, Gatekeeper was a completely event-driven system. When a change for a video occurred in any one of its upstream systems, that system would send an event to Gatekeeper. Gatekeeper would react to that event by reaching into each of its upstream services, gathering the necessary input data to evaluate the liveness of the video and its associated assets. It would then produce a single-record output detailing the status of that single video.

Old Gatekeeper Architecture

This model had several problems associated with it:

  • This process was completely I/O bound and put a lot of load on upstream systems.
  • Consequently, these events would queue up throughout the day and cause processing delays, which meant that titles may not actually go live on time.
  • Worse, events would occasionally get missed, meaning titles wouldn’t go live at all until someone from Title Operations realized there was a problem.

The mitigation for these issues was to “sweep” the catalog so Videos matching specific criteria (e.g., scheduled to launch next week) would get events automatically injected into the processing queue. Unfortunately, this mitigation added many more events into the queue, which exacerbated the problem.

Clearly, a change in direction was necessary.

The Idea

We decided to employ a total high-density near cache (i.e., Hollow) to eliminate our I/O bottlenecks. For each of our upstream systems, we would create a Hollow dataset which encompasses all of the data necessary for Gatekeeper to perform its evaluation. Each upstream system would now be responsible for keeping its cache updated.

New Gatekeeper Architecture

With this model, liveness evaluation is conceptually separated from the data retrieval from upstream systems. Instead of reacting to events, Gatekeeper would continuously process liveness for all assets in all videos across all countries in a repeating cycle. The cycle iterates over every video available at Netflix, calculating liveness details for each of them. At the end of each cycle, it produces a complete output (also a Hollow dataset) representing the liveness status details of all videos in all countries.

We expected that this continuous processing model was possible because a complete removal of our I/O bottlenecks would mean that we should be able to operate orders of magnitude more efficiently. We also expected that by moving to this model, we would realize many positive effects for the business.

  • A definitive solution for the excess load on upstream systems generated by Gatekeeper
  • A complete elimination of liveness processing delays and missed go-live dates.
  • A reduction in the time the Content Setup Engineering team spends on performance-related issues.
  • Improved debuggability and visibility into liveness processing.

The Problem

Hollow can also be thought of like a time machine. As a dataset changes over time, it communicates those changes to consumers by breaking the timeline down into a series of discrete data states. Each data state represents a snapshot of the entire dataset at a specific moment in time.

Hollow is like a time machine

Usually, consumers of a Hollow dataset are loading the latest data state and keeping their cache updated as new states are produced. However, they may instead point to a prior state — which will revert their view of the entire dataset to a point in the past.

The traditional method of producing data states is to maintain a single producer which runs a repeating cycle. During that cycle, the producer iterates over all records from the source of truth. As it iterates, it adds each record to the Hollow library. Hollow then calculates the differences between the data added during this cycle and the data added during the last cycle, then publishes the state to a location known to consumers.

Traditional Hollow usage

The problem with this total-source-of-truth iteration model is that it can take a long time. In the case of some of our upstream systems, this could take hours. This data-propagation latency was unacceptable — we can’t wait hours for liveness processing if, for example, Title Operations adds a rating to a movie that needs to go live imminently.

The Improvement

What we needed was a faster time machine — one which could produce states with a more frequent cadence, so that changes could be more quickly realized by consumers.

Incremental Hollow is like a faster time machine

To achieve this, we created an incremental Hollow infrastructure for Netflix, leveraging work which had been done in the Hollow library earlier, and pioneered in production usage by the Streaming Platform Team at Target (and is now a public non-beta API).

With this infrastructure, each time a change is detected in a source application, the updated record is encoded and emitted to a Kafka topic. A new component that is not part of the source application, the Hollow Incremental Producer service, performs a repeating cycle at a predefined cadence. During each cycle, it reads all messages which have been added to the topic since the last cycle and mutates the Hollow state engine to reflect the new state of the updated records.

If a message from the Kafka topic contains the exact same data as already reflected in the Hollow dataset, no action is taken.

Hollow Incremental Producer Service

To mitigate issues arising from missed events, we implement a sweep mechanism that periodically iterates over an entire source dataset. As it iterates, it emits the content of each record to the Kafka topic. In this way, any updates which may have been missed will eventually be reflected in the Hollow dataset. Additionally, because this is not the primary mechanism by which updates are propagated to the Hollow dataset, this does not have to be run as quickly or frequently as a cycle must iterate the source in traditional Hollow usage.

The Hollow Incremental Producer is capable of reading a great many messages from the Kafka topic and mutating its Hollow state internally very quickly — so we can configure its cycle times to be very short (we are currently defaulting this to 30 seconds).

This is how we built a faster time machine. Now, if Title Operations adds a maturity rating to a movie, within 30 seconds, that data is available in the corresponding Hollow dataset.

The Tangible Result

With the data propagation latency issue solved, we were able to re-implement the Gatekeeper system to eliminate all I/O boundaries. With the prior implementation of Gatekeeper, re-evaluating all assets for all videos in all countries would have been unthinkable — it would tie up the entire content pipeline for more than a week (and we would then still be behind by a week since nothing else could be processed in the meantime). Now we re-evaluate everything in about 30 seconds — and we do that every minute.

There is no such thing as a missed or delayed liveness evaluation any longer, and the disablement of the prior Gatekeeper system reduced the load on our upstream systems — in some cases by up to 80%.

Load reduction on one upstream system

In addition to these performance benefits, we also get a resiliency benefit. In the prior Gatekeeper system, if one of the upstream services went down, we were unable to evaluate liveness at all because we were unable to retrieve any data from that system. In the new implementation, if one of the upstream systems goes down then it does stop publishing — but we still gate stale data for its corresponding dataset while all others make progress. So for example, if the translated synopsis system goes down, we can still bring a movie on-site in a region if it was held back for, and then receives, the correct subtitles.

The Intangible Result

Perhaps even more beneficial than the performance gains has been the improvement in our development velocity in this system. We can now develop, validate, and release changes in minutes which might have before taken days or weeks — and we can do so with significantly increased release quality.

The time-machine aspect of Hollow means that every deterministic process which uses Hollow exclusively as input data is 100% reproducible. For Gatekeeper, this means that an exact replay of what happened at time X can be accomplished by reverting all of our input states to time X, then re-evaluating everything again.

We use this fact to iterate quickly on changes to the Gatekeeper business logic. We maintain a PREPROD Gatekeeper instance which “follows” our PROD Gatekeeper instance. PREPROD is also continuously evaluating liveness for the entire catalog, but publishing its output to a different Hollow dataset. At the beginning of each cycle, the PREPROD environment will gather the latest produced state from PROD, and set each of its input datasets to the exact same versions which were used to produce the PROD output.

The PREPROD Gatekeeper instance “follows” the PROD instance

When we want to make a change to the Gatekeeper business logic, we do so and then publish it to our PREPROD cluster. The subsequent output state from PREPROD can be diffed with its corresponding output state from PROD to view the precise effect that the logic change will cause. In this way, at a glance, we can validate that our changes have precisely the intended effect, and zero unintended consequences.

A Hollow diff shows exactly what changes

This, coupled with some iteration on the deployment process, has resulted in the ability for our team to code, validate, and deploy impactful changes to Gatekeeper in literally minutes — at least an order of magnitude faster than in the prior system — and we can do so with a higher level of safety than was possible in the previous architecture.

Conclusion

This new implementation of the Gatekeeper system opens up opportunities to capture additional business value, which we plan to pursue over the coming quarters. Additionally, this is a pattern that can be replicated to other systems within the Content Engineering space and elsewhere at Netflix — already a couple of follow-up projects have been launched to formalize and capitalize on the benefits of this n-hollow-input, one-hollow-output architecture.

Content Setup Engineering is an exciting space right now, especially as we scale up our pipeline to produce more content with each passing quarter. We have many opportunities to solve real problems and provide massive value to the business — and to do so with a deep focus on computer science, using and often pioneering leading-edge technologies. If this kind of work sounds appealing to you, reach out to Ivan to get the ball rolling.


Re-Architecting the Video Gatekeeper was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Exodus Forks Show That Open Source Kodi Add-ons Are Hard to Eradicate

Post Syndicated from Ernesto original https://torrentfreak.com/exodus-forks-show-that-open-source-kodi-add-ons-are-hard-to-eradicate-190428/

When the pirate streaming box hype reached new heights early 2017, the third-party Kodi add-on “Exodus” was at the center of the action.

Exodus was widely praised as one of the most useful add-ons to access streaming video. This included many pirated movies and TV-shows.

The open source software was maintained by “Lambda,” one of the most prolific developers in the community. However, this meant that when rightsholders started to tighten the screws, Exodus became one of the main targets.

It all started when the popular add-on repository TVAddons mysteriously disappeared. Since Exodus was distributed through the repository, many people experienced trouble updating it.

Initially, it was unknown what was going on with TVAddons but when the site returned more than a month later, it became clear that it was being sued by Bell Canada, TVA, Videotron, and Rogers. This complaint also listed Exodus, alongside 17 other add-ons.

Not much later, development of the Exodus add-on was discontinued. This meant that from one day to another, millions of users found out that their pirate streaming boxes had become useless. At least, in their more recent configuration.

It didn’t take long before others stepped up to fill this void. Interestingly, many of the Exodus alternatives were based on the original Exodus code, which was open source. Even today, nearly two years after the add-on was discontinued, its code lives on.

TVAddons recently published an overview of the various Exodus ‘forks’ that are still online today.

The top one appears to be the aptly named “Exodus Redux,” which is available through GitHub and maintained by a developer known as I-A-C.

However, there are many more add-ons based on the same code. This includes “Yoda,” “Exodus 8,” “Overeasy,” and “13Clowns,” to name a few. All of these allow users to stream video through an easy-to-use interface.

While the open source code is easy to fork, these add-ons can’t operate with complete impunity, of course. Several other Exodus based add-ons have already been discontinued, often following pressure from groups such as anti-piracy group ACE.

The Covenant add-on, developed by Team Colossus, threw in the towel after one of the main developers received a house visit, for example,. The Placenta add-on was discontinued following a cease and desist letter.

This begs the question: if new forks keep appearing, does it mean that rightsholders’ actions are futile?

According to TVAddons, which has banned these forks from its own platform, takedown efforts may help in the short term. However, when open source software is taken down, many alternate versions usually pop-up.

“The Rights holders efforts to destroy dual-use technologies seem to be effective in the very short-term. However, those enforcements only result in software and tools being spread out in a way that becomes uncontrollable in the long term, as we’ve seen with Kodi addons,” a TVAddons spokesperson told us.

In theory, this is indeed true. TVAddons listed just seven active Exodus forks, but there are many more out there. It’s a problem that’s hard to eradicate. 

However, the continued efforts from rightsholders to shut down these add-ons may have a more subtle effect. While hardcore pirates will always find a new fork, there’s also a group of people who will get frustrated by the repeated shutdowns, and give up eventually. 

If we take a look at the popularity of the Google search term “Kodi add-ons” we see that interest started to drop after the major enforcement efforts started. This may be a coincidence of course, but it could also be a sign of people giving up. 

Google searches for “Kodi add-ons”

It’s hard to deny that open source software can’t be easily eradicated, but the ease of access also play a role. 

We’ve also seen that with other popular open source applications, such as Popcorn Time. When one of the most popular forks was taken out following pressure from Hollywood, others remained available. Still, as time went on, interest began to wane. 

Similarly, when Limewire shut down years ago, the Frostwire fork remained available. However, this never reached the same audience as its predecessor. 

All in all, it’s safe to conclude that, while Exodus has left the scene a long time ago, its code still thrives. Whether the total audience is still as large as it once was, remains a question.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN reviews, discounts, offers and coupons.

Spinnaker Sets Sail to the Continuous Delivery Foundation

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/spinnaker-sets-sail-to-the-continuous-delivery-foundation-e81cd2cbbfeb?source=rss----2615bd06b42e---4

Author: Andy Glover

Since releasing Spinnaker to the open source community in 2015, the platform has flourished with the addition of new cloud providers, triggers, pipeline stages, and much more. Myriad new features, improvements, and innovations have been added by an ever growing, actively engaged community. Each new innovation has been a step towards an even better Continuous Delivery platform that facilitates rapid, reliable, safe delivery of flexible assets to pluggable deployment targets.

Over the last year, Netflix has improved overall management of Spinnaker by enhancing community engagement and transparency. At the Spinnaker Summit in 2018, we announced that we had adopted a formalized project governance plan with Google. Moreover, we also realized that we’ll need to share the responsibility of Spinnaker’s direction as well as yield a level of long-term strategic influence over the project so as to maintain a healthy, engaged community. This means enabling more parties outside of Netflix and Google to have a say in the direction and implementation of Spinnaker.

A strong, healthy, committed community benefits everyone; however, open source projects rarely reach this critical mass. It’s clear Spinnaker has reached this special stage in its evolution; accordingly, we are thrilled to announce two exciting developments.

First, Netflix and Google are jointly donating Spinnaker to the newly created Continuous Delivery Foundation (or CDF), which is part of the Linux Foundation. The CDF is a neutral organization that will grow and sustain an open continuous delivery ecosystem, much like the Cloud Native Computing Foundation (or CNCF) has done for the cloud native computing ecosystem. The initial set of projects to be donated to the CDF are Jenkins, Jenkins X, Spinnaker, and Tekton. Second, Netflix is joining as a founding member of the CDF. Continuous Delivery powers innovation at Netflix and working with other leading practitioners to promote Continuous Delivery through specifications is an exciting opportunity to join forces and bring the benefits of rapid, reliable, and safe delivery to an even larger community.

Spinnaker’s success is in large part due to the amazing community of companies and people that use it and contribute to it. Donating Spinnaker to the CDF will strengthen this community. This move will encourage contributions and investments from additional companies who are undoubtedly waiting on the sidelines. Opening the doors to new companies increases the innovations we’ll see in Spinnaker, which benefits everyone.

Donating Spinnaker to the CDF doesn’t change Netflix’s commitment to Spinnaker, and what’s more, current users of Spinnaker are unaffected by this change. Spinnaker’s previously defined governance policy remains in place. Overtime, new stakeholders will emerge and play a larger, more formal role in shaping Spinnaker’s future. The prospects of an even healthier and more engaged community focused on Spinnaker and the manifold benefits of Continuous Delivery is tremendously exciting and we’re looking forward to seeing it continue to flourish.


Spinnaker Sets Sail to the Continuous Delivery Foundation was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

New – Open Distro for Elasticsearch

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-open-distro-for-elasticsearch/

Elasticsearch is a distributed, document-oriented search and analytics engine. It supports structured and unstructured queries, and does not require a schema to be defined ahead of time. Elasticsearch can be used as a search engine, and is often used for web-scale log analytics, real-time application monitoring, and clickstream analytics.

Originally launched as a true open source project, some of the more recent additions to Elasticsearch are proprietary. My colleague Adrian explains our motivation to start Open Distro for Elasticsearch in his post, Keeping Open Source Open. As strong believers in, and supporters of, open source software, we believe this project will help continue to accelerate open source Elasticsearch innovation.

Open Distro for Elasticsearch
Today we are launching Open Distro for Elasticsearch. This is a value-added distribution of Elasticsearch that is 100% open source (Apache 2.0 license) and supported by AWS. Open Distro for Elasticsearch leverages the open source code for Elasticsearch and Kibana. This is not a fork; we will continue to send our contributions and patches upstream to advance these projects.

In addition to Elasticsearch and Kibana, the first release includes a set of advanced security, event monitoring & alerting, performance analysis, and SQL query features (more on those in a bit). In addition to the source code repo, Open Distro for Elasticsearch and Kibana are available as RPM and Docker containers, with separate downloads for the SQL JDBC and the PerfTop CLI. You can run this code on your laptop, in your data center, or in the cloud.

Contributions are welcome, as are bug reports and feature requests.

Inside Open Distro for Elasticsearch
Let’s take a quick look at the features that we are including in Open Distro for Elasticsearch. Some of these are currently available in Amazon Elasticsearch Service; others will become available in future updates.

Security – This plugin that supports node-to-node encryption, five types of authentication (basic, Active Directory, LDAP, Kerberos, and SAML), role-based access controls at multiple levels (clusters, indices, documents, and fields), audit logging, and cross-cluster search so that any node in a cluster can run search requests across other nodes in the cluster. Learn More

Event Monitoring & Alerting – This feature notifies you when data from one or more Elasticsearch indices meets certain conditions. You could, for example, notify a Slack channel if an application logs more than five HTTP 503 errors in an hour. Monitoring is based on jobs that run on a defined schedule, checking indices against trigger conditions, and raising alerts when a condition has been triggered. Learn More

Deep Performance Analysis – This is a REST API that allows you to query a long list of performance metrics for your cluster. You can access the metrics programmatically or you can visualize them using perf top and other perf tools. Learn More

SQL Support – This feature allows you to query your cluster using SQL statements. It is an improved version of the elasticsearch-sql plugin, and supports a rich set of statements.

This is just the beginning; we have more in the works, and also look forward to your contributions and suggestions!

Jeff;

 

Getting started with the AWS Cloud Development Kit for Amazon ECS

Post Syndicated from Nathan Peck original https://aws.amazon.com/blogs/compute/getting-started-with-the-aws-cloud-development-kit-for-amazon-ecs/

The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework to define cloud infrastructure in code and provision it through AWS CloudFormation. The AWS CDK integrates fully with AWS services and offers a higher-level object-oriented abstraction to define AWS resources imperatively.

Using the AWS CDK library of infrastructure constructs, you can easily encapsulate AWS best practices in your infrastructure definition and share it without worrying about boilerplate logic. The AWS CDK improves the end-to-end development experience because you get to use the power of modern programming languages to define your AWS infrastructure in a predictable and efficient manner. The AWS CDK is currently available for TypeScript, JavaScript, Java, and .NET.

The AWS CDK now includes constructs for ECS resources, allowing you to deploy a fully functioning containerized application environment on AWS with just a few lines of simple, readable code. Here’s how it works.

Install the AWS CDK

The first step is to install the AWS CDK on your development machine:

mkdir greeter-cdk
cd greeter-cdk
npm init -y
npm install @aws-cdk/cdk
npm install -g aws-cdk

Next, write some JavaScript code that imports the AWS CDK library and uses it to define a skeleton that you can place all your resources in:

index.js

const cdk = require('@aws-cdk/cdk');

class GreetingStack extends cdk.Stack {
  constructor(parent, id, props) {
    super(parent, id, props);
  }
}

class GreetingApp extends cdk.App {
  constructor(argv) {
    super(argv);
    new GreetingStack(this, 'greeting-stack');
  }
}

new GreetingApp().run();

Next, write a small configuration file telling the AWS CDK CLI that index.js is the code that defines your application stack:

cdk.json

{
  "app": "node index.js"
}

Now if you run cdk ls –l, you can see that the AWS CDK has found your stack, and has automatically interpolated some details about it from your development machine’s environment, such as your AWS account ID and default Region:

$ cdk ls -l
- name: greeting-stack
  environment:
    name: 209640446841/us-east-1
    account: '209640446841'
    region: us-east-1

Add ECS constructs

It’s time to add some ECS constructs to your stack. To do this, first install the ECS construct library. Also, install a couple of other constructs to help you set up resources linked to your containers:

npm install @aws-cdk/aws-ecs
npm install @aws-cdk/aws-ec2
npm install @aws-cdk/aws-elasticloadbalancingv2

Now it’s time to use the ECS constructs to set up your application environment:

const cdk = require('@aws-cdk/cdk');
const ecs = require('@aws-cdk/aws-ecs');
const ec2 = require('@aws-cdk/aws-ec2');

class GreetingStack extends cdk.Stack {
  constructor(parent, id, props) {
    super(parent, id, props);

    const vpc = new ec2.VpcNetwork(this, 'GreetingVpc', { maxAZs: 2 });

    // Create an ECS cluster
    const cluster = new ecs.Cluster(this, 'Cluster', { vpc });

    // Add capacity to it
    cluster.addDefaultAutoScalingGroupCapacity({
      instanceType: new ec2.InstanceType('t3.xlarge'),
      instanceCount: 3
    });
  }
}

With just three calls, you can create a VPC to hold all your application resources, and an ECS cluster with three t3.xlarge instances. All it takes is one command to tell the AWS CDK to automatically deploy this stack on your account:

cdk deploy

Behind the scenes, the AWS CDK synthesizes your JavaScript calls into a CloudFormation template. It asks CloudFormation to deploy the resources described in the synthesized template. You can see a live log of each resource that is being created and what the status is. As you can see from the numbers on the left side of the message stream, those three simple commands added to the AWS CDK stack automatically expanded into 32 lower-level, primitive resources to be created on your AWS account.

After the AWS CDK deployment finishes, you have a fresh ECS cluster ready to run your services. Next you will deploy a simple microservices stack onto this cluster. Your application will be a simple greeting server. The frontend greeter service fetches a random greeting and name from two backend services. There are two tiers to this application: the frontend and backend. The network will look like the following diagram:

There are two load balancers, one of them allows anyone on the internet to talk to your greeter service. The other is internal and designed to allow the greeter service to talk to the other greeting and name services.

In total, you need to add five more high-level constructs to your AWS CDK application: two load balancers and three services.

Add a new ECS service

Adding a new ECS service to the application stack is easy. Define a task definition, add a container to it, and tell the AWS CDK to turn the task definition into a service:

// Name service
const nameTaskDefinition = new ecs.Ec2TaskDefinition(this, 'name-task-definition', {});

const nameContainer = nameTaskDefinition.addContainer('name', {
   image: ecs.ContainerImage.fromDockerHub('nathanpeck/name'),
   memoryLimitMiB: 128
});

nameContainer.addPortMappings({
   containerPort: 3000
});

const nameService = new ecs.Ec2Service(this, 'name-service', {
    cluster: cluster,
    desiredCount: 2,
     taskDefinition: nameTaskDefinition
});

// Greeting service
 const greetingTaskDefinition = new ecs.Ec2TaskDefinition(this, 'greeting-task-definition', {});

  const greetingContainer = greetingTaskDefinition.addContainer('greeting', {
    image: ecs.ContainerImage.fromDockerHub ('nathanpeck/greeting'),
    memoryLimitMiB: 128
  });

  greetingContainer.addPortMappings({
    containerPort: 3000
  });

  const greetingService = new ecs.Ec2Service(this, 'greeting-service', {
    cluster: cluster,
    desiredCount: 2,
    taskDefinition: greetingTaskDefinition
  });

Just like that, you’ve defined two different ECS services that run by loading a public image from Docker Hub. The next step is to create a load balancer and add the services to it:

// Internal load balancer for the backend services
    const internalLB = new elbv2.ApplicationLoadBalancer(this, 'internal', {
      vpc: vpc,
      internetFacing: false
    });

    const internalListener = internalLB.addListener('PublicListener', { port: 80, open: true });

    internalListener.addTargetGroups('default', {
      targetGroups: [new elbv2.ApplicationTargetGroup(this, 'default', {
        vpc: vpc,
        protocol: 'HTTP',
        port: 80
      })]
    });

    internalListener.addTargets('name', {
      port: 80,
      pathPattern: '/name*',
      priority: 1,
      targets: [nameService]
    });

    internalListener.addTargets('greeting', {
      port: 80,
      pathPattern: '/greeting*',
      priority: 2,
      targets: [greetingService]
    });

For this configuration, the code defines a single load balancer with a single listener on port 80, but adds two different services behind it. If the path of the request looks like /name, it sends the request to your name service. If it looks like /greeting, it sends the request to the greeting service.

Finally, add the frontend greeter service, which constructs a random greeting phrase by fetching a random name from the name service and a random greeting from the greeting service. To do this, configure the greeter service to know how to make requests to the other two backend services:

    // Greeter service
    const greeterTaskDefinition = new ecs.Ec2TaskDefinition(this, 'greeter-task-definition', {});

    const greeterContainer = greeterTaskDefinition.addContainer('greeter', {
      image: ecs.ContainerImage.fromDockerHub ('nathanpeck/greeter'),
      memoryLimitMiB: 128,
      environment: {
        GREETING_URL: 'http://' + internalLB.dnsName + '/greeting',
        NAME_URL: 'http://' + internalLB.dnsName + '/name'
      }
    });

    greeterContainer.addPortMappings({
      containerPort: 3000
    });

    const greeterService = new ecs.Ec2Service(this, 'greeter-service', {
      cluster: cluster,
      desiredCount: 2,
      taskDefinition: greeterTaskDefinition
    });

The AWS CDK has a powerful capability to resolve expressions that you enter in your JavaScript and turn them into a CloudFormation template that resolves the correct values.

In this example, you create a reference to the DNS name of the load balancer, and indicate that you want to assign the following:

·    Environment variable = 'NAME_URL'

·    Value = 'http://' + internalLB.dnsName + '/name'

If you run cdk synth, you can see that the AWS CDK generates a CloudFormation template that dynamically inserts the proper DNS name of the load balancer at deployment time:

Type: 'AWS::ECS::TaskDefinition'
    Properties:
      ContainerDefinitions:
        - Environment:
            - Name: GREETING_URL
              Value:
                'Fn::Join':
                  - ''
                  - - 'http://'
                    - 'Fn::GetAtt':
                        - internal505AC855
                        - DNSName
                    - /greeting
            - Name: NAME_URL
              Value:
                'Fn::Join':
                  - ''
                  - - 'http://'
                    - 'Fn::GetAtt':
                        - internal505AC855
                        - DNSName
                    - /name

One final thing to add to your AWS CDK stack is an output. This gives you the DNS name of your service so you can send traffic to it:

new cdk.Output(this, 'ExternalDNS', { value: externalLB.dnsName });

Now see what is added when you deploy. Type the following command to see a preview of new or modified resources without actually doing the deployment:

cdk diff

The list of new resources being added looks good so run cdk deploy again. Again, the AWS CDK synthesizes the CloudFormation template, and initializes its deployment on your AWS account. This time, however, it creates a total of 66 resources, and gives you a URL output where the application is hosted:

To verify that your application is up and accepting traffic at that URL, load that internet ExternalDNS URL in your browser. The web application was able to talk to the two other backend services to get a greeting and a name:

Conclusion

If you’d like to try deploying this microservice stack yourself or using it as the basis for building your own AWS CDK stack, you can find the full AWS CDK example code on GitHub. Be sure to check out the AWS CDK documentation and the official AWS CDK construct for Amazon ECS on NPM.

Firecracker – Lightweight Virtualization for Serverless Computing

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/firecracker-lightweight-virtualization-for-serverless-computing/

One of my favorite Amazon Leadership Principles is Customer Obsession. When we launched AWS Lambda, we focused on giving developers a secure serverless experience so that they could avoid managing infrastructure. In order to attain the desired level of isolation we used dedicated EC2 instances for each customer. This approach allowed us to meet our security goals but forced us to make some tradeoffs with respect to the way that we managed Lambda behind the scenes. Also, as is the case with any new AWS service, we did not know how customers would put Lambda to use or even what they would think of the entire serverless model. Our plan was to focus on delivering a great customer experience while making the backend ever-more efficient over time.

Just four years later (Lambda was launched at re:Invent 2014) it is clear that the serverless model is here to stay. Today, Lambda processes trillions of executions for hundreds of thousands of active customers every month. Last year we extended the benefits of serverless to containers with the launch of AWS Fargate, which now runs tens of millions of containers for AWS customers every week.

As our customers increasingly adopted serverless, it was time to revisit the efficiency issue. Taking our Invent and Simplify principle to heart, we asked ourselves what a virtual machine would look like if it was designed for today’s world of containers and functions!

Introducing Firecracker
Today I would like to tell you about Firecracker, a new virtualization technology that makes use of KVM. You can launch lightweight micro-virtual machines (microVMs) in non-virtualized environments in a fraction of a second, taking advantage of the security and workload isolation provided by traditional VMs and the resource efficiency that comes along with containers.

Here’s what you need to know about Firecracker:

Secure – This is always our top priority! Firecracker uses multiple levels of isolation and protection, and exposes a minimal attack surface.

High Performance – You can launch a microVM in as little as 125 ms today (and even faster in 2019), making it ideal for many types of workloads, including those that are transient or short-lived.

Battle-TestedFirecracker has been battled-tested and is already powering multiple high-volume AWS services including AWS Lambda and AWS Fargate.

Low OverheadFirecracker consumes about 5 MiB of memory per microVM. You can run thousands of secure VMs with widely varying vCPU and memory configurations on the same instance.

Open SourceFirecracker is an active open source project. We are already ready to review and accept pull requests, and look forward to collaborating with contributors from all over the world.

Firecracker was built in a minimalist fashion. We started with crosvm and set up a minimal device model in order to reduce overhead and to enable secure multi-tenancy. Firecracker is written in Rust, a modern programming language that guarantees thread safety and prevents many types of buffer overrun errors that can lead to security vulnerabilities.

Firecracker Security
As I mentioned earlier, Firecracker incorporates a host of security features! Here’s a partial list:

Simple Guest ModelFirecracker guests are presented with a very simple virtualized device model in order to minimize the attack surface: a network device, a block I/O device, a Programmable Interval Timer, the KVM clock, a serial console, and a partial keyboard (just enough to allow the VM to be reset).

Process Jail – The Firecracker process is jailed using cgroups and seccomp BPF, and has access to a small, tightly controlled list of system calls.

Static Linking – The firecracker process is statically linked, and can be launched from a jailer to ensure that the host environment is as safe and clean as possible.

Firecracker in Action
To get some experience with Firecracker, I launch an i3.metal instance and download three files (the firecracker binary, a root file system image, and a Linux kernel):

I need to set up the proper permission to access /dev/kvm:

$  sudo setfacl -m u:${USER}:rw /dev/kvm

I start firecracker in one PuTTY session, and then issue commands in another (the process listens on a Unix-domain socket and implements a REST API). The first command sets the configuration for my first guest machine:

$ curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT "http://localhost/machine-config" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{
        \"vcpu_count\": 1,
        \"mem_size_mib\": 512
    }"

And, the second sets the guest kernel:

$ curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT "http://localhost/boot-source" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{
        \"kernel_image_path\": \"./hello-vmlinux.bin\",
        \"boot_args\": \"console=ttyS0 reboot=k panic=1 pci=off\"
    }"

And, the third one sets the root file system:

$ curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT "http://localhost/drives/rootfs" \
    -H "accept: application/json" \
    -H "Content-Type: application/json" \
    -d "{
        \"drive_id\": \"rootfs\",
        \"path_on_host\": \"./hello-rootfs.ext4\",
        \"is_root_device\": true,
        \"is_read_only\": false
    }"

With everything set to go, I can launch a guest machine:

# curl --unix-socket /tmp/firecracker.sock -i \
    -X PUT "http://localhost/actions" \
    -H  "accept: application/json" \
    -H  "Content-Type: application/json" \
    -d "{
        \"action_type\": \"InstanceStart\"
     }"

And I am up and running with my first VM:

In a real-world scenario I would script or program all of my interactions with Firecracker, and I would probably spend more time setting up the networking and the other I/O. But re:Invent awaits and I have a lot more to do, so I will leave that part as an exercise for you.

Collaborate with Us
As you can see this is a giant leap forward, but it is just a first step. The team is looking forward to telling you more, and to working with you to move ahead. Star the repo, join the community, and send us some code!

Jeff;

 

 

 

 

timeShift(GrafanaBuzz, 1w) Issue 47

Post Syndicated from Blogs on Grafana Labs Blog original https://grafana.com/blog/2018/06/01/timeshiftgrafanabuzz-1w-issue-47/

Welcome to TimeShift We cover a lot of ground this week with posts on general monitoring principles, home automation, how CERN uses open source projects in their particle acceleration work, and more. Have an article you’d like highlighted here? Get in touch.
We’re excited to be a sponsor of Monitorama PDX June 4-6. If you’re going, please be sure and say hello! Latest Release: Grafana 5.1.3 This latest point release fixes a scrolling issue that was reported in Firefox.

Robin “Roblimo” Miller

Post Syndicated from corbet original https://lwn.net/Articles/755563/rss

The Linux Journal mourns
the passing of Robin Miller
, a longtime presence in our community.
Miller was perhaps best known by the community for his roll as
Editor in Chief of Open Source Technology Group, the company that owned
Slashdot, SourceForge.net, freshmeat, Linux.com, NewsForge, and ThinkGeek
from 2000 to 2008.

The Benefits of Side Projects

Post Syndicated from Bozho original https://techblog.bozho.net/the-benefits-of-side-projects/

Side projects are the things you do at home, after work, for your own “entertainment”, or to satisfy your desire to learn new stuff, in case your workplace doesn’t give you that opportunity (or at least not enough of it). Side projects are also a way to build stuff that you think is valuable but not necessarily “commercialisable”. Many side projects are open-sourced sooner or later and some of them contribute to the pool of tools at other people’s disposal.

I’ve outlined one recommendation about side projects before – do them with technologies that are new to you, so that you learn important things that will keep you better positioned in the software world.

But there are more benefits than that – serendipitous benefits, for example. And I’d like to tell some personal stories about that. I’ll focus on a few examples from my list of side projects to show how, through a sort-of butterfly effect, they helped shape my career.

The computoser project, no matter how cool algorithmic music composition, didn’t manage to have much of a long term impact. But it did teach me something apart from niche musical theory – how to read a bulk of scientific papers (mostly computer science) and understand them without being formally trained in the particular field. We’ll see how that was useful later.

Then there was the “State alerts” project – a website that scraped content from public institutions in my country (legislation, legislation proposals, decisions by regulators, new tenders, etc.), made them searchable, and “subscribable” – so that you get notified when a keyword of interest is mentioned in newly proposed legislation, for example. (I obviously subscribed for “information technologies” and “electronic”).

And that project turned out to have a significant impact on the following years. First, I chose a new technology to write it with – Scala. Which turned out to be of great use when I started working at TomTom, and on the 3rd day I was transferred to a Scala project, which was way cooler and much more complex than the original one I was hired for. It was a bit ironic, as my colleagues had just read that “I don’t like Scala” a few weeks earlier, but nevertheless, that was one of the most interesting projects I’ve worked on, and it went on for two years. Had I not known Scala, I’d probably be gone from TomTom much earlier (as the other project was restructured a few times), and I would not have learned many of the scalability, architecture and AWS lessons that I did learn there.

But the very same project had an even more important follow-up. Because if its “civic hacking” flavour, I was invited to join an informal group of developers (later officiated as an NGO) who create tools that are useful for society (something like MySociety.org). That group gathered regularly, discussed both tools and policies, and at some point we put up a list of policy priorities that we wanted to lobby policy makers. One of them was open source for the government, the other one was open data. As a result of our interaction with an interim government, we donated the official open data portal of my country, functioning to this day.

As a result of that, a few months later we got a proposal from the deputy prime minister’s office to “elect” one of the group for an advisor to the cabinet. And we decided that could be me. So I went for it and became advisor to the deputy prime minister. The job has nothing to do with anything one could imagine, and it was challenging and fascinating. We managed to pass legislation, including one that requires open source for custom projects, eID and open data. And all of that would not have been possible without my little side project.

As for my latest side project, LogSentinel – it became my current startup company. And not without help from the previous two mentioned above – the computer science paper reading was of great use when I was navigating the crypto papers landscape, and from the government job I not only gained invaluable legal knowledge, but I also “got” a co-founder.

Some other side projects died without much fanfare, and that’s fine. But the ones above shaped my “story” in a way that would not have been possible otherwise.

And I agree that such serendipitous chain of events could have happened without side projects – I could’ve gotten these opportunities by meeting someone at a bar (unlikely, but who knows). But we, as software engineers, are capable of tilting chance towards us by utilizing our skills. Side projects are our “extracurricular activities”, and they often lead to unpredictable, but rather positive chains of events. They would rarely be the only factor, but they are certainly great at unlocking potential.

The post The Benefits of Side Projects appeared first on Bozho's tech blog.

Creating a 1.3 Million vCPU Grid on AWS using EC2 Spot Instances and TIBCO GridServer

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/creating-a-1-3-million-vcpu-grid-on-aws-using-ec2-spot-instances-and-tibco-gridserver/

Many of my colleagues are fortunate to be able to spend a good part of their day sitting down with and listening to our customers, doing their best to understand ways that we can better meet their business and technology needs. This information is treated with extreme care and is used to drive the roadmap for new services and new features.

AWS customers in the financial services industry (often abbreviated as FSI) are looking ahead to the Fundamental Review of Trading Book (FRTB) regulations that will come in to effect between 2019 and 2021. Among other things, these regulations mandate a new approach to the “value at risk” calculations that each financial institution must perform in the four hour time window after trading ends in New York and begins in Tokyo. Today, our customers report this mission-critical calculation consumes on the order of 200,000 vCPUs, growing to between 400K and 800K vCPUs in order to meet the FRTB regulations. While there’s still some debate about the magnitude and frequency with which they’ll need to run this expanded calculation, the overall direction is clear.

Building a Big Grid
In order to make sure that we are ready to help our FSI customers meet these new regulations, we worked with TIBCO to set up and run a proof of concept grid in the AWS Cloud. The periodic nature of the calculation, along with the amount of processing power and storage needed to run it to completion within four hours, make it a great fit for an environment where a vast amount of cost-effective compute power is available on an on-demand basis.

Our customers are already using the TIBCO GridServer on-premises and want to use it in the cloud. This product is designed to run grids at enterprise scale. It runs apps in a virtualized fashion, and accepts requests for resources, dynamically provisioning them on an as-needed basis. The cloud version supports Amazon Linux as well as the PostgreSQL-compatible edition of Amazon Aurora.

Working together with TIBCO, we set out to create a grid that was substantially larger than the current high-end prediction of 800K vCPUs, adding a 50% safety factor and then rounding up to reach 1.3 million vCPUs (5x the size of the largest on-premises grid). With that target in mind, the account limits were raised as follows:

  • Spot Instance Limit – 120,000
  • EBS Volume Limit – 120,000
  • EBS Capacity Limit – 2 PB

If you plan to create a grid of this size, you should also bring your friendly local AWS Solutions Architect into the loop as early as possible. They will review your plans, provide you with architecture guidance, and help you to schedule your run.

Running the Grid
We hit the Go button and launched the grid, watching as it bid for and obtained Spot Instances, each of which booted, initialized, and joined the grid within two minutes. The test workload used the Strata open source analytics & market risk library from OpenGamma and was set up with their assistance.

The grid grew to 61,299 Spot Instances (1.3 million vCPUs drawn from 34 instance types spanning 3 generations of EC2 hardware) as planned, with just 1,937 instances reclaimed and automatically replaced during the run, and cost $30,000 per hour to run, at an average hourly cost of $0.078 per vCPU. If the same instances had been used in On-Demand form, the hourly cost to run the grid would have been approximately $93,000.

Despite the scale of the grid, prices for the EC2 instances did not move during the bidding process. This is due to the overall size of the AWS Cloud and the smooth price change model that we launched late last year.

To give you a sense of the compute power, we computed that this grid would have taken the #1 position on the TOP 500 supercomputer list in November 2007 by a considerable margin, and the #2 position in June 2008. Today, it would occupy position #360 on the list.

I hope that you enjoyed this AWS success story, and that it gives you an idea of the scale that you can achieve in the cloud!

Jeff;

[$] Containers and license compliance

Post Syndicated from jake original https://lwn.net/Articles/752982/rss

Containers are, of course, all the rage these days; in fact, during his
2018 Legal and
Licensing Workshop
(LLW) talk, Dirk Hohndel
said with a grin that he hears “containers may take off”. But, while
containers are
easy to set up and use, license compliance for containers is “incredibly
hard”. He has been spending “way too much time” thinking about container
compliance recently and, beyond the standard “let’s go shopping” solution
to hard problems, has come up with some ideas.
Hohndel is a longtime member of the FOSS community who is now the chief
open source officer at VMware—a company that ships some container images.

Former Judge Accuses IP Court of Using ‘Pirate’ Microsoft Software

Post Syndicated from Andy original https://torrentfreak.com/former-judge-accuses-ip-court-of-using-pirate-microsoft-software-180429/

While piracy of movies, TV shows, and music grabs most of the headlines, software piracy is a huge issue, from both consumer and commercial perspectives.

For many years, software such as Photoshop has been pirated on a grand scale and around the world, millions of computers rely on cracked and unlicensed copies of Microsoft’s Windows software.

One of the key drivers of this kind of piracy is the relative expense of software. Open source variants are nearly always available but big brand names always seem more popular due to their market penetration and perceived ease of use.

While using pirated software very rarely gets individuals into trouble, the same cannot be said of unlicensed commercial operators. That appears to be the case in Russia where somewhat ironically the Court for Intellectual Property Rights stands accused of copyright infringement.

A complaint filed by the Paragon law firm at the Prosecutor General’s Office of the Court for Intellectual Property Rights (CIP) alleges that the Court is illegally using Microsoft software, something which has the potential to affect the outcome of court cases involving the US-based software giant.

Paragon is representing Alexander Shmuratov, who is a former Assistant Judge at the Court for Intellectual Property Rights. Shmuratov worked at the Court for several years and claims that the computers there were being operated with expired licenses.

Shmuratov himself told Kommersant that he “saw the notice of an activation failure every day when using MS Office products” in intellectual property court.

A representative of the Prosecutor General’s Office confirmed that a complaint had been received but said it had been forwarded to the Ministry of Internal Affairs.

In respect of the counterfeit software claims, CIP categorically denies the allegations. CIP says that licenses for all Russian courts were purchased back in 2008 and remained in force until 2011. In 2013, Microsoft agreed to an extension.

Only adding more intrigue to the story, CIP Assistant chairman Catherine Ulyanova said that the initator of the complaint, former judge Alexander Shmuratov, was dismissed from the CIP because he provided false information about income. He later mounted a challenge against his dismissal but was unsuccessful.

Ulyanova said that Microsoft licensed all courts from 2006 for use of Windows and MS Office. The licenses were acquired through a third-party company and more licenses than necessary were purchased, with some licenses being redistributed for use by CIP in later years with the consent of Microsoft.

Kommersant was unable to confirm how licenses were paid for beyond December 2011 but apparently an “official confirmation letter from the Irish headquarters of Microsoft, which does not object to the transfer of CIP licenses” had been sent to the Court.

Responding to Shmuratov’s allegations that software he used hadn’t been activated, Ulyanova said that technical problems had no relationship with the existence of software licenses.

The question of whether the Court is properly licensed will be determined at a later date but observers are already raising questions concerning CIP’s historical dealings with Microsoft not only in terms of licensing, but in cases it handled.

In the period 2014-2017, the Court for Intellectual Property Rights handled around 80 cases involving Microsoft and claims of between 50 thousand ($800) and several million rubles.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN reviews, discounts, offers and coupons.

Ubuntu 18.04 LTS (Bionic Beaver) released

Post Syndicated from corbet original https://lwn.net/Articles/752904/rss

Ubuntu 18.04, a long-term-support release, is out.
Codenamed ‘Bionic Beaver’, 18.04 LTS continues Ubuntu’s proud tradition
of integrating the latest and greatest open source technologies into a
high-quality, easy-to-use Linux distribution. The team has been hard at
work through this cycle, introducing new features and fixing bugs.

It features a 4.15 kernel, a new GNOME-based desktop environment, and
more. See the
release notes
and this overview for details.