Tag Archives: GraphQL

Beyond REST

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/beyond-rest-1b76f7c20ef6

Rapid Development with GraphQL Microservices

by Dane Avilla

The entertainment industry has struggled with COVID-19 restrictions impacting productions around the globe. Since early 2020, Netflix has been iteratively developing systems to provide internal stakeholders and business leaders with up-to-date tools and dashboards containing the latest information on the pandemic. These software solutions allow executive leadership to make the most informed decisions possible regarding if and when a given physical production can safely begin creating compelling content across the world. One approach that is gaining mind-share within Netflix is the concept of GraphQL microservices (GQLMS) as a backend platform facilitating rapid application development.

Many organizations are embracing GraphQL as a way to unify their enterprise-wide data model and provide a single entry point for navigating a sea of structured data with its network of related entities. Such efforts are laudable but often entail multiple calendar quarters of coordination between internal organizations followed by the development and integration of all relevant entities into a single monolithic graph.

In contrast to this “One Graph to Rule Them All” approach, GQLMS leverage GraphQL simply as an enriched API specification for building CRUD applications. Our experience using GQLMS for rapid proof-of-concept applications confirmed two theories regarding the advertised benefits of GraphQL:

  • The GraphiQL IDE displays any available GraphQL documentation right alongside the schema, dramatically improving developer ergonomics for API consumers (in contrast to the best-in-class Swagger UI).
  • GraphQL’s strong type system and polyglot client support mean API providers do not need to concern themselves with generating, versioning, and maintaining language-specific API clients (such as those generated with the excellent Swagger Codegen). Consumers of GraphQL APIs can simply leverage the open-source GraphQL client of their preference.
GraphiQL: Auto-generated test GUI for the Star Wars API

Our experience has led to an architecture with a number of best-practices for teams interested in GQLMS as a platform for rapid development.

Graphile

During early GraphQL exploration efforts, Netflix engineers became aware of the Graphile library for presenting PostgreSQL database objects (tables, views, and functions) as a GraphQL API. Graphile supports smart comments allowing control of various features by tagging database tables, views, columns, and types with specifically formatted PostgreSQL comments. Documentation can even be embedded in the database comments such that it displays in the GraphQL schema generated by Graphile.

We hypothesized that a Docker container running a very simple NodeJS web server with the Graphile library (and some additional Netflix internal components for security, logging, metrics, and monitoring) could provide a “better REST than REST” or “REST++” platform for rapid development efforts. Using Docker we defined a lightweight, stand-alone container that allowed us to package the Graphile library and its supporting code into a self-contained bundle that any team can use at Netflix with no additional coding required. Simply pull down the defined Docker base image and run it with the appropriate database connection string. This approach proved to be very successful and yielded several insights into the use of Graphile.

Specifically:

  • Use database views as an “API layer” to preserve flexibility, allowing the underlying tables to be modified without changing the existing GraphQL schema (which is built on the views).
  • Use PostgreSQL Composite Types when taking advantage of PostgreSQL Aggregate Functions.
  • Increase flexibility by initially giving GraphQL clients “full access” to the queries and mutations that Graphile auto-generates (exposing CRUD operations on all tables & views); then, later in the development process, remove schema elements that the UI did not end up using before the app goes into production.

Database views as API

We decided to put the data tables in one PostgreSQL schema and then define views on those tables in another schema, with the Graphile web app connecting to the database using a dedicated PostgreSQL user role. This ended up achieving several different goals:

  • Underlying tables could be changed independently of the views exposed in the GraphQL schema.
  • Views could do basic formatting (like rendering TIMESTAMP fields as ISO8601 strings).
  • All permissions on the underlying table had to be explicitly granted to the web application’s PostgreSQL user, avoiding unexpected write access.
  • Tables and views could be modified within a single transaction such that the changes to the exposed GraphQL schema happened atomically.

On this last point: changing a table column’s type would break the associated view, but by wrapping the change in a transaction, the view could be dropped, the column updated, and the view re-created before committing the transaction. We run Graphile with pgWatch enabled, so as soon as any updates are made to the database, the GraphQL schema immediately updates to reflect the change.
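
As a rough sketch of both points (all schema, table, view, and role names here are illustrative, not taken from the actual system): the tables live in one schema, the views in another, Graphile connects as a dedicated role that only sees the view schema, and a breaking column change is wrapped in a transaction together with the dependent view.

create schema app_data;
create schema postgraphile;

create table app_data.movie (
  id         integer primary key,
  title      text not null,
  created_at timestamptz not null default now()
);

-- The view is the "API layer": it can rename columns and format values
-- (e.g. render the timestamp as an ISO8601 string) without touching the table.
create view postgraphile.movie as
  select id,
         title,
         to_char(created_at at time zone 'UTC', 'YYYY-MM-DD"T"HH24:MI:SS"Z"') as created_at
    from app_data.movie;

-- The web app's role is only granted what it needs on the view schema.
create role graphile_web login;
grant usage on schema postgraphile to graphile_web;
grant select on all tables in schema postgraphile to graphile_web;

Continuing the sketch, a column type change that would otherwise break the view is done atomically:

begin;
  drop view postgraphile.movie;
  alter table app_data.movie
    alter column id type bigint;
  create view postgraphile.movie as
    select id,
           title,
           to_char(created_at at time zone 'UTC', 'YYYY-MM-DD"T"HH24:MI:SS"Z"') as created_at
      from app_data.movie;
  -- Re-creating the view drops its grants, so re-grant before committing.
  grant select on postgraphile.movie to graphile_web;
commit;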

PostgreSQL composite types

Graphile does an excellent job reading the PostgreSQL database schema and transforming tables and basic views into a GraphQL schema, but our experience revealed limitations in how Graphile describes nested types when PostgreSQL Aggregate Functions or JSON Functions exist within a view. Native PostgreSQL functions such as json_build_object will be translated into a GraphQL JSON type, which is simply a String, devoid of any internal structure. For example, take this simplistic view returning a JSON object:

postgres_test_db=# create view postgraphile.json_object_example as
select json_build_object('hello world'::text, 1, '2'::text, 3)
as json;
postgres_test_db=# select * from postgraphile.json_object_example;
json
---------------------------
{"hello world": 1, "2": 3}
(1 row)

In the generated schema, the data type is JSON:

The internal structure of the json field (the hello world and 2 sub-fields) is opaque in the generated GraphQL schema.

To further describe the internal structure of the json field — exposing it within the generated schema — define a composite type, and create the view such that it returns that type:

postgres_test_db=# CREATE TYPE postgraphile.custom_type AS (
"hello world" integer,
field_2 integer
);

Next, create a function that returns that type:

postgres_test_db=# CREATE FUNCTION postgraphile.custom_type(
"hello world" integer,
"2" integer
)
RETURNS postgraphile.custom_type
AS 'select $1, $2'
LANGUAGE SQL;

Finally, create a view that returns that type:

postgres_test_db=# create view postgraphile.json_object_example2 as
select postgraphile.custom_type(1, 3)
as json;
postgres_test_db=# select * from postgraphile.json_object_example2;
json
-------
(1,3)
(1 row)

At first glance, that does not look very useful, but hold that thought: before viewing the generated schema, define comments on the view, custom type, and fields of the custom type to take advantage of Graphile’s smart comments:

postgres_test_db=# comment on
type postgraphile.custom_type
is E'A description for the custom type';
postgres_test_db=# comment on
view postgraphile.json_object_example2
is E'A description for the view';
postgres_test_db=# comment on
column postgraphile.custom_type."hello world"
is E'A description for hello world';
postgres_test_db=# comment on
column postgraphile.custom_type.field_2
is E'@name field_two\nA description for the second field';

Now, when the schema is viewed, the json field no longer shows up with opaque type JSON, but with CustomType:

(also note that the comment made on the view — A description for the view — shows up in the documentation for the query field).

Clicking CustomType displays the fields of the custom type, along with their comments:

Notice that in the custom type, the second field was named field_2, but the Graphile smart comment renames it to field_two, which Graphile then camel-cases to fieldTwo. Also, the descriptions for both fields display in the generated GraphQL schema.

Allow “full access” to the Graphile-generated schema (during development)

Initially, the proposal to use Graphile was met with vigorous dissent when discussed as an option in a “one schema to rule them all” architecture. Legitimate concerns about security (how does this integrate with our IAM infrastructure to enforce row-level access controls within the database?) and performance (how do you limit queries to avoid DDoSing the database by selecting all rows at once?) were raised about providing open access to database tables with a SQL-like query interface. However, in the context of GQLMS for rapid development of internal apps by small teams, having the default Graphile behavior of making all columns available for filtering allowed the UI team to rapidly iterate through a number of new features without needing to involve the backend team. This is in contrast to other development models where the UI and backend teams first agree on an initial API contract, the backend team implements the API, the UI team consumes the API and then the API contract evolves as the needs of the UI change during the development life cycle.

Initially, the overall app’s performance was poor as the UI often needed multiple queries to fetch the desired data. However, once the app’s behavior had been fleshed out, we quickly created new views satisfying each UI interaction’s needs such that each interaction only required a single call. Because these requests run on the database in native code, we could perform sophisticated queries and achieve high performance through the appropriate use of indexes, denormalization, clustering, etc.

Once the “public API” between the UI and backend solidified, we “hardened” the GraphQL schema, removing all unnecessary queries (created by Graphile’s default settings) by marking tables and views with the smart comment @omit. Also, the default behavior is for Graphile to generate mutations for tables and views, but the smart comment @omit create,update,delete will remove the mutations from the schema.
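
As a sketch of what that hardening looks like (object names are illustrative), these are ordinary PostgreSQL comments carrying Graphile smart-comment tags:

-- Remove the auto-generated queries for a view the UI never ended up using.
comment on view postgraphile.internal_scratch is E'@omit';

-- Keep the view queryable but drop the generated mutations.
comment on view postgraphile.movie is E'@omit create,update,delete';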

Conclusion

For those taking a schema-first approach to their GraphQL API development, the automatic GraphQL schema generation capabilities of Graphile will likely unacceptably restrict schema designers. Graphile may be difficult to integrate into an existing enterprise IAM infrastructure if fine-grained access controls are required. And adding custom queries and mutations to a Graphile-generated schema (i.e. to expose a gRPC service call needed by the UI) is something we currently do not support in our Docker image. However, we recently became aware of Graphile’s makeExtendSchemaPlugin, which allows custom types, queries, and mutations to be merged into the schema generated by Graphile.

That said, the successful implementation of an internal app over 4–6 weeks with limited initial requirements and an ad hoc distributed team (with no previous history of collaboration) raised a large amount of interest throughout the Netflix Studio. Other teams within Netflix are finding the GQLMS approach of:

1) using standard GraphQL constructs and utilities to expose the database-as-API

2) leveraging custom PostgreSQL types to craft a GraphQL schema

3) increasing flexibility by auto-generating a large API from a database

4) and exposing additional custom business logic and data types alongside those generated by Graphile

to be a viable solution for internal CRUD tools that would historically have used REST. Having a standardized Docker container hosting Graphile provides teams the necessary infrastructure by which they can quickly iterate on the prototyping and rapid application development of new tools to solve the ever-changing needs of a global media studio during these challenging times.


Beyond REST was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

New global ID format coming to GraphQL

Post Syndicated from Wissam Abirached original https://github.blog/2021-02-10-new-global-id-format-coming-to-graphql/

The GitHub GraphQL API has been publicly available for over 4 years now. Its usage has grown immensely over time, and we’ve learned a lot from running one of the largest public GraphQL APIs in the world. Today, we are introducing a new format for object identifiers and a roll out plan that brings it to our GraphQL API this year.

What is driving the change?

As GitHub grows and reaches new scale milestones, we have come to the realization that the current format of Global IDs in our GraphQL API will not support our projected growth over the coming years. The new format gives us more flexibility and scalability in handling your requests even faster.

What exactly is changing?

We are changing the Global ID format in our GraphQL API. As a result, all object identifiers in GraphQL will change and some identifiers will become longer than they are now. Since you can get an object’s Global ID via the REST API, these changes will also affect an object’s node_id returned via the REST API. Object identifiers will continue to be opaque strings and should not be decoded.

How will this be rolled out?

We understand that our APIs are a critical part of your engineering workflows, and our goal is to minimize the impact as much as possible. In order to give you time to migrate your implementations, caches, and data records to the new Global IDs, we will go through a gradual rollout and thorough deprecation process that includes three phases.

  1. Introduce new format: This phase will introduce the new Global IDs into the wild on a type by type basis, for newly created objects only. Existing objects will continue to have the same ID. We will start by migrating the least requested object types, working our way towards the most popular types. Note that the new Global IDs may be longer and, in case you were storing the ID, you should ensure you can store the longer IDs. During this phase, as long as you can handle the potentially longer IDs, no action is required by you. The expected duration of this phase is 3 months.
  2. Migrate: In this phase you should look to update your caches and data records. We will introduce migration tools allowing you to toggle between the two formats making it easy for you to update your caches to the new IDs. The migration tools will be detailed in a separate blog post, closer to launch date. You will be able to use the old or new IDs to refer to an object throughout this phase. The expected duration of this phase is 3 months.
  3. Deprecate: In this phase, all REST API requests and GraphQL queries will return the new IDs. Requests made with the old IDs will continue to work, but the response will include only the new ID, along with a deprecation warning. The expected duration of this phase is 3 months.

Once the three migration phases are complete, we will sunset the old IDs. All requests made using the old IDs will result in an error. Overall, the whole process should take 9 months, with the goal of giving you plenty of time to adjust and migrate to the new format.

Tell us what you think

If you have any concerns about the rollout of this change impacting your app, please contact us and include information such as your app name so that we can better assist you.

Open Sourcing the Netflix Domain Graph Service Framework: GraphQL for Spring Boot

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/open-sourcing-the-netflix-domain-graph-service-framework-graphql-for-spring-boot-92b9dcecda18

By Paul Bakker and Kavitha Srinivasan, Images by David Simmer, Edited by Greg Burrell

Netflix has developed a Domain Graph Service (DGS) framework and it is now open source. The DGS framework simplifies the implementation of GraphQL, both for standalone and federated GraphQL services. Our framework is battle-hardened by our use at scale.

By open-sourcing the project, we hope to contribute to the Java and GraphQL communities and learn from and collaborate with everyone who will be using the framework to make it even better in the future.

The key features of the DGS Framework include:

  • Annotation-based Spring Boot programming model
  • Test framework for writing query tests as unit tests
  • Gradle Code Generation plugin to create Java/Kotlin types from a GraphQL schema
  • Easy integration with GraphQL Federation
  • Integration with Spring Security
  • GraphQL subscriptions (WebSockets and SSE)
  • File uploads
  • Error handling
  • Automatic support for interface/union types
  • A GraphQL client for Java
  • Pluggable instrumentation

Why We Needed a DGS Framework

Around Spring 2019, Netflix embarked on a great adventure towards implementing a federated GraphQL architecture. Our colleagues wrote a Netflix Tech Blog post describing the details of this architecture. The transition to the new federated architecture meant that many of our backend teams needed to adopt GraphQL in our Java ecosystem. As you may recall from a previous blog post, Netflix has standardized on Spring Boot for backend development. Therefore, to make this federated architecture a success, we needed to have a great developer experience for GraphQL in Spring Boot.

We created our framework on top of Spring Boot and it leverages the graphql-java library. This framework was initially intended to be internal only, focusing on integration with the Netflix ecosystem for tracing, logging, metrics, etc. However, proper modularization of the framework was always top of mind. It became apparent that much of the framework we had built was not actually Netflix specific. The framework was mostly just an easier way to build GraphQL services, both standalone and federated.

Schema-First Development

A schema represents the GraphQL API. The schema is what makes GraphQL so powerful and different from REST. A GraphQL schema describes the API in terms of Query and Mutation operations along with their related types and fields. The API user can specify precisely which fields to retrieve in a query, making a GraphQL API very flexible.

There are two different approaches to GraphQL development: schema-first and code-first development. With schema-first development, you manually define your API’s schema using the GraphQL Schema Language. The code in your service only implements this schema.

With code-first development, you don’t have a schema file. Instead, the schema gets generated at runtime based on definitions in code.

Both approaches, schema-first and code-first, are supported in our framework. At Netflix we strongly prefer schema-first development because:

  1. The schema design is front and center of the developer experience.
  2. It provides an easy way for tooling to consume the schema.
  3. Backward-incompatible changes are more obvious with schema diffs. Backward compatibility is even more critical when working in a Federated GraphQL architecture.

Although it might be marginally quicker to generate schema from the code, putting the time into designing your schema in a human readable, collaborative way is well worth the effort towards a better API.

The Framework in Action

The framework’s core revolves around the annotation-based programming model familiar to Spring Boot developers. Comprehensive documentation is available on the website but let’s walk through an example to show you how easy it is to use this framework.

Let’s start with a simple schema.
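
A minimal sketch of what such a schema might look like (the exact type and field names are assumptions):

type Query {
    shows(titleFilter: String): [Show]
}

type Show {
    title: String
    releaseYear: Int
}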

To implement this API, we need to write a data fetcher.
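
A sketch of that data fetcher against the schema above, using the framework’s @DgsComponent, @DgsData, and @InputArgument annotations (the Show type and the hard-coded sample data are assumptions made for this example):

import com.netflix.graphql.dgs.DgsComponent;
import com.netflix.graphql.dgs.DgsData;
import com.netflix.graphql.dgs.InputArgument;

import java.util.List;
import java.util.stream.Collectors;

@DgsComponent
public class ShowsDataFetcher {

    // Normally generated from the schema by the DGS codegen plugin; defined inline here.
    public record Show(String title, Integer releaseYear) {}

    // In a real service this data would come from a database or a downstream service.
    private final List<Show> shows = List.of(
            new Show("Stranger Things", 2016),
            new Show("Ozark", 2017));

    @DgsData(parentType = "Query", field = "shows")
    public List<Show> shows(@InputArgument("titleFilter") String titleFilter) {
        if (titleFilter == null) {
            return shows;
        }
        return shows.stream()
                .filter(show -> show.title().contains(titleFilter))
                .collect(Collectors.toList());
    }
}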

The Show type is a simple POJO that we would typically generate using the DGS Code Generation plugin for Gradle. A method annotated with @DgsData implements a data fetcher for a field. Note that we don’t need data fetchers for each field; we can return Java objects, and the framework will take care of the rest. The framework also has many conveniences, such as the @InputArgument annotation used in this example.

This code is enough to get a GraphQL endpoint running. Just start the Spring Boot application, and the /graphql endpoint will be available, along with the GraphiQL query editor on /graphiql that comes out of the box. Although the code in this example is straightforward, it wouldn’t look much different if we work with Federated types, use @Secured, or add metrics and tracing using an extension point. The framework takes care of all the heavy lifting.

Another key feature is support for lightweight query tests. These tests allow you to execute queries without the need to work with the HTTP endpoint. The tests look and feel like plain JUnit tests.
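
A sketch of such a test, assuming the ShowsDataFetcher sketched above (DgsQueryExecutor and DgsAutoConfiguration are provided by the framework; the query and JSON path here are illustrative):

import com.netflix.graphql.dgs.DgsQueryExecutor;
import com.netflix.graphql.dgs.autoconfig.DgsAutoConfiguration;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import java.util.List;

import static org.assertj.core.api.Assertions.assertThat;

@SpringBootTest(classes = {DgsAutoConfiguration.class, ShowsDataFetcher.class})
class ShowsDataFetcherTest {

    @Autowired
    DgsQueryExecutor dgsQueryExecutor;

    @Test
    void shows() {
        // Runs the query directly against the framework, no HTTP endpoint involved.
        List<String> titles = dgsQueryExecutor.executeAndExtractJsonPath(
                "{ shows { title } }",
                "data.shows[*].title");

        assertThat(titles).contains("Ozark");
    }
}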

Full documentation for the framework is available in the DGS Framework GitHub repository.

Fitting into the GraphQL Server Ecosystem

So how exactly does the DGS framework fit into the existing GraphQL ecosystem? The current ecosystem comprises servers, clients, federated gateways, and tooling to help with query testing, schema management, code generation, etc. When it comes to building GraphQL servers on the JVM, there are both schema-first and code-first libraries available.

A popular code-first library is graphql-kotlin for Kotlin. graphql-java is the most popular choice for implementing schema-first GraphQL APIs in Java, but it is designed to be a low-level library. The graphql-java-kickstart starter is a set of libraries for implementing GraphQL services, providing graphql-java-tools and graphql-java-servlet on top of graphql-java.

Regardless of whether you use Java or Kotlin, our framework provides an easy way to build GraphQL services in Spring Boot. It can be used to build a standalone service as well as in the context of Federated GraphQL.

Federation

The DGS Framework provides a convenient way to implement GraphQL services with federation. Federation allows services to share a unified graph exposed by a gateway. Typically, services share and extend types defined in the unified schema using the @extends directive as defined by Apollo’s federation specification. This is an effective way to split the ownership of a large monolithic GraphQL schema across microservices.

For an incoming query, the federated gateway constructs a query plan to call out to the required services to fulfill that query. Each service, in turn, needs to be able to respond to the _entities query in order to partially fulfill the query for the data it owns.

Here is an example of a Reviews service that extends the Show type defined earlier with a reviews field:
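
A sketch of what the Reviews DGS schema might look like, using the federation directives (the id key field and the Review type’s fields are assumptions):

type Show @key(fields: "id") @extends {
    id: ID! @external
    reviews: [Review]
}

type Review {
    starScore: Int
    submittedBy: String
}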

Federated GraphQL Architecture with Shows and Reviews DGSs

Given this schema, the Reviews DGS needs to implement a resolver for the federated Show type with the reviews field populated. This can be done easily using the @DgsEntityFetcher annotation as shown here:
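
A sketch of what that Reviews DGS code might look like (the Show and Review types and the review lookup are assumptions):

import com.netflix.graphql.dgs.DgsComponent;
import com.netflix.graphql.dgs.DgsData;
import com.netflix.graphql.dgs.DgsDataFetchingEnvironment;
import com.netflix.graphql.dgs.DgsEntityFetcher;

import java.util.List;
import java.util.Map;

@DgsComponent
public class ShowReviewsDataFetcher {

    public record Show(String id) {}
    public record Review(int starScore, String submittedBy) {}

    // Resolves the federated Show type when the gateway sends an _entities query.
    @DgsEntityFetcher(name = "Show")
    public Show show(Map<String, Object> values) {
        return new Show((String) values.get("id"));
    }

    // Populates the reviews field that this DGS adds to Show.
    @DgsData(parentType = "Show", field = "reviews")
    public List<Review> reviews(DgsDataFetchingEnvironment dfe) {
        Show show = dfe.getSource();
        // In a real service, look the reviews up in the Reviews data store by show.id().
        return List.of(new Review(5, "reviewer-123"));
    }
}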

The framework also makes it easy to test federated queries using code generation to generate the _entities query for the service based on the schema. The complete code for the given example can be found here.

Framework Architecture

From the early days of development, we focused on good modularization of the code. This was an important design choice that made it possible to open source most of the framework without impacting our internal teams. We couldn’t use the module system introduced in Java 9 yet, because a lot of applications at Netflix are still using Java 8. However, with the help of Gradle api and implementation modules, we were able to create a clean module structure. At Netflix, we have many extensions for Spring Boot to integrate with our infrastructure. We call this Spring Boot Netflix. The DGS framework is built on standard open-source Spring Boot. On top of that, we have some modules that integrate with our specific infrastructure and use only extension points provided by the core framework.

The following is a diagram of how the modules fit together:

DGS Framework with Netflix and OSS modules

Distributed Tracing and Metrics

At Netflix, we have custom infrastructure for features like tracing, metrics, distributed logging, and authentication/authorization. As mentioned earlier, the DGS framework integrates with this infrastructure to provide a seamless experience out of the box. While these features are not open-sourced, they are easy enough to add to the framework.

The framework supports Instrumentation classes as defined in the graphql-java library. If you implement the Instrumentation interface and annotate it with @Component, the framework is able to pick it up automatically. You can find some reference examples in our documentation. In the future, we are hopeful and excited to see community contributions around common patterns for distributed tracing and metrics.

Try It Out Today

To get started with the DGS Framework, refer to our documentation and tutorials. To contribute to the DGS framework, please check out the DGS Framework project on GitHub. We also have a Gradle code generation plugin for generating Java and Kotlin types from a GraphQL schema. To contribute to the code generation plugin, please check out the project on GitHub.

A Team Effort

The DGS Framework has been a success at Netflix owing to the efforts of multiple teams coming together. We would like to acknowledge our close collaborators from the BFG team with whom we started on this amazing journey. We would also like to thank our many users for their timely feedback and code contributions.

If you are passionate about GraphQL and building great developer experiences then check out the many job opportunities on our Netflix website.


Open Sourcing the Netflix Domain Graph Service Framework: GraphQL for Spring Boot was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

How Netflix Scales its API with GraphQL Federation (Part 2)

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-2-bbe71aaec44a

In our previous post and QConPlus talk, we discussed GraphQL Federation as a solution for distributing our GraphQL schema and implementation. In this post, we shift our attention to what is needed to run a federated GraphQL platform successfully — from our journey implementing it to lessons learned.

Netflix GraphQL Federation

Our Journey so Far

Over the past year, we’ve implemented the core infrastructure pieces necessary for a federated GraphQL architecture as described in our previous post:

Studio Edge Architecture

The first Domain Graph Service (DGS) on the platform was the former GraphQL monolith that we discussed in our first post (Studio API). Next, we worked with a few other application teams to make DGSs that would expose their APIs alongside the former monolith. We had our first Studio applications consuming the federated graph, without any performance degradation, by the end of 2019. Once we knew that the architecture was feasible, we focused on readying it for broader usage. Our goal was to open up the Studio Edge platform for self-service in April 2020.

April 2020 was a turbulent time with the pandemic and overnight transition to working remotely. Nevertheless, teams started to jump into the graph in droves. Soon we had hundreds of engineers contributing directly to the API on a daily basis. And what about that Studio API monolith that used to be a bottleneck? We migrated the fields exposed by Studio API to individually owned DGSs without breaking the API for consumers. The original monolith is slated to be completely deprecated by the end of 2020.

This journey hasn’t been without its challenges. The biggest challenge was aligning on this strategy across the organization. Initially, there was a lot of skepticism and dissent; the concept was fairly new and would require high alignment across the organization to be successful. Our team spent a lot of time addressing dissenting points and making adjustments to the architecture based on feedback from developers. Through our prototype development and proactive partnership with some key critical voices, we were able to instill confidence and close crucial gaps.

Once we achieved broad alignment on the idea, we needed to ensure that adoption was seamless. This required building robust core infrastructure, ensuring a great developer experience, and solving for key cross-cutting concerns.

Core Infrastructure

Our GraphQL Gateway is based on Apollo’s reference implementation and is written in Kotlin. This gives us access to Netflix’s Java ecosystem, while also giving us robust language features such as coroutines for efficient parallel fetches and an expressive type system with null safety.

The schema registry is developed in-house, also in Kotlin. For storing schema changes, we use an internal library that implements the event sourcing pattern on top of the Cassandra database. Using event sourcing allows us to implement new developer experience features such as the Schema History view. The schema registry also integrates with our CI/CD systems like Spinnaker to automatically set up cloud networking for DGSs.

Developer Education & Experience

In the previous architecture, only the monolith Studio API team needed to learn GraphQL. In Studio Edge, every DGS team needs to build expertise in GraphQL. GraphQL has its own learning curve and can get especially tricky for complex cases like batching & lookahead. Also, as discussed in the previous post, understanding GraphQL Federation and implementing entity resolvers is not trivial either.

We partnered with Netflix’s Developer Experience (DevEx) team to build out documentation, training materials, and tutorials for developers. For general GraphQL questions, we lean on the open source community plus cultivate an internal GraphQL community to discuss hot topics like pagination, error handling, nullability, and naming conventions.

DGS Framework & Developer Tools

To make it easy for backend engineers to build a GraphQL DGS, the DevEx team built a “DGS Framework” on top of GraphQL Java and Spring Boot. The framework takes care of all the cross-cutting concerns of running a GraphQL service in production while also making it easier for developers to write GraphQL resolvers. In addition, DevEx built robust tooling for pushing schemas to the Schema Registry and a Self Service UI for browsing the schemas of the various DGSs. Check out their conference talk and expect a future blog post from our colleagues. The DGS framework is planned to be open-sourced in early 2021.

Schema Governance

Netflix’s studio data is extremely rich and complex. Early on, we anticipated that active schema management would be crucial for schema evolution and overall health. We had a Studio Data Architect already in the org who was focused on data modeling and alignment across Studio. We engaged with them to determine graph schema best practices to best suit the needs of Studio Engineering.

Our goal was to design a GraphQL schema that was reflective of the domain itself, not the database model. UI developers should not have to build Backends For Frontends (BFF) to massage the data for their needs, rather, they should help shape the schema so that it satisfies their needs. Embracing a collaborative schema design approach was essential to achieving this goal.

Schema Design Workflow

The collaborative design process involves feedback and reviews across team boundaries. To streamline schema design and review, we formed a schema working group and a managed technical program for on-boarding to the federated architecture. While reviews add overhead to the product development process, we believe that prioritizing the quality of the graph model will reduce the amount of future changes and reworking needed. The level of review varies based on the entities affected; for the core federated types, more rigor is required (though tooling helps streamline that flow).

We have a deprecation workflow in place for evolving the schema. We’ve leveraged GraphQL’s deprecation feature and also track usage stats for every field in the schema. Once the stats show that a deprecated field is no longer used, we can make a backward incompatible change to remove the field from the schema.
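
As an illustrative sketch (type and field names assumed), a deprecation is declared directly in the schema, where clients and tooling can see it:

type Production {
    # Kept until the usage stats show no remaining clients.
    locationName: String @deprecated(reason: "Use shootingLocation instead.")
    shootingLocation: String
}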

Clients with Deprecated Field Usage

We embraced a schema-first approach instead of generating our schema from existing models such as the Protobuf objects in our gRPC APIs. While Protobufs and gRPC are excellent solutions for building service APIs, we prefer decoupling our GraphQL schema from those layers to enable cleaner graph design and independent evolvability. In some scenarios, we implement generic mapping code from GraphQL resolvers to gRPC calls, but the extra boilerplate is worth the long-term flexibility of the GraphQL API.

Underlying our approach is a foundation of “context over control”, which is a key tenet of Netflix’s culture. Instead of trying to hold tight control of the entire graph, we give guidance and context to product teams so that they can apply their domain knowledge to make a flexible API for their domain. As this architecture matures, we will continue to monitor schema health and develop new tooling, processes, and best practices where needed.

Observability

In our previous architecture, observability was achieved through manual analysis and routing via the API team, which scaled poorly. For our federated architecture, we prioritized solving observability needs in a more scalable manner. We prioritized three areas:

  • Alerting — report when something goes awry
  • Discovery — easily determine what isn’t working
  • Diagnosis — debug why something isn’t working

Our guiding metrics in this space are mean time to resolution (MTTR) and service level objectives and indicators (SLO/SLI).

We teamed up with experts from Netflix’s Telemetry team. We integrated the Gateway and DGS architectural components with Zipkin, the internal distributed tracing tool Edgar, and application monitoring tool TellTale. In GraphQL, almost every response is a 200 with custom errors in the error block. We introspect these custom error codes from the response and emit them to our metrics server, Atlas. These integrations created a great foundation of rich visibility and insights for the consumers and developers of the GraphQL API.

Edgar Trace for a Federated Request Lifecycle
Timeline View for a Federated Request

Distributed Log Correlation helps with debugging more complex server issues. By surfacing the application level logging details for all systems involved in processing a request, we gain deeper insights into what happened across the stack. Developers can easily see what was happening around the same time as a given request, to inspect surrounding factors that might have impacted an interaction.

Logs across multiple services for a Federated Request

To solve the “who do I ask about…” routing problem, we integrated deep linking from GraphQL types and fields to their owning team’s support channels. Finding support is now as simple as clicking a link from a trace, which helps shorten MTTR and reduce the number of times the gateway team needs to get involved.

Securing the Federated Graph

Our goal is to enable robust and consistent security practices across the federated architecture. To achieve this, we partnered with the security experts at Netflix to build security into the graph. Let’s look at two essential parts of our security solution: AuthN and AuthZ.

Authentication

All of our product experiences in the Studio space require an authenticated account, so we restrict the GraphQL Gateway access to only trusted authenticated callers. Additionally, Graph Introspection is restricted to Netflix internal developers.

Authorization

Before Studio Edge, authorization logic was fragmented across teams. Some teams implemented authorization in their BFFs, some in microservices, and others did both for good measure. The result was often a different authorization story for a given piece of data depending on which UI a user was accessing it through. UI teams also found themselves needing to implement (and re-implement) authorization checks with each new frontend.

In Studio Edge, we delegated the authorization responsibility to DGS owners. This resulted in consistent authorization for the same user across different applications. Plus, Product Managers, Engineers and the Security team can easily get a bird’s eye view of who has access to each data type and how.

We have multiple authorization offerings within Netflix: from a simple system that grants access based on user identity to a more granular system that brings in the concept of roles and capabilities. DGS developers can choose a solution based on their needs. Then they simply annotate their resolvers with the @Secured annotation and configure it to use one of the available systems. If needed, more complex authorization can be implemented in the resolver or in downstream systems.
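
A rough sketch of what that looks like on a resolver (the role name, field, and Production type are hypothetical; @Secured is the standard Spring Security annotation):

import com.netflix.graphql.dgs.DgsComponent;
import com.netflix.graphql.dgs.DgsData;
import org.springframework.security.access.annotation.Secured;

import java.util.List;

@DgsComponent
public class ProductionsDataFetcher {

    public record Production(String name) {}

    // Only callers granted the (hypothetical) production-read role may resolve this field.
    @Secured("ROLE_PRODUCTION_READ")
    @DgsData(parentType = "Query", field = "productions")
    public List<Production> productions() {
        // Fetch from the owning data store in a real service.
        return List.of(new Production("Stranger Things S4"));
    }
}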

Future of Authorization

We are currently prototyping a GraphQL-aware authorization solution. The Schema Registry automatically generates Access Control Groups (ACGs) for each field and its corresponding type when its schema is registered. Product managers & DGS Engineers decide membership and rules for these generated ACGs. Since the ACGs map to a field in GraphQL, the DGS framework then automatically applies the rules associated with the ACG during execution.

Architecting for Failure

The GraphQL Gateway is the single entry point for all requests; a failure on the gateway can cause significant disruptions. Following Netflix engineering best practices, we assume failures will happen and design ways to mitigate the impact of those failures. These are our design principles for ensuring the gateway layer is resilient:

  1. Single purpose
  2. Stateless service
  3. Demand controlled
  4. Multi-region
  5. Sharded by functionality

First, we focus the responsibilities of the gateway layer on a single purpose: parse client queries, then build and execute query plans. By reducing the scope, we limit the range of problems that can occur. We aim to perform any additional resource-intensive operations off-box with the exception of logging and metrics. Taking on additional unrelated logic in the gateway layer could increase surface area for failures in this critical tier.

Second, we run multiple stateless instances of the gateway service. Any gateway instance is able to generate and execute a query plan for any request. When we do code changes to the gateway layer, we rigorously test them before rolling out to production.

Third, we seek to balance the resources each request consumes through applying demand control. We rate-limit callers to avoid overloading the underlying databases that are the source of most of our domain elements. We also run a static query cost calculation on all incoming queries and reject expensive queries to avoid gridlock in gateway and DGS resources. Our partners understand these tradeoffs and work with us to meet these requirements, reworking expensive queries and reducing high volume callers.

Fourth, we deploy our gateway layer to multiple AWS regions around the world. This allows us to limit the blast radius for problems that inevitably arise. When problems happen, we can fail over to another region to ensure our clients are minimally impacted.

Last, we deploy multiple functional shards of our gateway layer. The code is the same in each shard and incoming requests are routed based on category. For example, GraphQL subscriptions generally result in long-lived connections while Queries & Mutations are short-lived. We use a separate fleet of instances for Subscriptions so “running out of connections” does not affect the availability of Queries and Mutations.

There is more we can do to improve resilience. We have plans to do canary deployments and analysis for gateway deployments and, eventually, schema changes. Today, our gateway dynamically updates its schema by polling the schema registry. We are in the process of decoupling these by storing the federation config in a versioned S3 bucket, making the gateway resilient to schema registry failures.

Closing Thoughts

GraphQL and Federation have been a productivity multiplier for Studio applications. Motivated by this, we’ve recently prototyped using GraphQL Federation for the Netflix consumer app search page on iOS & Android. To do this, we created three DGSs to provide the data for a minimal portion of the consumer graph. We are sending a small subset of users to this alternative stack and measuring high-level metrics. We are excited to see the results and explore further applicability in the Netflix consumer space.

Despite our positive experience, GraphQL Federation is early in its maturity lifecycle and may not be the best fit for every team or organization. Learning GraphQL and DGS development, running a federation layer, and doing a migration requires high commitment from partner teams and seamless cross-functional collaboration. If you’re considering going in this direction, we recommend checking out Apollo’s SaaS offering for Federation and the many online resources for learning GraphQL. For ecosystems like ours with a large swath of microservices that need to be aggregated together, the development velocity and improved operability have made the transition worth it.

In closing, we want to hear from you! If you have already implemented federation or tried to solve this problem with another approach, we would love to learn more. Sharing knowledge is one of the ways our industry learns and improves rapidly. Finally, if you’d like to be a part of solving complex and interesting problems like this at Netflix scale, check out our jobs page or reach out to us directly.

By Tejas Shikhare, Edited by Philip Fisher-Ogden

Additional Credits: Stephen Spalding, Jennifer Shin, Robert Reta, Antoine Boyer, Bruce Wang, David Simmer


How Netflix Scales its API with GraphQL Federation (Part 2) was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

How Netflix Scales its API with GraphQL Federation (Part 1)

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/how-netflix-scales-its-api-with-graphql-federation-part-1-ae3557c187e2

Netflix is known for its loosely coupled and highly scalable microservice architecture. Independent services allow for evolving at different paces and scaling independently. Yet they add complexity for use cases that span multiple services. Rather than exposing 100s of microservices to UI developers, Netflix offers a unified API aggregation layer at the edge.

UI developers love the simplicity of working with one conceptual API for a large domain. Back-end developers love the decoupling and resilience offered by the API layer. But as our business has scaled, our ability to innovate rapidly has approached an invisible asymptote. As we’ve grown the number of developers and increased our domain complexity, developing the API aggregation layer has become increasingly harder.

In order to address this rising problem, we’ve developed a federated GraphQL platform to power the API layer. This solves many of the consistency and development velocity challenges with minimal tradeoffs on dimensions like scalability and operability. We’ve successfully deployed this approach for Netflix’s studio ecosystem and are exploring patterns and adaptations that could work in other domains. We’re sharing our story to inspire others and encourage conversations around applicability elsewhere.

Case Study: Studio Edge

Intro to Studio Ecosystem

Netflix is producing original content at an accelerated pace. From the time a TV show or a movie is pitched to when it’s available on Netflix, a lot happens behind the scenes. This includes but is not limited to talent scouting and casting, deal and contract negotiations, production and post-production, visual effects and animations, subtitling and dubbing, and much more. Studio Engineering is building hundreds of applications and tools that power these workflows.

Netflix Studio Content Lifecycle

Studio API

Looking back a few years, one of the pains in the studio space was the growing complexity of the data and its relationships. The workflows depicted above are inherently connected, but the data and its relationships were disparate and existed in myriad microservices. The product teams solved this with two architectural patterns.

1) Single-use aggregation layers — Due to the loose coupling, we observed that many teams spent considerable effort building duplicative data-fetching code and aggregation layers to support their product needs. This was either done by UI teams via BFF (Backend For Frontend) or by a backend team in a mid-tier service.

2) Materialized views for data from other teams — some teams used a pattern of building a materialized view of another service’s data for their specific system needs. Materialized views had performance benefits, but data consistency lagged by varying degrees. This was not acceptable for the most important workflows in the Studio. Inconsistent data across different Studio applications was the top support issue in Studio Engineering in 2018.

Graph API: To better address the underlying needs, our team started building a curated graph API called “Studio API”. Its goal was to provide a unified abstraction on top of data and relationships. Studio API used GraphQL as its underlying API technology and created significant leverage for accessing core shared data. Consumers of Studio API were able to explore the graph and build new features more quickly. We also observed fewer instances of data inconsistency across different UI applications, as every field in GraphQL resolves to a single piece of data-fetching code.

Studio API Graph
Studio API Architecture

Bottlenecks of Studio API

The One Graph exposed by Studio API was a runaway success; product teams loved the reusability and easy, consistent data access. But new bottlenecks emerged as the number of consumers and amount of data in the graph increased.

First, the Studio API team was disconnected from the domain expertise and the product needs, which negatively impacted the schema’s health. Second, connecting new elements from a back-end into the graph API was manual and ran counter to the rapid evolution promised by a microservice architecture. Finally, it was hard for one small team to handle the increasing operational and support burden for the expanding graph.

We knew that there had to be a better way — unified but decoupled, curated but fast moving.

Returning to Core Principles

To address these bottlenecks, we leaned into our rich history of microservices and breaking monoliths apart. We still wanted to keep the unified GraphQL schema of Studio API but decentralize the implementation of the resolvers to their respective domain teams.

As we were brainstorming the new architecture back in early 2019, Apollo released the GraphQL Federation Specification. This promised the benefits of a unified schema with distributed ownership and implementation. We ran a test implementation of the spec with promising results, and reached out to collaborate with Apollo on the future of GraphQL Federation. Our next generation architecture, “Studio Edge”, emerged with federation as a critical element.

GraphQL Federation Primer

The goal of GraphQL Federation is two-fold: provide a unified API for consumers while also giving backend developers flexibility and service isolation. To achieve this, schemas need to be created and annotated to indicate how ownership is distributed. Let’s look at an example with three core entities:

  1. Movie: At Netflix, we make titles (shows, films, shorts etc.). For simplicity, let’s assume each title is a Movie object.
  2. Production: Each Movie is associated with a Studio Production. A Production object tracks everything needed to make a Movie including shooting location, vendors, and more.
  3. Talent: the people working on a Movie are the Talent, including actors, directors, and so on.

These three domains are owned by three separate engineering teams responsible for their own data sources, business logic, and corresponding microservices. In an unfederated implementation, we would have this simple Schema and Resolvers owned and implemented by the Studio API team. The GraphQL Framework would take in queries from clients and orchestrate the calls to the resolvers in a breadth-first traversal.
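
A minimal sketch of what such an unfederated schema might look like (all field names are assumptions based on the descriptions above):

type Query {
    movies(first: Int): [Movie]
}

type Movie {
    movieId: ID!
    title: String
    production: Production
    actors: [Talent]
}

type Production {
    shootingLocation: String
    vendors: [String]
}

type Talent {
    name: String
    role: String
}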

Schema & Resolvers for Studio API

To transition to a federated architecture, we need to transfer ownership of these resolvers to their respective domains without sacrificing the unified schema. To achieve this, we need to extend the Movie type across GraphQL service boundaries:
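
A sketch of what that extension might look like in the Production DGS’s schema, consistent with the sketch above (field names remain assumptions):

extend type Movie @key(fields: "movieId") {
    movieId: ID! @external
    production: Production
}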

Federating the Movie type

This ability to extend a Movie type across GraphQL service boundaries makes Movie a Federated Type. Resolving a given field requires delegation by a gateway layer down to the owning domain services.

Studio Edge Architecture

Using the ability to federate a type, we envisioned the following architecture:

Studio Edge Architecture

Key Architectural Components

Domain Graph Service (DGS) is a standalone spec-compliant GraphQL service. Developers define their own federated GraphQL schema in a DGS. A DGS is owned and operated by a domain team responsible for that subsection of the API. A DGS developer has the freedom to decide if they want to convert their existing microservice to a DGS or spin up a brand new service.

Schema Registry is a stateful component that stores all the schemas and schema changes for every DGS. It exposes CRUD APIs for schemas, which are used by developer tools and CI/CD pipelines. It is responsible for schema validation, both for the individual DGS schemas and for the combined schema. Last, the registry composes together the unified schema and provides it to the gateway.

GraphQL Gateway is primarily responsible for serving GraphQL queries to the consumers. It takes a query from a client, breaks it into smaller sub-queries (a query plan), and executes that plan by proxying calls to the appropriate downstream DGSs.

Implementation Details

There are three main business logic components that power GraphQL Federation.

Schema Composition

Composition is the phase that takes all of the federated DGS schemas and aggregates them into a single unified schema. This composed schema is exposed by the Gateway to the consumers of the graph.

Schema Composition Phases

Whenever a new schema is pushed by a DGS, the Schema Registry validates that:

  1. New schema is a valid GraphQL schema
  2. New schema composes seamlessly with the rest of the DGSs’ schemas to create a valid composed schema
  3. New schema is backwards compatible

If all of the above conditions are met, then the schema is checked into the Schema Registry.

Query Planning and Execution

The federation config consists of all the individual DGS schemas and the composed schema. The Gateway uses the federation config and the client query to generate a query plan. The query plan breaks down the client query into smaller sub-queries that are then sent to the downstream DGSs for execution, along with an execution ordering that includes what needs to be done in sequence versus run in parallel.

Query Plan Inputs

Let’s build a simple query from the schema referenced above and see what the query plan might look like.
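
As a sketch, such a client query might look like this (field names are assumptions, chosen to match the walkthrough below):

query {
    movies(first: 10) {
        movieId
        title
        production {
            shootingLocation
        }
        actors {
            name
        }
    }
}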

Simplified Query Plan

For this query, the gateway knows which fields are owned by which DGS based on the federation config. Using that information, it breaks the client query into three separate queries to three DGSs. The first query is sent to Movie DGS since the root field movies is owned by that DGS. This results in retrieving the movieId and title fields for the first 10 movies in the dataset. Then using the movieIds it got from the previous request, the gateway executes two parallel requests to Production DGS and Talent DGS to fetch the production and actors fields for those 10 movies. Upon completion, the sub-query responses are merged together and the combined data response is returned to the caller.

A note on performance: Query Planning and Execution adds a ~10ms overhead in the worst case. This includes the compute for building the query plan, as well as the deserialization of DGS responses and the serialization of merged gateway response.

Entity Resolver

Now you might be wondering, how do the parallel sub-queries to Production and Talent DGS actually work? That’s not something that the DGS supports. This is the final piece of the puzzle.

Let’s go back to our federated type Movie. In order for the gateway to join Movie seamlessly across DGSs, all the DGSs that define and extend the Movie need to agree on one or more fields that define the primary key (e.g. movieId). To make this work, Apollo introduced the @key directive in the Federation Spec. Second, DGSs have to implement a resolver for a generic Query field, _entities. The _entities query returns a union type of all the federated types in that DGS. The gateway uses the _entities query to look up Movie by movieId.
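
For example, the sub-query the gateway sends to the Production DGS might look like the following; the _entities field and _Any type come from the federation spec, while the Movie fields are assumptions consistent with the earlier sketch:

query ($representations: [_Any!]!) {
    _entities(representations: $representations) {
        ... on Movie {
            production {
                shootingLocation
            }
        }
    }
}

with variables along the lines of:

{
    "representations": [
        { "__typename": "Movie", "movieId": "1" },
        { "__typename": "Movie", "movieId": "2" }
    ]
}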

Let’s take a look at what the query plan actually looks like:

Detailed Federated Query Plan

The representation object consists of the movieId and is generated from the response of the first request to Movie DGS. Since we requested the first 10 movies, we would have 10 representation objects to send to the Production and Talent DGSs.

This is similar to Relay’s Object Identification with a few differences. _Entity is a union type, while Relay’s Node is an interface. Also, with @key, there is support for variable key names and types as well as composite keys while in Relay, the id is a single opaque ID field.

Combined together, these are the ingredients that power the core of a federated API architecture.

The Journey, Summarized

Our Studio Ecosystem architecture has evolved in distinct phases, all motivated by reducing the time between idea and implementation, improving the developer experience, and streamlining operations. The architectural phases look like:

Evolution of an API Architecture

Stay Tuned

Over the past year we’ve implemented the federated API architecture components in our Studio Edge. Getting here required rapid iteration, lots of cross-functional collaborations, a few pivots, and ongoing investment. We’re live with 70 DGSs and hundreds of developers contributing to and using the Studio Edge architecture. In our next Netflix Tech Blog post, we’ll share what we learned along the way, including the cross-cutting concerns necessary to build a holistic solution.

We want to thank the entire GraphQL open-source community for all the generous contributions and paving the path towards the promise of GraphQL. If you’d like to be a part of solving complex and interesting problems like this at Netflix scale, check out our jobs page or reach out to us directly.

By Tejas Shikhare

Additional Credits: Stephen Spalding, Jennifer Shin, Philip Fisher-Ogden, Robert Reta, Antoine Boyer, Bruce Wang, David Simmer


How Netflix Scales its API with GraphQL Federation (Part 1) was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Announcing Network Analytics

Post Syndicated from Omer Yoachimik original https://blog.cloudflare.com/announcing-network-analytics/

Our Analytics Platform


Back in March 2019, we released Firewall Analytics, which provides insights into HTTP security events across all of Cloudflare’s protection suite: Firewall rule matches, HTTP DDoS attacks, the Site Security Level (which harnesses Cloudflare’s threat intelligence), and more. It helps customers tailor their security configurations more effectively. The initial release was for Enterprise customers; however, we believe that everyone should have access to powerful tools, not just large enterprises, and so in December 2019 we extended those same enterprise-level analytics to our Business and Pro customers.

Since then, we've built on top of our analytics platform: improved its usability, added more functionality, and extended it to additional Cloudflare services in the form of Account Analytics, DNS Analytics, Load Balancing Analytics, Monitoring Analytics, and more.

Our entire analytics platform harnesses the powerful GraphQL framework which is also available to customers that want to build, export and share their own custom reports and dashboards.

Extending Visibility From L7 To L3

Until recently, our dashboards were mostly HTTP-oriented and provided visibility into HTTP attributes such as the user agent, hosts, cached resources, etc. This is valuable to customers that use Cloudflare to protect and accelerate HTTP applications, mobile apps, or similar. We're able to provide them visibility into the application layer (Layer 7 in the OSI model) because we proxy their traffic at L7.

DDoS Protection for Layer 3-7

However, with Magic Transit, we don't proxy traffic at L7 but rather route it at L3 (the network layer). Using BGP Anycast, customer traffic is routed to the closest point of presence of Cloudflare's network edge, where it is filtered by customer-defined network firewall rules and automatic DDoS mitigation systems. Clean traffic is then routed via dynamic GRE Anycast tunnels to the customer's data-centers. Routing at L3 means that we have limited visibility into the higher layers, so in order to provide Magic Transit customers visibility into traffic and attacks, we needed to extend our analytics platform to the packet layer.

Magic Transit Traffic Flow

On January 16, 2020, we released the Network Analytics dashboard for Magic Transit customers and Bring Your Own IP (BYOIP) customers. This packet- and bit-oriented dashboard provides near real-time visibility into network- and transport-layer traffic patterns and DDoS attacks that are blocked at the Cloudflare edge in over 200 cities around the world.

Network Analytics – Packets & Bits Over Time

Analytics For A Year

The way we've architected the analytics data-stores enables us to provide our customers with one year's worth of insights. Traffic is sampled at the edge data-centers. From those samples, we build IP flow logs, similar to sFlow, that bucket one minute's worth of traffic grouped by destination IP, port, and protocol. IP flows include multiple packet attributes such as TCP flags, source IPs and ports, the Cloudflare data-center, etc. The source IP is considered PII data and is therefore only stored for 30 days, after which the source IPs are discarded and the logs are rolled up into one-hour groups, and then one-day groups. The one-hour roll-ups are stored for 6 months and the one-day roll-ups for 1 year.

Similarly, attack logs are stored efficiently. Attacks are stored as summaries with start/end timestamps, min/max/average/total bits and packets per second, attack type, action taken, and more. A DDoS attack could easily consist of billions of packets, which could impact performance due to the number of read/write calls to the data-store. By storing attacks as summary logs, we're able to overcome these challenges and provide attack logs going back up to 1 year.

Network Analytics via GraphQL API

We built this dashboard on the same analytics platform, meaning that our packet-level analytics are also available via GraphQL. As an example, below is an attack report query that shows the top attacker IPs, the data-center cities and countries where the attack was observed, the IP version distribution, the ASNs used by the attackers, and the ports. The query is run at the account level, meaning it provides a report across all of your IP ranges. To narrow the report down to a specific destination IP or port range, you can simply add additional filters. The same filters also exist in the UI.

{
  viewer {
    accounts(filter: { accountTag: $accountTag }) {
      topNPorts: ipFlows1mGroups(
        limit: 5
        filter: $portFilter
        orderBy: [sum_packets_DESC]
      ) {
        sum {
          count: packets
          __typename
        }
        dimensions {
          metric: sourcePort
          ipProtocol
          __typename
        }
        __typename
      }
      topNASN: ipFlows1mGroups(
        limit: 5
        filter: $filter
        orderBy: [sum_packets_DESC]
      ) {
        sum {
          count: packets
          __typename
        }
        dimensions {
          metric: sourceIPAsn
          description: sourceIPASNDescription
          __typename
        }
        __typename
      }
      topNIPs: ipFlows1mGroups(
        limit: 5
        filter: $filter
        orderBy: [sum_packets_DESC]
      ) {
        sum {
          count: packets
          __typename
        }
        dimensions {
          metric: sourceIP
          __typename
        }
        __typename
      }
      topNColos: ipFlows1mGroups(
        limit: 10
        filter: $filter
        orderBy: [sum_packets_DESC]
      ) {
        sum {
          count: packets
          __typename
        }
        dimensions {
          metric: coloCity
          coloCode
          __typename
        }
        __typename
      }
      topNCountries: ipFlows1mGroups(
        limit: 10
        filter: $filter
        orderBy: [sum_packets_DESC]
      ) {
        sum {
          count: packets
          __typename
        }
        dimensions {
          metric: coloCountry
          __typename
        }
        __typename
      }
      topNIPVersions: ipFlows1mGroups(
        limit: 2
        filter: $filter
        orderBy: [sum_packets_DESC]
      ) {
        sum {
          count: packets
          __typename
        }
        dimensions {
          metric: ipVersion
          __typename
        }
        __typename
      }
      __typename
    }
    __typename
  }
}
Attack Report Query Example

After running the query using the Altair GraphQL Client, the response is returned in JSON format.

What Do Customers Want?

As part of our product definition and design research stages, we interviewed internal customer-facing teams, including Customer Support, Solution Engineering, and more. I consider these stakeholders super-user-aggregators because they're customer-facing teams that are constantly engaging with and helping our users. After the internal research phase, we expanded externally to customers and prospects, particularly network and security engineers and leaders. We wanted to know how they expect the dashboard to fit into their workflow, what their use cases are, and how we can tailor the dashboard to their needs. Long story short, we identified two main use cases: Incident Response and Reporting. Let's go into each of these use cases in more detail.

Incident Response

I started off by asking them a simple question: "What do you do when you're paged?" We wanted to better understand their incident response process; specifically, how they'd expect to use this dashboard when responding to an incident and which metrics matter most to them when making quick, calculated decisions.

You’ve Just Been Paged

Let's say that you're a security operations engineer. It's Black Friday. You're on call. You've just been paged. Traffic levels to one of your data-centers have exceeded a safe threshold. Boom. What do you do? Responding quickly and resolving the issue as soon as possible is key.

If your workflows are similar to our customers’, then your objective is to resolve the page as soon as possible. However, before you can resolve it, you need to determine if there is any action that you need to take. For instance, is this a legitimate rise in traffic from excited Black Friday shoppers, perhaps a new game release or maybe an attack that hasn’t been mitigated? Do you need to shift traffic to another data-center or are levels still stable? Our customers tell us that these are the metrics that matter the most:

  1. Top destination IP and port – helps understand which services are being impacted
  2. Top source IPs, ports, ASNs, data-centers and data-center countries – helps identify the source of the traffic
  3. Real-time packet and bit rates – helps understand the traffic levels
  4. Protocol distribution – helps understand what type of traffic is abnormal
  5. TCP flag distribution – an abnormal distribution could indicate an attack
  6. Attack log – shows what types of traffic are being dropped or rate-limited
Customizable DDoS Attack Log

As network- and transport-layer attacks can be highly distributed and packet attributes can be easily spoofed, it's usually not practical to block IPs. Instead, the dashboard enables you to quickly identify patterns, such as an increased frequency of a specific TCP flag or increased traffic from a specific country. Identifying these patterns brings you one step closer to resolving the issue. Once you've identified them, packet-level filtering can be applied to drop or rate-limit the malicious traffic. If the attack was automatically mitigated by Cloudflare's systems, you'll see it immediately, along with the attack attributes, in the activity log. By filtering on the Attack ID, the entire dashboard becomes your attack report.

Packet/Bit Distribution by Source & Destination
TCP Flag Distribution

Reporting

During our interviews with security and network engineers, we also asked what metrics and insights they need when creating reports for their managers, C-levels, and colleagues, and when providing evidence to law-enforcement agencies. After all, processing data and creating reports can consume over a third (36%) of a security team's time (roughly 3 hours a day), and reporting is also one of the most frequent DDoS-related asks from our customers.

Add filters, select time range, print and share

On top of all of these customizable insights, we also wanted to provide a one-line summary that reflects your recent activity. The one-liner is dynamic and changes based on that activity: it tells you whether you're currently under attack and how many attacks have been blocked. If your CISO is asking for a security update, you can simply copy-paste it to convey the effectiveness of the service:

Dynamic Summary

Our customers tell us that they want to demonstrate the value of Cloudflare to their managers and peers:

  1. How much potential downtime and bandwidth did Cloudflare spare me?
  2. What are my top attacked IPs and ports?
  3. Where are the attacks coming from? What types and what are the trends?

The Secret To Creating Good Reports

What does everyone love? Cool maps! The key to a good report is adding a map showing where the attack came from. But given that packet attributes, including the source IP, can be easily spoofed, it won't do us any good to plot a map based on the locations of the source IPs; it would show a spoofed source country and is therefore useless data. Instead, we decided to show the geographic distribution of packets and bits based on the Cloudflare data-center in which they were ingested. As opposed to legacy scrubbing-center solutions with limited network infrastructure, Cloudflare has data-centers in more than 200 cities around the world. This enables us to provide a precise geographic distribution with high confidence, making your reports accurate.

Packet/Bit Distribution by geography: Data-center City & Country

Tailored For You

One of the main challenges that both we and our customers struggle with is how to process all of these data points and generate actionable insights from them. This is especially critical when responding to an incident. With that in mind, we built this dashboard with the purpose of speeding up your reporting and investigation processes. By tailoring it to your needs, we hope to make you more efficient and help you get the most out of Cloudflare's services. Got any feedback or questions? Post them below in the comments section.

If you’re an existing Magic Transit or BYOIP customer, then the dashboard is already available to you. Not a customer yet? Click here to learn more.

How we used our new GraphQL Analytics API to build Firewall Analytics

Post Syndicated from Nick Downie original https://blog.cloudflare.com/how-we-used-our-new-graphql-api-to-build-firewall-analytics/

Firewall Analytics is the first product in the Cloudflare dashboard to utilize the new GraphQL Analytics API. All Cloudflare dashboard products are built using the same public APIs that we provide to our customers, allowing us to understand the challenges they face when interfacing with our APIs. This parity helps us build and shape our products, most recently the new GraphQL Analytics API that we’re thrilled to release today.

By defining the data we want, along with the response format, our GraphQL Analytics API has enabled us to prototype new functionality and iterate quickly from our beta user feedback. It is helping us deliver more insightful analytics tools within the Cloudflare dashboard to our customers.

Our user research and testing for Firewall Analytics surfaced common use cases in our customers’ workflow:

  • Identifying spikes in firewall activity over time
  • Understanding the common attributes of threats
  • Drilling down into granular details of an individual event to identify potential false positives

We can address all of these use cases using our new GraphQL Analytics API.

GraphQL Basics

Before we look into how to address each of these use cases, let’s take a look at the format of a GraphQL query and how our schema is structured.

A GraphQL query is composed of a structured set of fields, for which the server provides corresponding values in its response. The schema defines which fields are available and their types. You can find more information about the GraphQL query syntax and format in the official GraphQL documentation.

To run some GraphQL queries, we recommend downloading a GraphQL client, such as GraphiQL, which lets you explore our schema and experiment with queries. You can find documentation on getting started with this in our developer docs.

At the top level of the schema is the viewer field. This represents the top-level node of the user running the query. Within this, we can query the zones field to find zones the current user has access to, providing a filter argument with a zoneTag of the identifier of the zone we'd like to narrow down to.

{
  viewer {
    zones(filter: { zoneTag: "YOUR_ZONE_ID" }) {
      # Here is where we'll query our firewall events
    }
  }
}

Now that we have a query that finds our zone, we can start querying the firewall events which have occurred in that zone, to help solve some of the use cases we’ve identified.

Visualising spikes in firewall activity

It’s important for customers to be able to visualise and understand anomalies and spikes in their firewall activity, as these could indicate an attack or be the result of a misconfiguration.

Plotting events in a timeseries chart, by their respective action, provides users with a visual overview of the trend of their firewall events.

Within the zones field in the query we created earlier, we can further query our firewall event aggregates using the firewallEventsAdaptiveGroups field and providing arguments for:

  • A limit for the count of groups
  • A filter for the date range we’re looking for (combined with any user-entered filters)
  • A list of fields to orderBy (in this case, just the datetimeHour field that we’re grouping by).

By adding the dimensions field, we’re querying for groups of firewall events, aggregated by the fields nested within dimensions. In this case, our query includes the action and datetimeHour fields, meaning the response will be groups of firewall events which share the same action, and fall within the same hour. We also add a count field, to get a numeric count of how many events fall within each group.

query FirewallEventsByTime($zoneTag: string, $filter: FirewallEventsAdaptiveGroupsFilter_InputObject) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      firewallEventsAdaptiveGroups(
        limit: 576
        filter: $filter
        orderBy: [datetimeHour_DESC]
      ) {
        count
        dimensions {
          action
          datetimeHour
        }
      }
    }
  }
}

Note – Each of our groups queries requires a limit to be set. A firewall event can have one of 8 possible actions, and we are querying over a 72-hour period, so at most we'll end up with 8 × 72 = 576 groups. We can set that as the limit for our query.

This query would return a response in the following format:

{
  "viewer": {
    "zones": [
      {
        "firewallEventsAdaptiveGroups": [
          {
            "count": 5,
            "dimensions": {
              "action": "jschallenge",
              "datetimeHour": "2019-09-12T18:00:00Z"
            }
          }
          ...
        ]
      }
    ]
  }
}

We can then take these groups and plot each as a point on a time series chart. Mapping over the firewallEventsAdaptiveGroups array, we can use each group's count property on the y-axis for our chart, then use the nested fields within the dimensions object, with action as the unique series and datetimeHour as the timestamp on the x-axis.

Top Ns

After identifying a spike in activity, our next step is to highlight events with commonality in their attributes. For example, if a certain IP address or individual user agent is causing many firewall events, this could be a sign of an individual attacker, or could be surfacing a false positive.

Similarly to before, we can query aggregate groups of firewall events using the firewallEventsAdaptiveGroups field. However, in this case, instead of supplying action and datetimeHour to the group’s dimensions, we can add individual fields that we want to find common groups of.

By ordering by descending count, we’ll retrieve groups with the highest commonality first, limiting to the top 5 of each. We can add a single field nested within dimensions to group by it. For example, adding clientIP will give five groups with the IP addresses causing the most events.

We can also add a firewallEventsAdaptiveGroups field with no nested dimensions. This will create a single group which allows us to find the total count of events matching our filter.

query FirewallEventsTopNs($zoneTag: string, $filter: FirewallEventsAdaptiveGroupsFilter_InputObject) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      topIPs: firewallEventsAdaptiveGroups(
        limit: 5
        filter: $filter
        orderBy: [count_DESC]
      ) {
        count
        dimensions {
          clientIP
        }
      }
      topUserAgents: firewallEventsAdaptiveGroups(
        limit: 5
        filter: $filter
        orderBy: [count_DESC]
      ) {
        count
        dimensions {
          userAgent
        }
      }
      total: firewallEventsAdaptiveGroups(
        limit: 1
        filter: $filter
      ) {
        count
      }
    }
  }
}

Note – we can add the firewallEventsAdaptiveGroups field multiple times within a single query, each aliased differently. This allows us to fetch multiple different groupings by different fields, or with no groupings at all. In this case, we get a list of top IP addresses, top user agents, and the total events.

We can then reference each of these aliases in the UI, mapping over their respective groups to render each row with its count and a bar representing the proportion of all events that the row accounts for.

Are these firewall events false positives?

After users have identified spikes, anomalies and common attributes, we wanted to surface more information as to whether these have been caused by malicious traffic, or are false positives.

To do this, we wanted to provide additional context on the events themselves, rather than just counts. We can do this by querying the firewallEventsAdaptive field for these events.

Our GraphQL schema uses the same filter format for both the aggregate firewallEventsAdaptiveGroups field and the raw firewallEventsAdaptive field. This allows us to use the same filters to fetch the individual events which summate to the counts and aggregates in the visualisations above.

query FirewallEventsList($zoneTag: string, $filter: FirewallEventsAdaptiveFilter_InputObject) {
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      firewallEventsAdaptive(
        filter: $filter
        limit: 10
        orderBy: [datetime_DESC]
      ) {
        action
        clientAsn
        clientCountryName
        clientIP
        clientRequestPath
        clientRequestQuery
        datetime
        rayName
        source
        userAgent
      }
    }
  }
}

Once we have our individual events, we can render all of the individual fields we've requested, providing users with the additional context on each event they need to determine whether it is a false positive or not.

That's how we used our new GraphQL Analytics API to build Firewall Analytics, helping solve some of our customers' most common security workflow use cases. We're excited to see what you build with it, and the problems you can help tackle.

You can find out how to get started querying our GraphQL Analytics API using GraphiQL in our developer documentation, or learn more about writing GraphQL queries on the official GraphQL Foundation documentation.

Introducing the GraphQL Analytics API: exactly the data you need, all in one place

Post Syndicated from Filipp Nisenzoun original https://blog.cloudflare.com/introducing-the-graphql-analytics-api-exactly-the-data-you-need-all-in-one-place/

Today we’re excited to announce a powerful and flexible new way to explore your Cloudflare metrics and logs, with an API conforming to the industry-standard GraphQL specification. With our new GraphQL Analytics API, all of your performance, security, and reliability data is available from one endpoint, and you can select exactly what you need, whether it’s one metric for one domain or multiple metrics aggregated for all of your domains. You can ask questions like “How many cached bytes have been returned for these three domains?” Or, “How many requests have all the domains under my account received?” Or even, “What effect did changing my firewall rule an hour ago have on the responses my users were seeing?”

The GraphQL standard also has strong community resources, from extensive documentation to front-end clients, making it easy to start creating simple queries and progress to building your own sophisticated analytics dashboards.

From many APIs…

Providing insights has always been a core part of Cloudflare’s offering. After all, by using Cloudflare, you’re relying on us for key parts of your infrastructure, and so we need to make sure you have the data to manage, monitor, and troubleshoot your website, app, or service. Over time, we developed a few key data APIs, including ones providing information regarding your domain’s traffic, DNS queries, and firewall events. This multi-API approach was acceptable while we had only a few products, but we started to run into some challenges as we added more products and analytics. We couldn’t expect users to adopt a new analytics API every time they started using a new product. In fact, some of the customers and partners that were relying on many of our products were already becoming confused by the various APIs.

Following the multi-API approach was also affecting how quickly we could develop new analytics within the Cloudflare dashboard, which is used by more people for data exploration than our APIs. Each time we built a new product, our product engineering teams had to implement a corresponding analytics API, which our user interface engineering team then had to learn to use. This process could take up to several months for each new set of analytics dashboards.

…to one

Our new GraphQL Analytics API solves these problems by providing access to all Cloudflare analytics. It offers a standard, flexible syntax for describing exactly the data you need and provides predictable, matching responses. This approach makes it an ideal tool for:

  1. Data exploration. You can think of it as a way to query your own virtual data warehouse, full of metrics and logs regarding the performance, security, and reliability of your Internet property.
  2. Building amazing dashboards, which allow for flexible filtering, sorting, and drilling down or rolling up. Creating these kinds of dashboards would normally require paying thousands of dollars for a specialized analytics tool. You get them as part of our product and can customize them for yourself using the API.

In a companion post that was also published today, my colleague Nick discusses using the GraphQL Analytics API to build dashboards. So, in this post, I’ll focus on examples of how you can use the API to explore your data. To make the queries, I’ll be using GraphiQL, a popular open-source querying tool that takes advantage of GraphQL’s capabilities.

Introspection: what data is available?

The first thing you may be wondering: if the GraphQL Analytics API offers access to so much data, how do I figure out what exactly is available, and how I can ask for it? GraphQL makes this easy by offering "introspection," meaning you can query the API itself to see the available data sets, the fields and their types, and the operations you can perform. GraphiQL uses this functionality to provide a "Documentation Explorer," query auto-completion, and syntax validation. For example, the Documentation Explorer lets me see all the data sets available for a zone (domain).

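If you prefer to query the schema directly rather than browse it in GraphiQL, the standard introspection query defined by the GraphQL specification (so it works against any compliant endpoint) returns the same information. For example:

{
  __schema {
    queryType {
      name
    }
    types {
      name
      kind
      description
    }
  }
}

GraphiQL's Documentation Explorer and auto-complete are built on top of exactly this kind of query.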
If I'm writing a query and I'm interested in data on firewall events, auto-complete will help me quickly find relevant data sets and fields.

Querying: examples of questions you can ask

Let’s say you’ve made a major product announcement and expect a surge in requests to your blog, your application, and several other zones (domains) under your account. You can check if this surge materializes by asking for the requests aggregated under your account, in the 30 minutes after your announcement post, broken down by the minute:

{
  viewer {
    accounts(filter: { accountTag: $accountTag }) {
      httpRequests1mGroups(
        limit: 30
        filter: { datetime_geq: "2019-09-16T20:00:00Z", datetime_lt: "2019-09-16T20:30:00Z" }
        orderBy: [datetimeMinute_ASC]
      ) {
        dimensions {
          datetimeMinute
        }
        sum {
          requests
        }
      }
    }
  }
}

Here is the first part of the response, showing requests for your account, by the minute.

Now, let’s say you want to compare the traffic coming to your blog versus your marketing site over the last hour. You can do this in one query, asking for the number of requests to each zone:

{
  viewer {
    zones(filter: { zoneTag_in: [$zoneTag1, $zoneTag2] }) {
      httpRequests1hGroups(
        limit: 2
        filter: { datetime_geq: "2019-09-16T20:00:00Z", datetime_lt: "2019-09-16T21:00:00Z" }
      ) {
        sum {
          requests
        }
      }
    }
  }
}

The response returns the number of requests for each of the two zones over that hour.

Finally, let’s say you’re seeing an increase in error responses. Could this be correlated to an attack? You can look at error codes and firewall events over the last 15 minutes, for example:

{
  viewer {
    zones(filter: { zoneTag: $zoneTag }) {
      httpRequests1mGroups(
        limit: 100
        filter: { datetime_geq: "2019-09-16T21:00:00Z", datetime_lt: "2019-09-16T21:15:00Z" }
      ) {
        sum {
          responseStatusMap {
            edgeResponseStatus
            requests
          }
        }
      }
      firewallEventsAdaptiveGroups(
        limit: 100
        filter: { datetime_geq: "2019-09-16T21:00:00Z", datetime_lt: "2019-09-16T21:15:00Z" }
      ) {
        dimensions {
          action
        }
        count
      }
    }
  }
}

Notice that, in this query, we're looking at multiple data sets at once, using a common zone identifier to "join" them.

By examining both data sets in parallel, we can see a correlation: 31 requests were “dropped” or blocked by the Firewall, which is exactly the same as the number of “403” responses. So, the 403 responses were a result of Firewall actions.

Try it today

To learn more about the GraphQL Analytics API and start exploring your Cloudflare data, follow the “Getting started” guide in our developer documentation, which also has details regarding the current data sets and time periods available. We’ll be adding more data sets over time, so take advantage of the introspection feature to see the latest available.

Finally, to make way for the new API, the Zone Analytics API is now deprecated and will be sunset on May 31, 2020. The data that Zone Analytics provides is available from the GraphQL Analytics API. If you’re currently using the API directly, please follow our migration guide to change your API calls. If you get your analytics using the Cloudflare dashboard or our Datadog integration, you don’t need to take any action.

One more thing….

In the API examples above, if you find it helpful to get analytics aggregated for all the domains under your account, we have something else you may like: a brand new Analytics dashboard (in beta) that provides this same information. If your account has many zones, the dashboard is helpful for knowing summary information on metrics such as requests, bandwidth, cache rate, and error rate. Give it a try and let us know what you think using the feedback link above the new dashboard.

GraphQL Search Indexing

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/graphql-search-indexing-334c92e0d8d5?source=rss----2615bd06b42e---4

by Artem Shtatnov and Ravi Srinivas Ranganathan

Almost a year ago we described our learnings from adopting GraphQL on the Netflix Marketing Tech team. We have a lot more to share since then! There are plenty of existing resources describing how to express a search query in GraphQL and paginate the results. This post looks at the other side of search: how to index data and make it searchable. Specifically, how our team uses the relationships and schemas defined within GraphQL to automatically build and maintain a search database.

Marketing Tech at Netflix

Our goal is to promote Netflix's content across the globe. Netflix has thousands of shows on the service, operates in over 190 countries, and supports around 30 languages. For each of these shows, countries, and languages, we need to find the right creative that resonates with each potential viewer. Our team builds the tools to produce and distribute these marketing creatives at a global scale, powering tens of billions of impressions every month!

Various creatives Marketing Tech supports

To enable our marketing stakeholders to manage these creatives, we need to pull together data that is spread across many services — GraphQL makes this aggregation easy.

As an example, our data is centered around a creative service to keep track of the creatives we build. Each creative is enhanced with more information on the show it promotes, and the show is further enhanced with its ranking across the world. Also, our marketing team can comment on the creative when adjustments are needed. There are many more relationships that we maintain, but we will focus on these few for the post.

GraphQL query before indexing

Challenges of Searching Decentralized Data

Displaying the data for one creative is helpful, but we have a lot of creatives to search across. If we produced only a few variations for each of the shows, languages, and countries Netflix supports, that would result in over 50 million total creatives. We needed a proper search solution.

The problem stems from the fact that we are trying to search data across multiple independent services that are loosely coupled. No single service has complete context into how the system works. Each service could potentially implement its own search database, but then we would still need an aggregator. This aggregator would need to perform more complex operations, such as searching for creatives by ranking even though the ranking data is stored two hops away in another service.

If we had a single database with all of the information in it, the search would be easy: we could write a couple of join statements and where clauses, problem solved. Nevertheless, a single database has its own drawbacks, mainly around limited flexibility in allowing teams to work independently and performance limitations at scale.

Another option would be to use a custom aggregation service that builds its own index of the data. This service would understand where each piece of data comes from, know how all of the data is connected, and be able to combine the data in a variety of ways. Apart from the indexing part, these characteristics perfectly describe the entity relationships in GraphQL.

Indexing the Data

Since we already use GraphQL, how can we leverage it to index our data? We can update our GraphQL query slightly to retrieve a single creative and all of its related data, then call that query once for each of the creatives in our database, indexing the results into Elasticsearch. By batching and parallelizing the requests to retrieve many creatives via a single query to the GraphQL server, we can optimize the index building process.

GraphQL query for indexing
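To make this concrete, here is a rough sketch of what such an indexing query might look like; the type and field names are purely illustrative, not Netflix's actual schema:

query CreativeForIndexing($creativeId: ID!) {
  creative(id: $creativeId) {
    id
    imageUrl
    show {
      id
      title
      rankings {
        country
        rank
      }
    }
    comments {
      id
      text
    }
  }
}

Whatever fields the query selects are exactly what ends up in the Elasticsearch document, so the query doubles as the definition of what becomes searchable.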

Elasticsearch has a lot of customization options when indexing data, but in many cases the default settings give pretty good results. At a minimum, we extract all of the type definitions from the GraphQL query and map them to a schema for Elasticsearch to use.

The nice part about using a GraphQL query to generate the schema is that any existing clients relying on this data will get the same shape of data regardless of whether it comes from the GraphQL server or the search index directly.

Once our data is indexed, we can sort, group, and filter on arbitrary fields; provide typeahead suggestions to our users; display facets for quick filtering; and progressively load data to provide an infinite scroll experience. Best of all, our page can load much faster since everything is cached in Elasticsearch.

Keeping Everything Up To Date

Indexing the data once isn’t enough. We need to make sure that the index is always up to date. Our data changes constantly — marketing users make edits to creatives, our recommendation algorithm refreshes to give the latest title popularity rankings and so on. Luckily, we have Kafka events that are emitted each time a piece of data changes. The first step is to listen to those events and act accordingly.

When our indexer hears a change event it needs to find all the creatives that are affected and reindex them. For example, if a title ranking changes, we need to find the related show, then its corresponding creative, and reindex it. We could hardcode all of these rules, but we would need to keep these rules up to date as our data evolves and for each new index we build.

Fortunately, we can rely on GraphQL’s entity relationships to find exactly what needs to be reindexed. Our search indexer understands these relationships by accessing a shared GraphQL schema or using an introspection query to retrieve the schema.

Our GraphQL query represented as a tree

In our earlier example, the indexer can fan out one level from title ranking to show by automatically generating a query to GraphQL to find shows that are related to the changed title ranking. After that, it queries Elasticsearch using the show and title ranking data to find creatives that reference these values. It can reindex those creatives using the same pipeline used to index them in the first place. What makes this method so great is that after defining GraphQL schemas and resolvers once, there is no additional work to do. The graph has enough data to keep the search index up to date.

Inverted Graph Index

Let’s look a bit deeper into the three steps the search indexer conducts: fan out, search, and index. As an example, if the algorithm starts recommending show 80186799 in Finland, the indexer would generate a GraphQL query to find the immediate parent: the show that the algorithm data is referring to. Once it finds that this recommendation is for Stranger Things, it would use Elasticsearch’s inverted index to find all creatives with show Stranger Things or with the algorithm recommendation data. The creatives are updated via another call to GraphQL and reindexed back to Elasticsearch.

The fan out step is needed in cases where the vertex update causes new edges to be created. If our algorithm previously didn’t have enough data to rank Stranger Things in Finland, the search step alone would never find this data in our index. Also, the fan out step does not need to perform a full graph search. Since GraphQL resolvers are written to only rely on data from the immediate parent, any vertex change can only impact its own edges. The combination of the single graph traversal and searching via an inverted index allows us to greatly increase performance for more complex graphs.

The fanout + search pattern works with more complex graphs
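As a rough illustration of the fan-out step, the indexer might generate a query like the following to walk one level up from the changed title-ranking vertex to its parent show (again, the field names are hypothetical):

query ShowForChangedRanking($rankingId: ID!) {
  titleRanking(id: $rankingId) {
    id
    show {
      id
      title
    }
  }
}

With the show in hand, the indexer can search the inverted index for creatives that reference it and reindex just those documents.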

The indexer currently reruns the same GraphQL query that we used to first build our index, but we can optimize this step by only retrieving changes from the parent of the changed vertex and below. We can also optimize by putting a queue in front of both the change listener and the reindexing step. These queues debounce, dedupe, and throttle tasks to better handle spikes in workload.

The overall performance of the search indexer is fairly good as well. Listening to Kafka events adds little latency, our fan out operations are really quick since we store foreign keys to identify the edges, and looking up data in an inverted index is fast as well. Even with minimal performance optimizations, we have seen median delays under 500ms. The great thing is that the search indexer runs in close to constant time after a change, and won’t slow down as the amount of data grows.

Periodic Indexing

We run a full indexing job when we define a new index or make breaking schema changes to an existing index.

In the latter case, we don’t want to entirely wipe out the old index until after verifying that the newly indexed data is correct. For this reason, we use aliases. Whenever we start an indexing job, the indexer always writes the data to a new index that is properly versioned. Additionally, the change events need to be dual written to the new index as it is being built, otherwise, some data will be lost. Once all documents have been indexed with no errors, we swap the alias from the currently active index to the newly built index.

In cases where we can't fully rely on the change events, or some of our data does not have a change event associated with it, we run a periodic job to fully reindex the data. As part of this regular reindexing job, we compare the new data being indexed with the data currently in our index. Keeping track of which fields changed can help alert us to bugs, such as change events not being emitted or hidden edges not being modeled within GraphQL.

Initial Setup

We built all of this logic for indexing, communicating with GraphQL, and handling changes into a search indexer service. In order to set up the search indexer there are a few requirements:

  1. Kafka. The indexer needs to know when changes happen. We use Kafka to handle change events, but any system that can notify the indexer of a change in the data would be sufficient.
  2. GraphQL. To act on the change, we need a GraphQL server that supports introspection. The graph has two requirements. First, each vertex must have a unique ID to make it easily identifiable by the search step. Second, for fan out to work, edges in the graph must be bidirectional.
  3. Elasticsearch. The data needs to be stored in a search database for quick retrieval. We use Elasticsearch, but there are many other options as well.
  4. Search Indexer. Our indexer combines the three items above. It is configured with an endpoint to our GraphQL server, a connection to our search database, and mappings from our Kafka events to the vertices in the graph.
How our search indexer is wired up

Building a New Index

After the initial setup, defining a new index and keeping it up to date is easy:

  1. GraphQL Query. We need to define the GraphQL query that retrieves the data we want to index.
  2. That’s it.

Once the initial setup is complete, defining a GraphQL query is the only requirement for building a new index. We can define as many indices as needed, each having its own query. Optionally, if we want to reindex from scratch, we need to give the indexer a way to paginate through all of the data, or tell it to rely on the existing index to bootstrap itself. Also, if we need custom mappings for Elasticsearch, we would need to define the mappings to mirror the GraphQL query.

The GraphQL query defines the fields we want to index and allows the indexer to retrieve data for those fields. The relationships in GraphQL allow keeping the index up to date automatically.

Where the Index Fits

The output of the search indexer feeds into an Elasticsearch database, so we needed a way to utilize it. Before we indexed our data, our browser application would call our GraphQL server, asking it to aggregate all of the data, then we filtered it down on the client side.

Data flow before indexing

After indexing, the browser can now call Elasticsearch directly (or via a thin wrapper to add security and abstract away database complexities). This setup allows the browser to fully utilize the search functionality of Elasticsearch instead of performing searches on the client. Since the data is the same shape as the original GraphQL query, we can rely on the same auto-generated Typescript types and don’t need major code changes.

Data flow after indexing

One additional layer of abstraction we are considering, but haven't implemented yet, is accessing Elasticsearch via GraphQL. The browser would continue to call the GraphQL server in the same way as before. The resolvers in GraphQL would call Elasticsearch directly if any search criteria are passed in. We could even implement the search indexer as middleware within our GraphQL server. It would enhance the schema for data that is indexed and intercept calls when searches need to be performed. This approach would turn search into a plugin that can be enabled on any GraphQL server with minimal configuration.

Using GraphQL to abstract away Elasticsearch
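A minimal sketch of what this abstraction could look like, assuming hypothetical type and field names, is a search argument on the existing query field whose resolver delegates to Elasticsearch whenever criteria are supplied:

type Creative {
  id: ID!
  imageUrl: String
}

input CreativeSearchInput {
  text: String
  showTitle: String
  country: String
}

type Query {
  # The resolver would query Elasticsearch when `search` is provided,
  # and fall back to the regular GraphQL resolvers otherwise
  creatives(search: CreativeSearchInput): [Creative]
}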

Caveats

Automatically indexing key queries on our graph has yielded tremendously positive results, but there are a few caveats to consider.

Just like with any graph, supernodes may cause problems. A supernode is a vertex in the graph that has a disproportionately large number of edges. Any changes that affect a supernode will force the indexer to reindex many documents, blocking other changes from being reindexed. The indexer needs to throttle any changes that affect too many documents to keep the queue open for smaller changes that only affect a single document.

The relationships defined in GraphQL are key to determining what to reindex if a change occurred. A hidden edge, an edge not defined fully by one of the two vertices it connects, can prevent some changes from being detected. For example, if we model the relationship between creatives and shows via a third table containing tuples of creative IDs and show IDs, that table would either need to be represented in the graph or its changes attributed to one of the vertices it connects.

By indexing data into a single store, we lose the ability to differentiate user specific aspects of the data. For example, Elasticsearch cannot store unread comment count per user for each of the creatives. As a workaround, we store the total comment count per creative in Elasticsearch, then on page load make an additional call to retrieve the unread counts for the creatives with comments.

Many UI applications practice a pattern of read after write, asking the server to provide the latest version of a document after changes are made. Since the indexing process is asynchronous to avoid bottlenecks, clients would no longer be able to retrieve the latest data from the index immediately after making a modification. On the other hand, since our indexer is constantly aware of all changes, we can expose a websocket connection to the client that notifies it when certain documents change.

The performance savings from indexing come primarily from the fact that this approach shifts the workload of aggregating and searching data from read time to write time. If the application exhibits substantially more writes than reads, indexing the data might create more of a performance hit.

The underlying assumption of indexing data is that you need robust search functionality, such as sorting, grouping, and filtering. If your application doesn’t need to search across data, but merely wants the performance benefits of caching, there are many other options available that can effectively cache GraphQL queries.

Finally, if you don’t already use GraphQL or your data is not distributed across multiple databases, there are plenty of ways to quickly perform searches. A few table joins in a relational database provide pretty good results. For larger scale, we’re building a similar graph-based solution that multiple teams across Netflix can leverage which also keeps the search index up to date in real time.

There are many other ways to search across data, each with its own pros and cons. The best thing about using GraphQL to build and maintain our search index is its flexibility, ease of implementation, and low maintenance. The graphical representation of data in GraphQL makes it extremely powerful, even for use cases we hadn’t originally imagined.

If you’ve made it this far and you’re also interested in joining the Netflix Marketing Technology team to help conquer our unique challenges, check out the open positions listed on our page. We’re hiring!


GraphQL Search Indexing was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building a GraphQL server on the edge with Cloudflare Workers

Post Syndicated from Kristian Freeman original https://blog.cloudflare.com/building-a-graphql-server-on-the-edge-with-cloudflare-workers/

Today, we’re open-sourcing an exciting project that showcases the strengths of our Cloudflare Workers platform: workers-graphql-server is a batteries-included Apollo GraphQL server, designed to get you up and running quickly with GraphQL.

Testing GraphQL queries in the GraphQL Playground

As a full-stack developer, I'm really excited about GraphQL. I love building user interfaces with React, but as a project gets more complex, it can become really difficult to keep track of how data is managed inside an application. GraphQL makes that really easy: instead of having to recall the REST URL structure of your backend API, or remember where your backend server doesn't quite follow REST conventions, you just tell GraphQL what data you want, and it takes care of the rest.

Cloudflare Workers is uniquely suited to hosting a GraphQL server. Because your code runs on Cloudflare's servers around the world, the average latency for your requests is extremely low, and by using Wrangler, our open-source command line tool for building and managing Workers projects, you can deploy new versions of your GraphQL server around the world within seconds.

If you’d like to try the GraphQL server, check out a demo GraphQL playground, deployed on Workers.dev. This optional add-on to the GraphQL server allows you to experiment with GraphQL queries and mutations, giving you a super powerful way to understand how to interface with your data, without having to hop into a codebase.

If you’re ready to get started building your own GraphQL server with our new open-source project, we’ve added a new tutorial to our Workers documentation to help you get up and running – check it out here!

Finally, if you’re interested in how the project works, or want to help contribute – it’s open-source! We’d love to hear your feedback and see your contributions. Check out the project on GitHub.