Big Data – Noise

How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data…

Netflix Technology Blog — Fri, 17 Oct 2025 18:42:37 +0000

How and Why Netflix Built a Real-Time Distributed Graph: Part 1 — Ingesting and Processing Data Streams at Internet Scale. Authors: Adrian Taruc and James Dalton. This is the first entry of a multi-part blog series describing how we built a Real-Time Distr...

Scaling Muse: How Netflix Powers Data-Driven Creative Insights at Trillion-Row Scale

Netflix Technology Blog — Mon, 22 Sep 2025 21:24:20 +0000

By Andrew Pierce, Chris Thrailkill, Victor Chiapaikeo. At Netflix, we prioritize getting timely data and insights into the hands of the people who can act on them. One of our key internal applications for this purpose is Muse. Muse’s ultimate goal is to ...

Export JMX metrics from Kafka connectors in Amazon Managed Streaming for Apache Kafka Connect with a custom plugin

Jaydev Nath — Fri, 15 Aug 2025 15:51:01 +0000

In this post, we demonstrate how you can export the JMX metrics for Debezium connector when used with Amazon MSK Connect.

Express brokers for Amazon MSK: Turbo-charged Kafka scaling with up to 20 times faster performance

Masudur Rahaman Sayem — Fri, 07 Mar 2025 16:49:21 +0000

In this post, we walk you through the implementation of MSK Express brokers, highlighting their core features, benefits, and best practices for rapid Kafka scaling.

Let’s Architect! Modern data architectures

Luca Mezzalira — Tue, 05 Nov 2024 22:31:27 +0000

Data is the fuel for AI; modern data is even more important for generative AI and advanced data analytics, producing more accurate, relevant, and impactful results. Modern data comes in various forms: real-time, unstructured, or user-generated. Each form requires a different solution. AWS’s data journey began with Amazon Simple Storage Service (Amazon S3) in 2006, […]

Modernize your legacy databases with AWS data lakes, Part 2: Build a data lake using AWS DMS data on Apache Iceberg

Shaheer Mansoor — Wed, 30 Oct 2024 20:15:02 +0000

This is part two of a three-part series where we show how to build a data lake on AWS using a modern data architecture. This post shows how to load data from a legacy database (SQL Server) into a transactional data lake (Apache Iceberg) using AWS Glue. We show how to build data pipelines using AWS Glue jobs, optimize them for both cost and performance, and implement schema evolution to automate manual tasks. To review the first part of the series, where we load SQL Server data into Amazon Simple Storage Service (Amazon S3) using AWS Database Migration Service (AWS DMS), see Modernize your legacy databases with AWS data lakes, Part 1: Migrate SQL Server using AWS DMS.

How Getir unleashed data democratization using a data mesh architecture with Amazon Redshift

Asser Moustafa — Wed, 23 Oct 2024 15:52:23 +0000

In this post, we explain how ultrafast delivery pioneer, Getir, unleashed the power of data democratization on a large scale through their data mesh architecture using Amazon Redshift. We start by introducing Getir and their vision—to seamlessly, securely, and efficiently share business data across different teams within the organization for BI, extract, transform, and load (ETL), and other use cases. We’ll then explore how Amazon Redshift data sharing powered the data mesh architecture that allowed Getir to achieve this transformative vision.

Building a scalable streaming data platform that enables real-time and batch analytics of electric vehicles on AWS

Ayush Agrawal — Wed, 17 Jul 2024 16:53:05 +0000

The automobile industry has undergone a remarkable transformation because of the increasing adoption of electric vehicles (EVs). EVs, known for their sustainability and eco-friendliness, are paving the way for a new era in transportation. As environmental concerns and the push for greener technologies have gained momentum, the adoption of EVs has surged, promising to reshape […]

Detect and handle data skew on AWS Glue

Salim Tutuncu — Wed, 01 May 2024 16:27:24 +0000

AWS Glue is a fully managed, serverless data integration service provided by Amazon Web Services (AWS) that uses Apache Spark as one of its backend processing engines (as of this writing, you can use Python Shell, Spark, or Ray). Data skew occurs when the data being processed is not evenly distributed across the Spark cluster, […]

How the GoDaddy data platform achieved over 60% cost reduction and 50% performance boost by adopting Amazon EMR Serverless

Brandon Abear — Tue, 12 Mar 2024 16:01:54 +0000

This is a guest post co-written with Brandon Abear, Dinesh Sharma, John Bush, and Ozcan IIikhan from GoDaddy. GoDaddy empowers everyday entrepreneurs by providing all the help and tools to succeed online. With more than 20 million customers worldwide, GoDaddy is the place people come to name their ideas, build a professional website, attract customers, […]

Sliding window rate limits in distributed systems

Grab Tech — Thu, 14 Dec 2023 00:00:10 +0000

Like many other companies, Grab uses marketing communications to notify users of promotions or other news. If a user receives these notifications from multiple companies, it would be a form of information overload and they might even start considering ...

Road localisation in GrabMaps

Grab Tech — Fri, 17 Nov 2023 00:00:10 +0000

Introduction: In 2022, Grab achieved self-sufficiency in its Geo services. As part of this transition, one crucial step was moving towards using an internally-developed map tailored specifically to the market in which Grab operates. Now that we have fu...

Streaming SQL in Data Mesh

Netflix Technology Blog — Fri, 03 Nov 2023 21:48:50 +0000

Democratizing Stream Processing @ Netflix. By Guil Pires, Mark Cho, Mingliang Liu, Sujay Jain. Data powers much of what we do at Netflix. On the Data Platform team, we build the infrastructure used across the company to process data at scale. In our last bl...

Building hyperlocal GrabMaps

Grab Tech — Wed, 30 Aug 2023 00:00:10 +0000

Introduction: Southeast Asia (SEA) is a dynamic market, very different from other parts of the world. When travelling on the road, you may experience fast-changing road restrictions, new roads appearing overnight, and high traffic congestion. To addres...

Let’s Architect! Architecting a data mesh

Luca Mezzalira — Wed, 08 Mar 2023 16:20:53 +0000

Data architectures were mainly designed around technologies rather than business domains in the past. This changed in 2019, when Zhamak Dehghani introduced the data mesh. Data mesh is an application of the Domain-Driven-Design (DDD) principles to data architectures: Data is organized into data domains and the data is the product that the team owns and […]

AWS Local Zones and AWS Outposts, choosing the right technology for your edge workload

Sheila Busser — Thu, 01 Dec 2022 18:25:40 +0000

This blog post is written by Joe Sacco, Senior Technical Account Manager. The AWS Global Cloud Infrastructure includes 30 Launched Regions, 96 Availability Zones (AZs), 410+ Points of Presence with 400+ Edge Locations, and 13 Regional Edge Caches. With over 200 AWS services, most customer workloads can run in the AWS Regions. However, for some […]

Let’s Architect! Modern data architectures

Luca Mezzalira — Wed, 07 Sep 2022 14:54:48 +0000

With the rapid growth in data coming from data platforms and applications, and the continuous improvements in state-of-the-art machine learning algorithms, data are becoming key assets for companies. Modern data architectures include data mesh—a recent style that represents a paradigm shift, in which data is treated as a product and data architectures are designed around […]

Interactively develop your AWS Glue streaming ETL jobs using AWS Glue Studio notebooks

Arun A K — Thu, 01 Sep 2022 18:43:52 +0000

Enterprise customers are modernizing their data warehouses and data lakes to provide real-time insights, because having the right insights at the right time is crucial for good business outcomes. To enable near-real-time decision-making, data pipelines need to process real-time or near-real-time data. This data is sourced from IoT devices, change data capture (CDC) services like […]

From centralized architecture to decentralized architecture: How data sharing fine-tunes Amazon Redshift workloads

Jingbin Ma — Tue, 16 Aug 2022 17:53:16 +0000

Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that offers simple operations and high performance. It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Today, Amazon Redshift has become the most widely used cloud data warehouse. With the significant […]

Bias in the machine: How can we address gender bias in AI?

Sue Sentance — Tue, 08 Mar 2022 09:42:15 +0000

At the Raspberry Pi Foundation, we’ve been thinking about questions relating to artificial intelligence (AI) education and data science education for several months now, inviting experts to share their perspectives in a series of very well-attended seminars. At the same time, we’ve been running a programme of research trials to find out what interventions in…

The post Bias in the machine: How can we address gender bias in AI? appeared first on Raspberry Pi.