By Chris Larsen, Architect
OpenTSDB is one of the first dedicated open source time series databases built on top of Apache HBase and the Hadoop Distributed File System. Today, we are proud to share that version 2.4.0 is now available and has many new features developed in-house and with contributions from the open source community. This release would not have been possible without support from our monitoring team, the Hadoop and HBase developers, as well as contributors from other companies like Salesforce, Alibaba, JD.com, Arista and more. Thank you to everyone who contributed to this release!
A few of the exciting new features include:
Rollup and Pre-Aggregation Storage
As time series data grows, storing the original measurements becomes expensive. Particularly in the case of monitoring workflows, users rarely care about last years’ high fidelity data. It’s more efficient to store lower resolution “rollups” for longer periods, discarding the original high-resolution data. OpenTSDB now supports storing and querying such data so that the raw data can expire from HBase or Bigtable, and the rollups can stick around longer. Querying for long time ranges will read from the lower resolution data, fetching fewer data points and speeding up queries.
Likewise, when a user wants to query tens of thousands of time series grouped by, for example, data centers, the TSD will have to fetch and process a significant amount of data, making queries painfully slow. To improve query speed, pre-aggregated data can be stored and queried to fetch much less data at query time, while still retaining the raw data. We have an Apache Storm pipeline that computes these rollups and pre-aggregates, and we intend to open source that code in 2019. For more details, please visit http://opentsdb.net/docs/build/html/user_guide/rollups.html.
Histograms and Sketches
When monitoring or performing data analysis, users often like to explore percentiles of their measurements, such as the 99.9th percentile of website request latency to detect issues and determine what consumers are experiencing. Popular metrics collection libraries will happily report percentiles for the data they collect. Yet while querying for the original percentile data for a single time series is useful, trying to query and combine the data from multiple series is mathematically incorrect, leading to errant observations and problems. For example, if you want the 99.9th percentile of latency in a particular region, you can’t just sum or recompute the 99.9th of the 99.9th percentile.
To solve this issue, we needed a complex data structure that can be combined to calculate an accurate percentile. One such structure that has existed for a long time is the bucketed histogram, where measurements are sliced into value ranges and each range maintains a count of measurements that fall into that bucket. These buckets can be sized based on the required accuracy and the counts from multiple sources (sharing the same bucket ranges) combined to compute an accurate percentile.
Bucketed histograms can be expensive to store for highly accurate data, as many buckets and counts are required. Additionally, many measurements don’t have to be perfectly accurate but they should be precise. Thus another class of algorithms could be used to approximate the data via sampling and provide highly precise data with a fixed interval. Data scientists at Yahoo (now part of Oath) implemented a great Java library called Data Sketches that implements the Stochastic Streaming Algorithms to reduce the amount of data stored for high-throughput services. Sketches have been a huge help for the OLAP storage system Druid (also sponsored by Oath) and Bullet, Oath’s open source real-time data query engine.
The latest TSDB version supports bucketed histograms, Data Sketches, and T-Digests.
Some additional features include:
- HBase Date Tiered Compaction support to improve storage efficiency.
- A new authentication plugin interface to support enterprise use cases.
- An interface to support fetching data directly from Bigtable or HBase rows using a search index such as ElasticSearch. This improves queries for small subsets of high cardinality data and we’re working on open sourcing our code for the ES schema.
- Greater UID cache controls and an optional LRU implementation to reduce the amount of JVM heap allocated to UID to string mappings.
- Configurable query size and time limits to avoid OOMing a JVM with large queries.
Additionally, we’ve started on 3.0, which is a rewrite that will support a slew of new features including:
- Querying and analyzing data from the plethora of new time series stores.
- A fully configurable query graph that allows for complex queries OpenTSDB 1x and 2x couldn’t support.
- Streaming results to improve the user experience and avoid overwhelming a single query node.
- Advanced analytics including support for time series forecasting with Yahoo’s EGADs library.
Please join us in testing out the current 3.0 code, reporting bugs, and adding features.
By Andy Wick, Chief Architect, Oath & Elyse Rinne, Software Engineer, Oath
Last month, our Moloch team hosted the second all day Moloch conference at our Dulles, Virginia campus. Moloch, the large-scale, full packet capturing, indexing, and database system was developed by Andy Wick at AOL (now part of Oath) in 2011 and open-sourced in 2012. Elyse Rinne joined the Moloch team in 2016 to enhance the tool’s front-end features. The project enjoys an active community of users and contributors.
Most recently, on November 1, more than 80 Moloch users and developers joined the Moloch core team to discuss the latest features, administrative capabilities, and clever uses of Moloch.
Speakers from Elastic, SANS, Cox, SecureOps, and Oath presented their experiences setting up and using Moloch in a variety of security-focused scenarios. Afterwards, the participants brainstormed new project features and enhancements. We ended with happy hour giving a chance to relax and network. Although most of the talks were not recorded due to the sensitive topics related to blue team security tactics in some of the presentations, we do have these presentation recordings and slides that are cleared for the public:
- Recent Changes to Moloch – Video & Slides.
- Moloch Deployments at Oath – Video & Slides.
- Using Wise – Video & Slides.
- All Presentations (including external and 2017 MolochON presentations)
If you are a blue team security professional, consider joining the Moloch community, use and help contribute to the project, and chat with us on Slack. To get started, check out our README and FAQ pages on GitHub.
Today we’re kicking off a blog post series of need-to-know updates on Vespa, summarizing the features and fixes detailed in Github issues.
We welcome your contributions and feedback about any new features or improvements you’d like to see.
For December, we’re excited to share the following product news:
Streaming Search Performance Improvement
Streaming Search is a solution for applications where each query only searches a small, statically determined subset of the corpus. In this case, Vespa searches without building reverse indexes, reducing storage cost and making writes more efficient. With the latest changes, the document type is used to further limit data scanning, resulting in lower latencies and higher throughput. Read more here.
ONNX is an open ecosystem for interchangeable AI models. Vespa now supports importing models in the ONNX format and transforming the models into Tensors for use in ranking. This adds to the TensorFlow import included earlier this year and allows Vespa to support many training tools. While Vespa’s strength is real-time model evaluation over large datasets, to get started using single data points, try the stateless model evaluation API. Explore this integration more in Ranking with ONNX models.
Precise Transaction Log Pruning
Vespa is built for large applications running continuous integration and deployment. This means nodes restart often for software upgrades, and node restart time matters. A common pattern is serving while restarting hosts one by one. Vespa has optimized transaction log pruning with prepareRestart, due to flushing as much as possible before stopping, which is quicker than replaying the same data after restarting. This feature is on by default. Learn more in live upgrade and prepareRestart.
Grouping on Maps
Grouping is used to implement faceting. Vespa has added support to group using map attribute fields, creating a group for values whose keys match the specified key, or field values referenced by the key. This support is useful to create indirections and relations in data and is great for use cases with structured data like e-commerce. Leverage key values instead of field names to simplify the search definition. Read more in Grouping on Map Attributes.
By Ohad Shacham, Yonatan Gottesman, Edward Bortnikov
Scalable Systems Research, Verizon/Oath
Omid, an open source transaction processing platform for Big Data, was born as a research project at Yahoo (now part of Verizon), and became an Apache Incubator project in 2015. Omid complements Apache HBase, a distributed key-value store in Apache Hadoop suite, with a capability to clip multiple operations into logically indivisible (atomic) units named transactions. This programming model has been extremely popular since the dawn of SQL databases, and has more recently become indispensable in the NoSQL world. For example, it is the centerpiece for dynamic content indexing of search and media products at Verizon, powering a web-scale content management platform since 2015.
Today, we are excited to share a new chapter in Omid’s history. Thanks to its scalability, reliability, and speed, Omid has been selected as transaction management provider for Apache Phoenix, a real-time converged OLTP and analytics platform for Hadoop. Phoenix provides a standard SQL interface to HBase key-value storage, which is much simpler and in many cases more performant than the native HBase API. With Phoenix, big data and machine learning developers get the best of all worlds: increased productivity coupled with high scalability. Phoenix is designed to scale to 10,000 query processing nodes in one instance and is expected to process hundreds of thousands or even millions of transactions per second (tps). It is widely used in the industry, including by Alibaba, Bloomberg, PubMatic, Salesforce, Sogou and many others.
We have just released a new and significantly improved version of Omid (1.0.0), the first major release since its original launch. We have extended the system with multiple functional and performance features to power a modern SQL database technology, ready for deployment on both private and public cloud platforms.
A few of the significant innovations include:
Protocol re-design for low latency
The early version of Omid was designed for use in web-scale data pipeline systems, which are throughput-oriented by nature. We re-engineered Omid’s internals to now support new ultra-low-latency OLTP (online transaction processing) applications, like messaging and algo-trading. The new protocol, Omid Low Latency (Omid LL), dissipates Omid’s major architectural bottleneck. It reduces the latency of short transactions by 5 times under light load, and by 10 to 100 times under heavy load. It also scales the overall system throughput to 550,000 tps while remaining within real-time latency SLAs. The figure below illustrates Omid LL scaling versus the previous version of Omid, for short and long transactions.
Throughput vs latency, transaction size=1 op
Throughput vs latency, transaction size=10 ops
Figure 1. Omid LL scaling versus legacy Omid. The throughput scales beyond 550,000 tps while the latency remains flat (low milliseconds).
ANSI SQL support
Phoenix provides secondary indexes for SQL tables — a centerpiece tool for efficient access to data by multiple keys. The CREATE INDEX command is on-demand; it is not allowed to block already deployed applications. We added Omid support for accomplishing this without impeding concurrent database operations or sacrificing consistency. We further introduced a mechanism to avoid recursive read-your-own-writes scenarios in complex queries, like “INSERT INTO T … SELECT FROM T …” statements. This was achieved by extending Omid’s traditional Snapshot Isolation consistency model, which provides single-read-point-single-write-point semantics, with multiple read and write points.
Phoenix extensively employs stored procedures implemented as HBase filters in order to eliminate the overhead of multiple round-trips to the data store. We integrated Omid’s code within such HBase-resident procedures, allowing for a smooth integration with Phoenix and also reduced the overhead of transactional reads (for example, filtering out redundant data versions).
We collaborated closely with the Phoenix developer community while working on this project, and contributed code to Phoenix that made Omid’s integration possible. We look forward to seeing Omid’s adoption through a wide range of Phoenix applications. We always welcome new developers to join the community and help push Omid forward!
By Scott Bush, Director, Hadoop Software Engineering, Oath
On Tuesday, September 25, we hosted a special day-long Hadoop Contributors Meetup at our Sunnyvale, California campus. Much of the early Hadoop development work started at Yahoo, now part of Oath, and has continued over the past decade. Our campus was the perfect setting for this meetup, as we continue to make Hadoop a priority.
More than 80 Hadoop users, contributors, committers, and PMC members gathered to hear talks on key issues facing the Hadoop user community.
Speakers from Ampool, Cloudera, Hortonworks, Microsoft, Oath, and Twitter detailed some of the challenges and solutions pertinent to their parts of the Hadoop ecosystem. The talks were followed by a number of parallel, birds of a feather breakout sessions to discuss HDFS, Tez, containers and low latency processing. The day ended with a reception and consensus that the event went well and should be repeated in the near future.
Presentation recordings (YouTube playlist) and slides (links included in the video description) are available here:
Thank you to all the presenters and the attendees both in person and remote!
P.S. We’re hiring! Learn more about career opportunities at Oath.
We’re excited to share that the 2nd Annual MolochON will be Thursday, Nov. 1, 2018 in Dulles, Virginia, at the Oath campus. Moloch is a large-scale, open source, full packet capturing, indexing and database system.
There’s no cost to attend the event and we’d love to see you there! Feel free to register here.
We’ll be joined by many fantastic speakers from the Moloch community to present on the following topics:
Since the last MolochON, many new features have been added to Moloch. We will review some of these features and demo how to use them. We will also discuss a few desired upcoming features.
Andy is the creator of Moloch and former Architect of AIM. He joined the security team in 2011 and hasn’t looked back.
Elyse is the UI and full stack engineer for Moloch. She revamped the UI to be more user-friendly and maintainable. Now that the revamp has been completed, Elyse is working on implementing awesome new Moloch features!
Small Scale at Large Scale: Putting Moloch on the Analyst’s Desk
by Phil Hagen, SANS Senior Instructor, DFIR Strategist, Red Canary
I’ve been excited to add Moloch to the FOR572 class, Advanced Network Forensics at the SANS Institute. In FOR572, we cover Moloch with nearly 1,000 students per year, via classroom discussions and hands-on labs. This presents an interesting engineering problem, in that we provide a self-contained VMware image for the classroom lab, but it is also suitable for use in forensic casework. In this talk, I’ll cover some of what we did to make a single VM into a stable and predictable environment, distributed to hundreds of students across the world.
Phil is a Senior Instructor with the SANS Institute and the DFIR Strategist at Red Canary. He is the course lead for SANS FOR572, Advanced Network Forensics, and has been in the information security industry for over 20 years. Phil is also the lead for the SOF-ELK project, which provides a free, open source, ready-to-use Elastic Stack appliance to aid and optimize security operations and forensic processing. Networking is in his blood, dating back to a 2400 baud modem in an Apple //e, which he still has.
by Andy Wick, Sr Princ Architect, Oath
The formation of Oath gave us an opportunity to rethink and create a new visibility stack. In this talk, we will be sharing our process for designing our stack for both office and data center deployments and discussing the technologies we decided to use.
Andy is the creator of Moloch and former Architect of AIM. He joined the security team in 2011 and hasn’t looked back.
Centralized Management and Deployment with Docker and Ansible
by Taylor Ashworth, Cybersecurity Analyst
I will focus on how to use Docker and Ansible to deploy, update, and manage Moloch along with other tools like Suricata, WISE, and ES. I will explain the time-saving benefits of Ansible and the workload reduction benefits of Docker,and I will also cover the topic “Pros and cons of using Ansible tower/AWX over Ansible in CLI.” If time permits, I’ll discuss “Using WISE for data enrichment.”
Taylor is a cybersecurity analyst who was tired of the terrible tools he was presented with and decided to teach himself how to set up tools to successfully do his job.
Automated Threat Intel Investigation Pipeline
by Matt Carothers, Principal Security Architect, Cox Communications
I will discuss integrating Moloch into an automated threat intel investigation pipeline with MISP.
Matt enjoys sunsets, long hikes in the mountains and intrusion detection. After studying Computer Science at the University of Oklahoma, he accepted a position with Cox Communications in 2001 under the leadership of renowned thought leader and virtuoso bass player William “Wild Bill” Beesley, who asked to be credited in this bio. There, Matt formed Cox’s abuse department, which he led for several years, and today he serves as Cox’s Principal Security Architect.
by Andy Wick, Sr Princ Architect, Oath
We will review how to use WISE and provide real-life examples of features added since the last MolochON.
Andy is the creator of Moloch and former Architect of AIM. He joined the security team in 2011 and hasn’t looked back.
by Srinath Mantripragada, Linux Integrator, SecureOps
I will present a Moloch deployment with 20+ different Moloch nodes. A range will be presented, including small, medium, and large deployments that go from full hardware with dedicated capture cards to virtualized point-of-presence and AWS with transit network. All nodes run Moloch, Suricata and Bro.
Srinath has worked as a SysAdmin and related positions for most of his career. He currently works as an Integrator/SysAdmin/DevOps for SecureOps, a Security Services company in Montreal, Canada.
Elasticsearch for Time-series Data at Scale
by Andrew Selden, Solution Architect, Elastic
Elasticsearch has evolved beyond search and logging to be a first-class, time-series metric store. This talk will explore how to achieve 1 million metrics/second on a relatively modest cluster. We will take a look at issues such as data modeling, debugging, tuning, sharding, rollups and more.
Andrew Selden has been running Elasticsearch at scale since 2011 where he previously led the search, NLP, and data engineering teams at Meltwater News and later developed streaming analytics solutions for BlueKai’s advertising platform (acquired by Oracle). He started his tenure at Elastic as a core engineer and for the last two years has been helping customers architect and scale.
Hope to see you there!
By Jon Bratseth, Distinguished Architect, Oath
I had the wonderful opportunity to present Vespa at the SF Big Analytics Meetup on September 26th, hosted by Amplitude. Several members of the Vespa team (Kim, Frode and Kristian) also attended. We all enjoyed meeting with members of the Big Analytics community to discuss how Vespa could be helpful for their companies. Thank you to Chester Chen, T.J. Bay, and Jin Hao Wan for planning the meetup, and here’s our presentation, in case you missed it (slides are also available here):
Largely developed by Yahoo engineers, Vespa is our big data processing and serving engine, available as open source on GitHub. It’s in use by many products, such as Yahoo News, Yahoo Sports, Yahoo Finance and Oath Ads Platforms.
Vespa use is growing even more rapidly; since it is open source under a permissive Apache license, Vespa can power other external third-party apps as well.
A great example is Zedge, which uses Vespa for search and recommender systems to support content discovery for personalization of mobile phones (Android, iOS, and Web). Zedge uses Vespa in production to serve millions of monthly active users.
Visit https://vespa.ai/ to learn more and download the code. We encourage code contributions and welcome opportunities to collaborate.
By Ian Flint, Network Automation Architect and Varun Varma, Senior Principal Engineer
The Oath network automation team is proud to announce that we are open-sourcing Panoptes, a distributed system for collecting, enriching and distributing network telemetry.
We developed Panoptes to address several issues inherent in legacy polling systems, including overpolling due to multiple point solutions for metrics, a lack of data normalization, consistent data enrichment and integration with infrastructure discovery systems.
Panoptes is a pluggable, distributed, high-performance data collection system which supports multiple polling formats, including SNMP and vendor-specific APIs. It is also extensible to support emerging streaming telemetry standards including gNMI.
The following block diagram shows the major components of Panoptes:
Panoptes is written primarily in Python, and leverages multiple open-source technologies to provide the most value for the least development effort. At the center of Panoptes is a metrics bus implemented on Kafka. All data plane transactions flow across this bus; discovery publishes devices to the bus, polling publishes metrics to the bus, and numerous clients read the data off of the bus for additional processing and forwarding. This architecture enables easy data distribution and integration with other systems. For example, in preparing for open-source, we identified a need for a generally available time series datastore. We developed, tested and released a plugin to push metrics into InfluxDB in under a week. This flexibility allows Panoptes to evolve with industry standards.
Check scheduling is accomplished using Celery, a horizontally scalable, open-source scheduler utilizing a Redis data store. Celery’s scalable nature combined with Panoptes’ distributed nature yields excellent scalability. Across Oath, Panoptes currently runs hundreds of thousands of checks per second, and the infrastructure has been tested to more than one million checks per second.
Panoptes ships with a simple, CSV-based discovery system. Integrating Panoptes with a CMDB is as simple as writing an adapter to emit a CSV, and importing that CSV into Panoptes. From there, Panoptes will manage the task of scheduling polling for the desired devices. Users can also develop custom discovery plugins to integrate with their CMDB and other device inventory data sources.
Finally, any metrics gathering system needs a place to send the metrics. Panoptes’ initial release includes an integration with InfluxDB, an industry-standard time series store. Combined with Grafana and the InfluxData ecosystem, this gives teams the ability to quickly set up a fully-featured monitoring environment.
Deployment at Oath
At Oath, we anticipate significant benefits from building Panoptes. We will consolidate four siloed polling solutions into one, reducing overpolling and the associated risk of service interruption. As vendors move toward streaming telemetry, Panoptes’ flexible architecture will minimize the effort required to adopt these new protocols.
There is another, less obvious benefit to a system like Panoptes. As is the case with most large enterprises, a massive ecosystem of downstream applications has evolved around our existing polling solutions. Panoptes allows us to continue to populate legacy datastores without continuing to run the polling layers of those systems. This is because Panoptes’ data bus enables multiple metrics consumers, so we can send metrics to both current and legacy datastores.
At Oath, we have deployed Panoptes in a tiered, federated model. We install the software in each of our major data centers and proxy checks out to smaller installations such as edge sites. All metrics are polled from an instance close to the devices, and metrics are forwarded to a centralized time series datastore. We have also developed numerous custom applications on the platform, including a load balancer monitor, a BGP session monitor, and a topology discovery application. The availability of a flexible, extensible platform has greatly reduced the cost of producing robust network data systems.
Panoptes’ open-source release is packaged for easy deployment into any Linux-based environment. Deployment is straightforward, so you can have a working system up in hours, not days.
We are excited to share our internal polling solution and welcome engineers to contribute to the codebase, including contributing device adapters, metrics forwarders, discovery plugins, and any other relevant data consumers.
By Joe Francis, Director, Storage & Messaging
We’re excited to share that The Apache Software Foundation announced today that Apache Pulsar has graduated from the incubator to a Top-Level Project. Apache Pulsar is an open-source distributed pub-sub messaging system, created by Yahoo in June 2015 and submitted to the Apache Incubator in June 2017.
Apache Pulsar is integral to the streaming data pipelines supporting Oath’s core products including Yahoo Mail, Yahoo Finance, Yahoo Sports and Oath Ad Platforms. It handles hundreds of billions of data events each day and is an integral part of our hybrid cloud strategy. It enables us to stream data between our public and private clouds and allows data pipelines to connect across the clouds.
Oath continues to support Apache Pulsar, with contributions including best-effort messaging, load balancer and end-to-end encryption. With growing data needs handled by Apache Pulsar at Oath, we’re focused on reducing memory pressure in brokers and bookkeepers, and creating additional connectors to other large-scale systems.
Apache Pulsar’s future is bright and we’re thrilled to be part of this great project and community.
P.S. We’re hiring! Learn more here.
By Arjun Mannaly, Senior Software Engineer
At Oath, multiple ad platforms use a high throughput, low latency distributed key-value database that runs in data centers all over the world. The database stores billions of records and handles millions of read and write requests per second at millisecond latencies. The data we have in this database must be persistent, and the working set is larger than what we can fit in memory. Therefore, a key component of the database performance is a fast storage engine. Our current solution had served us well, but it was primarily designed for a read-heavy workload and its write throughput started to be a bottleneck as write traffic increased.
There were other additional concerns as well; it took hours to repair a corrupted DB, or iterate over and delete records. The storage engine also didn’t expose enough operational metrics. The primary concern though was the write performance, which based on our projections, would have been a major obstacle for scaling the database. With these concerns in mind, we began searching for an alternative solution.
We searched for a key-value storage engine capable of dealing with IO-bound workloads, with submillisecond read latencies under high read and write throughput. After concluding our research and benchmarking alternatives, we didn’t find a solution that worked for our workload, thus we were inspired to build HaloDB. Now, we’re glad to announce that it’s also open source and available to use under the terms of the Apache license.
HaloDB has given our production boxes a 50% improvement in write capacity while consistently maintaining a submillisecond read latency at the 99th percentile.
HaloDB primarily consists of append-only log files on disk and an index of keys in memory. All writes are sequential writes which go to an append-only log file and the file is rolled-over once it reaches a configurable size. Older versions of records are removed to make space by a background compaction job.
The in-memory index in HaloDB is a hash table which stores all keys and their associated metadata. The size of the in-memory index, depending on the number of keys, can be quite large, hence for performance reasons, is stored outside the Java heap, in native memory. When looking up the value for a key, corresponding metadata is first read from the in-memory index and then the value is read from disk. Each lookup request requires at most a single read from disk.
The chart below shows the results of performance tests with real production data. The read requests were kept at 50,000 QPS while the write QPS was increased. HaloDB scaled very well as we increased the write QPS while consistently maintaining submillisecond read latencies at the 99th percentile.
The chart below shows the 99th percentile latency from a production server before and after migration to HaloDB.
By Dmitry Basin, Edward Bortnikov, Anastasia Braginsky, Eshcar Hillel, Idit Keidar, Hagar Meir, Gali Sheffi
Real-time analytics applications are on the rise. Modern decision support and machine intelligence engines strive to continuously ingest large volumes of data while providing up-to-date insights with minimum delay. For example, in Flurry Analytics, an Oath service which provides mobile developers with rich tools to explore user behavior in real time, it only takes seconds to reflect the events that happened on mobile devices in its numerous dashboards. The scalability demand is immense – as of late 2017, the Flurry SDK was installed on 2.6B devices and monitored 1M+ mobile apps. Mobile data hits the Flurry backend at a huge rate, updates statistics across hundreds of dimensions, and becomes queryable immediately. Flurry harnesses the open-source distributed interactive analytics engine named Druid to ingest data and serve queries at this massive rate.
In order to minimize delays before data becomes available for analysis, technologies like Druid should avoid maintaining separate systems for data ingestion and query serving, and instead strive to do both within the same system. Doing so is nontrivial since one cannot compromise on overall correctness when multiple conflicting operations execute in parallel on modern multi-core CPUs. A promising approach is using concurrent data structure (CDS) algorithms which adapt traditional data structures to multiprocessor hardware. CDS implementations are thread-safe – that is, developers can use them exactly as sequential code while maintaining strong theoretical correctness guarantees. In recent years, CDS algorithms enabled dramatic application performance scaling and became popular programming tools. For example, Java programmers can use the ConcurrentNavigableMap JDK implementations for the concurrent ordered key-value map abstraction that is instrumental in systems like Druid.
Today, we are excited to share Oak, a new open source project from Oath, available under the Apache License 2.0. The project was created by the Scalable Systems team at Yahoo Research. It extends upon our earlier research work, named KiWi.
Oak is a Java package that implements OakMap – a concurrent ordered key-value map. OakMap’s API is similar to Java’s ConcurrentNavigableMap. Java developers will find it easy to switch most of their applications to it. OakMap provides the safety guarantees specified by ConcurrentNavigableMap’s programming model. However, it scales with the RAM and CPU resources well beyond the best-in-class ConcurrentNavigableMap implementations. For example, it compares favorably to Doug Lea’s seminal ConcurrentSkipListMap, which is used by multiple big data platforms, including Apache HBase, Druid, EVCache, etc. Our benchmarks show that OakMap harnesses 3x more memory, and runs 3x-5x faster on analytics workloads.
OakMap’s implementation is very different from traditional implementations such as ConcurrentSkipListMap. While the latter maintains all keys and values as individual Java objects, OakMap stores them in very large memory buffers allocated beyond the JVM-managed memory heap (hence the name Oak – abbr. Off-heap Allocated Keys). The access to the key-value pairs is provided by a lightweight two-level on-heap index. At its lower level, the references to keys are stored in contiguous chunks, each responsible for a distinct key range. The chunks themselves, which dominate the index footprint, are accessed through a lightweight top-level ConcurrentSkipListMap. The figure below illustrates OakMap’s data organization.
The maintenance of OakMap’s chunked index in a concurrent setting is the crux of its complexity as well as the key for its efficiency. Experiments have shown that our algorithm is advantageous in multiple ways:
1. Memory scaling. OakMap’s custom off-heap memory allocation alleviates the garbage collection (GC) overhead that plagues Java applications. Despite the permanent progress, modern Java GC algorithms do not practically scale beyond a few tens of GBs of memory, whereas OakMap scales beyond 128GB of off-heap RAM.
2. Query speed. The chunk-based layout increases data locality, which speeds up both single-key lookups and range scans. All queries enjoy efficient, cache-friendly access, in contrast with permanent dereferencing in object-based maps. On top of these basic merits, OakMap provides safe direct access to its chunks, which avoids an extra copy for rebuilding the original key and value objects. Our benchmarks demonstrate OakMap’s performance benefits versus ConcurrentSkipListMap:
A) Up to 2x throughput for ascending scans.
B) Up to 5x throughput for descending scans.
C) Up to 3x throughput for lookups.
3. Update speed. Beyond avoiding the GC overhead typical for write-intensive workloads, OakMap optimizes the incremental maintenance of big complex values – for example, aggregate data sketches, which are indispensable in systems like Druid. It adopts in situ computation on objects embedded in its internal chunks to avoid unnecessary data copy, yet again. In our benchmarks, OakMap achieves up to 1.8x data ingestion rate versus ConcurrentSkipListMap.
With key-value maps being an extremely generic abstraction, it is easy to envision a variety of use cases for OakMap in large-scale analytics and machine learning applications – such as unstructured key-value storage, structured databases, in-memory caches, parameter servers, etc. For example, we are already working with the Druid community on rebuilding Druid’s core Incremental Index component around OakMap, in order to boost its scalability and performance.
We look forward to growing the Oak community! We invite you to explore the project, use OakMap in your applications, raise issues, suggest improvements, and contribute code. If you have any questions, please feel free to send us a note on the Oak developers list: [email protected]. It would be great to hear from you!