Distributed RDF Engine with Adaptive Query
Optimization & Minimal Communication
Centralized Versus Distributed RDF Systems
The wide adoption of the RDF data model has called for efficient and scalable RDF management systems. In response, many centralized [1-8] and distributed [9-13] RDF systems have been proposed in the literature. A main characteristic of centralized systems is that they incur no communication overhead (i.e., they process all data locally), but they remain limited by the computational power and memory capacity of a single machine. With distributed systems, on the other hand, RDF data is typically partitioned among clustered machines using partitioning algorithms such as hash or graph partitioning. As such, compared to centralized systems, distributed RDF systems offer larger aggregate memory capacities and higher computational power. However, distributed RDF systems may incur substantial intermediate data shuffling when evaluating (complex) SPARQL queries, especially if queries span multiple disjoint partitions. In principle, intermediate data shuffling can greatly degrade query performance, and reducing it is one of the major research challenges in designing distributed RDF systems.
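To make the partitioning scheme concrete, the following is a minimal sketch (our own illustration, not any particular system's implementation) of subject-based hash partitioning, the simplest of the algorithms mentioned above. The example triples and machine count are hypothetical.

```python
import zlib

def machine_of(subject, num_machines):
    """Deterministically map a triple's subject to a machine id."""
    return zlib.crc32(subject.encode()) % num_machines

def hash_partition(triples, num_machines):
    """Split (s, p, o) triples across machines by hashing the subject."""
    partitions = [[] for _ in range(num_machines)]
    for s, p, o in triples:
        partitions[machine_of(s, num_machines)].append((s, p, o))
    return partitions

triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:age",   "34"),
    ("ex:bob",   "ex:knows", "ex:carol"),
]
parts = hash_partition(triples, num_machines=2)
# All of ex:alice's triples land on one machine, but a query joining
# ex:alice's neighbors with ex:bob's neighbors may span partitions and
# hence trigger the intermediate data shuffling described above.
```

Note that all triples sharing a subject are co-located, which serves star-shaped queries well, yet any join that hops from one subject to another can still cross machine boundaries.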
The Four RDF Paradigms: A New Classification
In this work, we promote a new classification of the paradigms by which RDF systems can be designed. In particular, we suggest that RDF systems can be built in four different ways, as portrayed in Figure 1. All existing RDF systems fall under Quadrants I, II, and III: they either store an input RDF dataset, D, unpartitioned at a single machine and do not partition a SPARQL query, Q (i.e., Quadrant-I, or centralized), or partition D and/or Q (i.e., Quadrants II and III). Interestingly, no RDF system yet falls under Quadrant-IV. Under Quadrant-IV, D is maintained in its entirety at each machine while Q is partitioned. Consequently, data shuffling can be avoided completely (i.e., each machine has all the data) while computational power and memory capacity are scaled up, thus offering a hybrid paradigm between centralized and distributed schemes.
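The Quadrant-IV idea can be illustrated with a toy evaluator (our own sketch, not DREAM's actual engine): the dataset D is replicated on every machine, the query's triple patterns are divided among machines, each machine answers its patterns entirely against its full local copy, and only the resulting variable bindings are merged. The dataset, patterns, and variable syntax below are made up for illustration.

```python
# D, replicated in full on every machine (Quadrant-IV: no data partitioning).
FULL_DATASET = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob",   "ex:knows", "ex:carol"),
    ("ex:bob",   "ex:age",   "29"),
]

def match(pattern, dataset):
    """Return variable bindings for one triple pattern ('?x' marks a variable)."""
    results = []
    for triple in dataset:
        binding = {}
        if all(term == part or (term.startswith("?") and
                                binding.setdefault(term, part) == part)
               for term, part in zip(pattern, triple)):
            results.append(binding)
    return results

def quadrant_iv_eval(patterns):
    # Partition Q, not D: in a real cluster, pattern i would be shipped to
    # machine i, and each machine would evaluate it against its own full
    # copy of D, so no raw data is shuffled between machines.
    per_machine = [match(p, FULL_DATASET) for p in patterns]
    # Merge only the (small) binding sets, keeping combinations whose
    # shared variables agree.
    merged = [{}]
    for rows in per_machine:
        merged = [dict(acc, **row) for acc in merged for row in rows
                  if all(acc.get(k, v) == v for k, v in row.items())]
    return merged

answers = quadrant_iv_eval([("?x", "ex:knows", "?y"),
                            ("?y", "ex:age",   "?a")])
# -> [{'?x': 'ex:alice', '?y': 'ex:bob', '?a': '29'}]
```

The point of the sketch is what crosses the network: only the per-pattern bindings (a few rows), never slices of D itself.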
To this end, we note two main points. First, partitioning RDF graphs (a problem that is NP-hard in theory), as suggested under Quadrants II and III, can prove extremely intricate and expensive. In essence, the larger and more convoluted the RDF graphs, the harder graph partitioning becomes. Second, different partitioning algorithms suit different queries; thus, there is no one-size-fits-all partitioning algorithm for all types of queries (e.g., simple and complex queries with many variations). Indeed, any partitioning algorithm will cause intermediate data shuffling for some query workloads. Our objective is to overcome these two problems entirely and attain minimal intermediate data communication. A simple and effective approach to meet this objective is to pursue Quadrant-IV and, accordingly: (1) preclude the complexity of partitioning algorithms altogether, and (2) offer a one-size-fits-all paradigm for all sorts of SPARQL queries. We realize our objective through DREAM, a Quadrant-IV citizen and the first of its kind.
Why Is Quadrant-IV Feasible?
The authors of GraphChi [14] observed that Big Graphs are not Big Data, indicating the feasibility of the Quadrant-IV paradigm. To put this in perspective, the largest RDF dataset we know of today consists of 13.6 billion triples, which amounts to only 2.5TB. Modern physical disks can store 6TB of data [15]. Furthermore, on Amazon EC2, users can provision EC2 instances with disk capacities of 24×2048GB. Moreover, a user can attach multiple (e.g., 24) Amazon Elastic Block Storage (EBS) volumes to a single EC2 instance, each with a capacity of 1TB. In the “DREAM: Distributed RDF Engine with Adaptive Query Planner and Minimal Communication” paper, we test DREAM with 7 billion triples (or 1.2TB) on Amazon EC2 using the r3.2xlarge instance type and EBS volumes. The results demonstrate the effectiveness of DREAM on large-scale datasets.
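The capacity argument above can be verified with quick back-of-envelope arithmetic (our own calculation, not from the paper, assuming decimal terabytes):

```python
# Storage figures quoted above: 13.6 billion triples occupy 2.5 TB,
# and a single EC2 instance can attach 24 EBS volumes of 1 TB each.
TB = 10**12

bytes_per_triple = 2.5 * TB / 13.6e9   # implied on-disk footprint per triple
ebs_capacity_tb = 24 * 1               # 24 EBS volumes x 1 TB each

print(round(bytes_per_triple))         # ~184 bytes per stored triple
print(ebs_capacity_tb >= 2.5)          # True: one machine can hold all of D
```

At roughly 184 bytes per triple, even the largest known RDF dataset fits comfortably within what a single commodity machine can attach, which is the premise that makes full replication under Quadrant-IV practical.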
[1] Kevin Wilkinson, Craig Sayers, Harumi A. Kuno, Dave Reynolds, et al. Efficient RDF storage and retrieval in Jena2. In SWDB, volume 3, 2003.
[2] Jeen Broekstra, Arjohn Kampman, and Frank van Harmelen. Sesame: A generic architecture for storing and querying RDF and RDF Schema. In ISWC, 2002.
[3] Eugene Inseok Chong, Souripriya Das, George Eadon, and Jagannathan Srinivasan. An efficient SQL-based RDF querying scheme. In VLDB, 2005.
[4] Renzo Angles and Claudio Gutierrez. Querying RDF data from a graph database perspective. In The Semantic Web: Research and Applications, 2005.
[5] Daniel J. Abadi, Adam Marcus, Samuel R. Madden, and Kate Hollenbach. Scalable semantic web data management using vertical partitioning. In VLDB, 2007.
[6] Cathrin Weiss, Panagiotis Karras, and Abraham Bernstein. Hexastore: Sextuple indexing for semantic web data management. PVLDB, 1(1), 2008.
[7] Thomas Neumann and Gerhard Weikum. The RDF-3X engine for scalable management of RDF data. The VLDB Journal, 19(1), 2010.
[8] Pingpeng Yuan, Pu Liu, Buwen Wu, Hai Jin, Wenya Zhang, and Ling Liu. TripleBit: A fast and compact system for large scale RDF data. PVLDB, 6(7), 2013.
[9] Jiewen Huang, Daniel J. Abadi, and Kun Ren. Scalable SPARQL querying of large RDF graphs. PVLDB, 4(11), 2011.
[10] Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao, and Zhongyuan Wang. A distributed graph engine for web scale RDF data. In VLDB, 2013.
[11] Bin Shao, Haixun Wang, and Yatao Li. Trinity: A distributed graph engine on a memory cloud. In SIGMOD, 2013.
[12] Nikolaos Papailiou, Ioannis Konstantinou, Dimitrios Tsoumakos, Panagiotis Karras, and Nectarios Koziris. H2RDF+: High-performance distributed joins over large-scale RDF graphs. In IEEE Big Data, 2013.
[13] Sairam Gurajada, Stephan Seufert, Iris Miliaraki, and Martin Theobald. TriAD: A distributed shared-nothing RDF engine based on asynchronous message passing. In SIGMOD, 2014.
[14] Aapo Kyrola, Guy E. Blelloch, and Carlos Guestrin. GraphChi: Large-scale graph computation on just a PC. In OSDI, 2012.
[15] Seagate.