Distributed RDF Engine with Adaptive Query
Optimization & Minimal Communication
DREAM employs the Quadrant-IV paradigm. As such, it stores an RDF dataset unsliced at each cluster machine and adopts a graph-based, rule-oriented query planner, which effectively partitions any given SPARQL query, Q. In particular, the query planner transforms Q into a graph, G, decomposes G into many sets of sub-graphs (using four main rules), and maps each set to a separate machine. Afterwards, all machines process their sets of sub-graphs in parallel and coordinate with each other to produce the final query result. No intermediate data is shuffled whatsoever and only minimal control messages and meta-data (or what we refer to as auxiliary data) are exchanged. To decide upon the number of sets (which dictates the number of machines) and their constituent sub-graphs (i.e., G's graph plan), DREAM’s query planner enumerates various possibilities and selects a graph plan that will expectedly result in the lowest I/O cost for G. This is achieved through utilizing a new cost model, which relies on RDF graph statistics. In the view of that, different numbers of machines for different query types are pursued by DREAM, hence, rendering it adaptive. For details on how exactly the query planner of DREAM works, please refer to Section 3.2 “DREAM: Distributed RDF Engine with Adaptive Query Planner and Minimal Communication”.