This database design is implemented within CCTools in two parts. Part 1 (data storage) has been available for over a year and is called the catalog server. Part 2 (data analysis) has recently been implemented and is not yet in a release, but is available in the following commit:
The data model is designed to handle schema-free status reports from various services. And while the reports can be schema-free, most of the fields will normally remain the same between subsequent reports from the same instance of a service.
The first status report is saved in it's entirety, and then the subsequent reports are saved as changes (or "deltas") on the original report. Snapshots of the status of all services and instances are stored on a daily basis. This allows a query for analysis based on a given time frame to jump more quickly to the start of the time frame, rather than have to start at the very beginning of the life of the catalog server.
A query is performed by applying a series of operators to the data. For a distributed system, spatial distribution is when the data is distributed such that a given instance always ends up on the same node. In this situation, all but the last of the operators can be performed in the map stage of the MapReduce model. This allows for better scalability because less work has to be performed by a single node in the reduce stage.
For further inquiries, please email email@example.com.