Monday, August 18, 2014

DeltaDB - A Scalable Database Design for Time-Varying Schema-Free Data

DeltaDB is a log-structured database and query model designed for time-varying, schema-free data. The following video gives a high-level overview of DeltaDB and describes how the model scales using MapReduce.



This database design is implemented within CCTools in two parts. Part 1 (data storage) has been available for over a year and is called the catalog server. Part 2 (data analysis) has recently been implemented and is not yet in a release, but is available in the following commit:

https://github.com/pivie/cctools/commit/bca998baf00c71484b567110d73c36bd042c3b3e





The data model is designed to handle schema-free status reports from various services. Although the reports are schema-free, most fields normally remain the same between consecutive reports from the same instance of a service.






The first status report is saved in its entirety, and subsequent reports are saved as changes (or "deltas") relative to the original report. Snapshots of the status of all services and instances are stored on a daily basis. This allows a query over a given time frame to jump quickly to the start of that time frame, rather than having to start at the very beginning of the life of the catalog server.
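
The storage idea can be illustrated with a minimal Python sketch. This is not the catalog server's actual on-disk format or code: the field names (host, cores, tasks_running, disk_free_gb), the log layout, and the helper functions are all hypothetical, chosen only to show a full first report followed by deltas.

```python
# Hypothetical sketch of delta encoding for schema-free status reports.
# Not the catalog server's real format; all names here are illustrative.

_MISSING = object()  # sentinel so that a stored value of None is still compared correctly


def make_delta(previous, current):
    """Return the fields that were added or changed, and the fields that disappeared."""
    updated = {k: v for k, v in current.items() if previous.get(k, _MISSING) != v}
    removed = [k for k in previous if k not in current]
    return {"updated": updated, "removed": removed}


def apply_delta(state, delta):
    """Rebuild the next report from the previous state and one delta."""
    new_state = dict(state)
    new_state.update(delta["updated"])
    for key in delta["removed"]:
        new_state.pop(key, None)
    return new_state


def encode(reports):
    """Store the first report in full and every later report as a delta."""
    log = []
    previous = None
    for report in reports:
        if previous is None:
            log.append(("full", dict(report)))
        else:
            log.append(("delta", make_delta(previous, report)))
        previous = report
    return log


def replay(log):
    """Reconstruct every report by replaying the log from its first full entry."""
    states = []
    state = None
    for kind, payload in log:
        state = dict(payload) if kind == "full" else apply_delta(state, payload)
        states.append(state)
    return states


reports = [
    {"type": "worker", "host": "node1", "cores": 8, "tasks_running": 0},
    {"type": "worker", "host": "node1", "cores": 8, "tasks_running": 3},
    {"type": "worker", "host": "node1", "cores": 8, "tasks_running": 3, "disk_free_gb": 120},
]
log = encode(reports)
print(log[1])  # only the changed field is stored for the second report
```

In this sketch a daily snapshot would correspond to writing another "full" entry, so that a replay can begin at the snapshot nearest the start of the requested time frame instead of at the first entry.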




A query is performed by applying a series of operators to the data. In a distributed system, spatial distribution means that the data is partitioned so that the data for a given instance always ends up on the same node. In this situation, all but the last of the operators can be performed in the map stage of the MapReduce model. This improves scalability because less work has to be performed by a single node in the reduce stage.
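
A rough Python sketch of this arrangement follows. It is an illustration under assumptions, not DeltaDB's query engine: the record fields, the hash-based placement in node_for, and the three example operators (a selection, a per-instance peak, and a final sum) are hypothetical stand-ins for whatever operators a real query would use.

```python
# Hypothetical sketch: spatial distribution lets per-instance operators run in the map stage.
import zlib
from collections import defaultdict


def node_for(instance_key, num_nodes):
    """Stable hash placement, so a given instance always lands on the same node."""
    return zlib.crc32(instance_key.encode("utf-8")) % num_nodes


def partition(records, num_nodes):
    """Distribute records across nodes by instance key."""
    nodes = defaultdict(list)
    for rec in records:
        nodes[node_for(rec["instance"], num_nodes)].append(rec)
    return nodes


def map_stage(node_records):
    """Runs independently on each node: select workers, then take each instance's peak."""
    peaks = {}
    for rec in node_records:
        if rec["type"] != "worker":  # operator 1: selection
            continue
        key = rec["instance"]
        # operator 2: per-instance aggregate; complete here because
        # every record for this instance is on this node
        peaks[key] = max(peaks.get(key, 0), rec["tasks_running"])
    return peaks


def reduce_stage(partials):
    """Only the final operator runs here: sum the per-instance peaks from all nodes."""
    return sum(peak for partial in partials for peak in partial.values())


records = [
    {"instance": "node1:9000", "type": "worker", "tasks_running": 3},
    {"instance": "node2:9000", "type": "worker", "tasks_running": 5},
    {"instance": "node1:9000", "type": "worker", "tasks_running": 7},
    {"instance": "cat:9097", "type": "catalog", "tasks_running": 0},
    {"instance": "node2:9000", "type": "worker", "tasks_running": 2},
]
total = reduce_stage(map_stage(recs) for recs in partition(records, 3).values())
print(total)  # 12 for this sample data (peaks of 7 and 5)
```

Because each instance's records sit on one node, the per-instance peak computed in the map stage is already final, and the reduce stage only has to combine one small value per instance.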


Much more detail is provided in a paper published at IEEE BigData 2014, available at the following URL:

http://ccl.cse.nd.edu/research/papers/pivie-deltadb-2014.pdf

For further inquiries, please email pivie@nd.edu.


Friday, August 1, 2014

Packaging Applications with Parrot 4.2.0

CCTools 4.2.0 includes a new Parrot feature that automatically observes all of the files used by a given application and collects them into a self-contained package. The package can then be moved to another machine, even one running a different variant of Linux, and run correctly with all of its dependencies present. The created package does not depend upon Parrot and can be re-run in a variety of ways.
  
This article explains how to generate a self-contained package and then share it so that others can verify and repeat your application. The whole process involves three steps: running the original application, creating the self-contained package, and then running the package itself.


Figure 1: Packaging Procedure

Step 1: Run the original program

Run your program under parrot_run and record the list of accessed file names and the environment variables by using the --name-list and --env-list options.

parrot_run --name-list namelist --env-list envlist /bin/bash
 
This command starts a shell under parrot_run, in which you can run your program. When you are done, simply exit the shell. At the end of step 1, two files will have been generated: namelist, containing the names of all accessed files, and envlist, containing the environment variables.

Step 2: Generate a self-contained package

Use parrot_package_create to generate a package based on the namelist and envlist generated in step 1.

parrot_package_create --name-list namelist --env-path envlist --package-path /tmp/package
 
This command causes all of the files given in the name list to be copied into the package directory /tmp/package.  You may customize the contents of the package by editing the namelist or the package directory by hand.

Step 3: Repeat the program using the package

The newly created package is simply a complete filesystem tree that can be moved to any convenient location.  It can be re-run by any method that treats the package as a self-contained root filesystem.  This can be done by using Parrot again, by setting up a chroot environment, by setting up a Linux container, or by creating a virtual machine.

To run the package using Parrot, do this:

parrot_package_run --package-path /tmp/package /bin/bash 

To run the package using chroot, do this:

chroot_package_run --package-path /tmp/package /bin/bash

In both cases, you will be dropped into a shell in the preserved environment, where all the files used by the original command will be present. You will be able to run the original command; whether you can run other programs depends upon the quantity of data preserved.

For more information, see these man pages: