
Storage of Experimental Data

Posted: Mon Aug 16, 2010 3:31 pm
by BenTC
Some informative extracts from idle chatter about storage of experimental data, which may be of interest to some.
From... Ask Slashdot - How-Do-You-Organize-Your-Experimental-Data
Hello, I'm a space research guy.
I've recently made a comparison of MySQL 5.0, Oracle 10g, and HDF5 file-based data storage for our space data. The results are amusing (the linked page contains charts and explanations; pay attention to the concluding chart, it is the most informative one). In short (for those who don't want to look at the charts): using relational databases for pretty common scientific tasks sucks badly, performance-wise.

Disclaimer: I'm positively not a guru DBA, so I admit that both of the databases tested could be configured and optimized better. But the thing is, I'm not supposed to have to be one, and neither is the OP. While we may do lots of programming work to accomplish our scientific tasks, being a qualified DBA is a completely separate challenge - and an unwanted one, as well.

So far, PyTables/HDF5 FTW. Which brings us back to the OP's question about organizing these files...
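
For anyone who hasn't tried it, here is a minimal sketch of the PyTables/HDF5 approach using the current PyTables API (the 2010-era camelCase method names differ slightly); the file, group, and array names are just placeholders:
[code]
# Minimal PyTables/HDF5 sketch (file and dataset names are made up for illustration).
import numpy as np
import tables

# Write one run's worth of samples into a single HDF5 file.
samples = np.random.random((100000, 3))          # fake telemetry: 100k rows x 3 channels
with tables.open_file("run_20100816.h5", mode="w", title="Example run") as h5:
    grp = h5.create_group("/", "telemetry", "Raw instrument samples")
    h5.create_array(grp, "samples", samples, "100k x 3 float64 samples")

# Read it back; only the requested slice is pulled off disk.
with tables.open_file("run_20100816.h5", mode="r") as h5:
    first_rows = h5.root.telemetry.samples[:10]
    print(first_rows.shape)
[/code]
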
Instead of trying to organize your data into a directory structure, use tagging. There's a lot of theory on this -- originally from library science, and more recently from user interface studies. The basic idea is that you often want your data to be in more than one category. In the old days you couldn't do this, because in a library a book had to be on one and only one shelf. In the digital world you can put a book on more than one "shelf" by assigning multiple tags to it.

Then, to find what you want, get a search engine that supports faceted navigation.[wikipedia.org]

Four "facets" of ten nodes each have the same discriminatory power as a single hierarchy of 10,000 nodes. It's simpler, cleaner, faster, and you don't have to reorganize anything. Just be careful about how you select the facets/tags. Use some kind of controlled vocabulary, which you may already have.

There are a bunch of companies that sell such search engines, including Dieselpoint, Endeca, Fast, etc.
I have very similar data collection requirements and strategy with one exception: the data that can be made human-readable in original format are made so. Always. Every original file that gets written has the read-only bit turned on (or writeable bit turned off, whichever floats your boat) as soon as it is closed. Original files are NEVER EVER DELETED and NEVER EVER MODIFIED. If a mistake is discovered requiring a modification to a file, a specially tagged version is created, but the original is never deleted or modified.
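
A sketch of that "write-protect on close" habit in Python; the file name is hypothetical, and on a real system you might enforce this at the filesystem or archive level instead:
[code]
# Drop the write bits on a freshly closed data file so it can't be modified by accident.
import os
import stat

def write_protect(path):
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

with open("100816-093210-log.txt", "w") as f:
    f.write("acquisition started\n")
write_protect("100816-093210-log.txt")   # the original is now read-only, never edited in place
[/code]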

Also, every single data file, log file, and whatever else needs to be associated with it is named with a YYMMDD-HHMMSS- prefix and, since experiments in my world are day-based, goes into a single directory called YYMMDD. I've used this system for nearly 20 years now and haven't screwed up by using the wrong file yet. Files are always named in a way that (a) a directory listing with an alpha sort produces an ordering that makes sense and is useful, and (b) there is no doubt as to which experiment was done.
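
The naming scheme is easy to automate; here is a small sketch (the root directory and file suffix are assumptions):
[code]
# Build the YYMMDD/YYMMDD-HHMMSS- naming convention described above.
import os
from datetime import datetime

def new_data_path(suffix, root="."):
    now = datetime.now()
    day_dir = os.path.join(root, now.strftime("%y%m%d"))      # e.g. 100816/
    os.makedirs(day_dir, exist_ok=True)
    prefix = now.strftime("%y%m%d-%H%M%S-")                   # e.g. 100816-153105-
    return os.path.join(day_dir, prefix + suffix)

print(new_data_path("magnetometer.h5"))   # sorts chronologically in an alpha-sorted listing
[/code]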

In addition, every variable that is created in the original data files has a clear, descriptive, and somewhat verbose name that is replicated through in the MATLAB structures.

Finally, and very importantly, the code that ran on the data collection machines is archived with each day's data set, so that when bugs are discovered we know EXACTLY which data sets were affected. As a scientist, your data files are your most valuable possessions and need to be accorded the appropriate care. If you're still doing anything ad hoc after more than one experiment, then you aren't putting enough time into devising a proper system.
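
One simple way to do that (paths and names below are placeholders) is to copy the exact acquisition script, plus a checksum, into the day's directory at the start of every run:
[code]
# Snapshot the acquisition code next to the data it produced, with a checksum for later auditing.
import hashlib
import os
import shutil
import sys

def archive_code(day_dir):
    script = os.path.abspath(sys.argv[0])              # the script actually running this acquisition
    dest = os.path.join(day_dir, "code-" + os.path.basename(script))
    shutil.copy2(script, dest)
    digest = hashlib.sha256(open(script, "rb").read()).hexdigest()
    with open(os.path.join(day_dir, "code-sha256.txt"), "a") as f:
        f.write(f"{digest}  {os.path.basename(script)}\n")

archive_code("100816")
[/code]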

(I once described my data collection strategy to a scientific instrument vendor and he offered me a job on the spot.)

I also make sure that when figures are created for my papers, I've got a clear and absolutely reproducible path from the raw data to the final figures that involves ZERO manual intervention. If I connect to the appropriate directory and type "make clean ; make", it may take a few hours or days to complete, but the figures will be regenerated, down to every single arrow and label. For the aspiring scientist (and all of the people working in my lab who might be reading this), this is perhaps the most important piece of advice I can give. Six months, two years, five years from now, when someone asks you about a figure and you need to understand how it was created, the *only* way of knowing these days is having a fully scripted path from raw data to final figure. Anything that required manual intervention generally cannot be proven to have been done correctly.
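
The poster drives this with make; as a rough sketch of the same idea in a single Python script (the raw file name, columns, and figure details are all invented), the only path from raw data to a published figure is code like this:
[code]
# Regenerate a paper figure straight from the raw data file -- no hand editing anywhere.
import numpy as np
import matplotlib
matplotlib.use("Agg")                      # no GUI needed; runs fine on a headless build machine
import matplotlib.pyplot as plt

raw = np.loadtxt("100816-093210-magnetometer.csv", delimiter=",")   # hypothetical raw dump
t, b = raw[:, 0], raw[:, 1]

fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(t, b, lw=0.8)
ax.set_xlabel("time (s)")
ax.set_ylabel("B (nT)")
ax.set_title("Run 100816-093210")
fig.tight_layout()
fig.savefig("fig_magnetometer.pdf")        # rerunning this script is the only way the figure changes
[/code]
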
Document Management Systems [wikipedia.org] are great - they combine (some of) the benefits of source control, file systems, and email (collaboration).

I would recommend just downloading a VM or cloud image of something like KnowledgeTree or Alfresco (I personally prefer Alfresco), and running it on the free VMware Player, or on a real VM solution if you have one.

I recently set up a demo showing the benefits of such a system. In about one day I was able to download and set up Alfresco, expose a CIFS interface (i.e., \\192.168.x.x\documents), and just dump a portion of my entire document base into the system. After digestion, the system had all the documents full-text-indexed (yes, even Word docs and Excel files, thanks to the OpenOffice libraries), and I could go about changing the directory structure, moving around and renaming files, etc., and the version control would show me the changes. In fact, I could go into the backend and write SQL queries if I wanted to, with detailed reports of how things were on date X or Y revisions ago. It was quite sweet. All the while, the users still saw the same Windows directory structure, and modifications they made there would be versioned and recorded in Alfresco's database.

Here is a Bitnami VM image [bitnami.org] that will save you days of configuration. If the solution works for you but is slow, just DL the native stack and migrate or re-import.
Alfresco is actually something I've been wanting to try out for a while.

---------------
And of course, what's Slashdot without some trolling humour...
Climatologists will need four directories...
$PRJ_ROOT/data/theoretical
$PRJ_ROOT/data/fits
$PRJ_ROOT/data/doesnt_fit
$PRJ_ROOT/data/doesnt_fit/fixed
$PRJ_ROOT/data/made_up

CouchDB

Posted: Wed Sep 15, 2010 9:24 pm
by Jeff Mauldin
Some people at my work are experimenting with CouchDB for dealing with datasets, results of algorithms run on datasets, etc. It's a RESTful database, and it provides reasonable structure with URL access. It doesn't have the rigidity of a relational database, and it gives more structure and ease of access than just arranging your data in a file system.
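
For the curious, storing a result in CouchDB really is just an HTTP PUT of a JSON document; a quick sketch against a local server (the database name, document ID, and fields are made up, and it assumes the "results" database already exists):
[code]
# Store an analysis result as a JSON document in CouchDB over its REST interface.
import json
import urllib.request

doc = {
    "dataset": "100816-093210-magnetometer",
    "algorithm": "fft_peak_finder",
    "peaks_hz": [12.5, 25.1],
}
req = urllib.request.Request(
    "http://127.0.0.1:5984/results/100816-093210-fft",   # PUT /<db>/<doc_id>
    data=json.dumps(doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
print(urllib.request.urlopen(req).read().decode())        # -> {"ok":true,"id":...,"rev":...}
[/code]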

I also see that at least one commenter in the linked article mentioned CouchDB.

Posted: Thu Sep 16, 2010 4:01 pm
by BenTC
That looks interesting. My work involves going to client sites a lot for commissioning and support callouts. I've been considering a doc-control system where each laptop has a local copy of client history and programs, so that you have everything you could possibly need when the heat is on. I liked this from the intro:
[quote]CouchDB is a peer-based distributed database system. Any number of CouchDB hosts (servers and offline clients) can have independent "replica copies" of the same database, where applications have full database interactivity (query, add, edit, delete). When back online or on a schedule, database changes are replicated bi-directionally.

CouchDB has built-in conflict detection and management and the replication process is incremental and fast, copying only documents and individual fields changed since the previous replication. Most applications require no special planning to take advantage of distributed updates and replication.[/quote]
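
For the laptop-sync use case, the replication described in that quote is triggered through CouchDB's standard _replicate endpoint; a hedged sketch (the host name and database names are placeholders):
[code]
# Ask a local CouchDB to pull the office copy of a client database onto this laptop,
# and push local edits back, using the standard _replicate endpoint.
import json
import urllib.request

def replicate(source, target):
    body = json.dumps({"source": source, "target": target}).encode("utf-8")
    req = urllib.request.Request(
        "http://127.0.0.1:5984/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req).read().decode()

# Bi-directional sync = one replication in each direction.
print(replicate("http://office-server:5984/client_history", "client_history"))
print(replicate("client_history", "http://office-server:5984/client_history"))
[/code]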