-rw-r--r-- | README | 56 |
1 files changed, 37 insertions, 19 deletions
@@ -11,26 +11,43 @@ their meta-data in a number of different ways. An exploration of the
 phylogenetic properties of the records first requires that the
 available data be collected and inventoried.
 
-Two primary alternatives have been identified for managing the data. A
-relational database can be used. IBM DB2 has been used for this. The use of
-a relational database is limited by the difficulty in sharing the data. Each
-vendor uses incompatible import and export routines. Additionally installing
-an instance of a database management system (DBMS) often requires a large
-amount of effort and many not be practical on hosted environments which do not
-support the running of user daemons. Finally proper parallelization of a DBMS
-will require additional system specific configuration for each machine used.
-
-An alternative to the DBMS is to use a container file format such as HDF5.
-This has the advantage that all of the data can be collected into a single
-file which can then be shared with others. It has the disadvantage that is
-lacks the robust search and SQL operations provided by a DBMS. In addition to
-two alternatives use fundamentally different storage strategies with the DBMS
-using a relational model and the contain file format using a hierarchical
-model.
+Two primary alternatives have been identified for managing the data.
+A relational database can be used. IBM DB2 has been used for this in
+exp004. The use of a relational database is limited by the difficulty
+in sharing the data. Each vendor uses incompatible import and export
+routines. Additionally installing an instance of a database
+management system (DBMS) often requires a large amount of effort and
+many not be practical on hosted environments which do not support the
+running of user daemons. Proper parallelization of a DBMS will
+require additional system specific configuration for each machine
+used. Generally a single DB2 instance with Internet connectivity has
+been used in conjunction with DB2 client installations on the
+analytical environments.
+
+An alternative to the DBMS is to use a container file format such as
+HDF5. This has the advantage that all of the data can be collected
+into a single file which can then be shared with others. It has the
+disadvantage that it lacks the robust search and SQL operations
+provided by a DBMS. These two alternatives use fundamentally
+different storage strategies with the DBMS using a relational model
+and the container file format using a hierarchical model.
 
 The "doc/Data Deployments.dia" diagram shows the source systems that
-expose the various records as well as the transform routines that are
-used for aggregation of the data on the local system.
+expose the various influenza records as well as the transform routines
+that are used for aggregation of the data on the local system.
+Initially it may appear that loading the text files directly into the
+HDF5 container is redundant, particularly as a pure pre-processing
+step. This will be a redundant effort for cases where tools are used
+which require yet another load step. For custom C programs however
+reading the data from disk and converting it from ASCII text to a
+native datatype is a necessary preprocessing step. Sharing the C
+struct definitions between HDF5 and the native code is the key
+differentiator between loading from text and loading from the binary
+HDF5 container. Since these read and conversion operations must be
+done in the C code anyway the additional effort to save their results
+in the HDF5 container are justified by any time that can be saved by
+reusing the HDF5 data rather than rerunning the read and conversion
+operations from plain text.
 
 BUILDING
 
@@ -60,4 +77,5 @@ verify that the load was completed without error.
 Protein Sequences.txt are identical
 
  LocalWords: NCBI parallelization HDF SQL Pellegrino phylogenetic DBMS dia mpi
- LocalWords: autogen Autotools CPPFLAGS aa dat HDFView GUI diff txt
+ LocalWords: autogen Autotools CPPFLAGS aa dat HDFView GUI diff txt exp pre
+ LocalWords: datatype struct
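
The README text added in this change argues that a custom C program has to
parse the ASCII text into a native struct anyway, so storing the parsed
structs in a binary container saves time on later runs. A minimal sketch of
that reasoning follows. It rests on several assumptions that do not come from
the repository: the record layout `seq_record_t`, its field names, the
tab-separated input format, and the three function names are all hypothetical.
A plain binary file written with `fwrite` stands in for the HDF5 container so
that the sketch needs only the C standard library. With HDF5, the same struct
would instead be described once as a compound datatype and reused for both
writing and reading.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record layout shared by the loader and the analysis code.
 * The field names are illustrative only. */
typedef struct {
    int  year;           /* collection year */
    int  length;         /* sequence length */
    char accession[16];  /* NUL-terminated identifier */
} seq_record_t;

/* Text path: convert one "accession<TAB>year<TAB>length" line into the
 * native struct. Returns 0 on success, -1 if the line does not parse. */
int parse_record(const char *line, seq_record_t *rec)
{
    assert(line != NULL && rec != NULL);
    memset(rec, 0, sizeof *rec);  /* zero padding so stored bytes are stable */
    if (sscanf(line, "%15[^\t]\t%d\t%d",
               rec->accession, &rec->year, &rec->length) != 3)
        return -1;
    return 0;
}

/* Save already-parsed records in native binary form (stand-in for the
 * HDF5 write step). Returns 0 on success, -1 on any I/O error. */
int save_records(const char *path, const seq_record_t *recs, size_t n)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    size_t written = fwrite(recs, sizeof *recs, n, fp);
    if (fclose(fp) != 0 || written != n)
        return -1;
    return 0;
}

/* Binary path: reload records with no text conversion at all.
 * Returns the number of records read, or -1 if the file cannot be opened. */
long load_records(const char *path, seq_record_t *recs, size_t max)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    size_t got = fread(recs, sizeof *recs, max, fp);
    fclose(fp);
    return (long)got;
}
```

A first run would call `parse_record` on each input line and then
`save_records` once. Later runs would call only `load_records` and skip the
text conversion. Unlike HDF5, a raw `fwrite` dump is not portable across
machines with different struct padding or byte order, which is one reason a
self-describing container is attractive for sharing the data.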