10. Software Management#
In addition to writing a program for each computing task, knowledge and skills are needed for designing and managing the entire data analysis or simulation procedure, testing and revising the code, and sharing the data and tools among collaborators.
Unlike commercial software development, research computing often starts from a simple exploratory code created by a single researcher. However, even for a single-person project, it is beneficial to follow standard practices in software development, because:
If your project is successful, it will be carried on by other members of your lab or by the research community worldwide.
After a few months, you yourself may not remember where you put which file, what a file contains, or why you wrote a piece of code.
References:#
Wilson G, et al. (2017). Good enough practices in scientific computing. PLoS Comput Biol, 13(6): e1005510. https://doi.org/10.1371/journal.pcbi.1005510
Hart EM, et al. (2016). Ten simple rules for digital data storage. PLoS Comput Biol, 12: e1005097. https://doi.org/10.1371/journal.pcbi.1005097
Ouellette F, et al. (2018). A FAIR guide for data providers to maximise sharing of human genomic data. PLoS Comput Biol, 14: e1005873. https://doi.org/10.1371/journal.pcbi.1005873
Eke DO, et al. (2021). International data governance for neuroscience. Neuron. https://doi.org/10.1016/j.neuron.2021.11.017
Coding Style#
In writing programs, keep the following in mind (a sketch illustrating these points follows the list):
Make them modular and avoid duplicated code.
Give an explanation at the beginning of each file/function.
Use file/function/variable names that you can comprehend a year later.
Never write a numeric parameter directly in an equation; define it as a variable/argument.
Add comments to significant parts/lines of the code.
Use if-else branching rather than commenting/uncommenting code to switch between different modes of operation.
Verify your code with a simple input for which the correct output is known.
Prepare documentation even before somebody asks you, as you yourself will need it after a few months.
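A minimal sketch illustrating these points (the file, function, and parameter names here are hypothetical, and NumPy is assumed to be installed):

```python
"""smooth.py: moving-average smoothing of a 1-D signal (illustrative example)."""
import numpy as np

# Named constant instead of a number buried in the equations below.
DEFAULT_WINDOW = 5

def moving_average(x, window=DEFAULT_WINDOW):
    """Return the moving average of 1-D array x over the given window."""
    kernel = np.ones(window) / window   # uniform weights summing to one
    return np.convolve(x, kernel, mode="same")

if __name__ == "__main__":
    # Verify with a simple input whose correct output is known:
    # a constant signal should remain constant away from the edges.
    x = np.ones(100)
    assert np.allclose(moving_average(x)[5:-5], 1.0)
```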
In some projects, all you need is to download pre-existing tools and apply them to the data. Even in that case, it is better to record the procedure as a script for:
avoiding/detecting manual errors
reproducibility of the result
re-analysis with new data, tool, or parameters
Scripting#
On Unix-like systems, the common way is a shell script, which is a file containing a series of commands you would type into a terminal.
For more elaborate processing, a Python script is often preferred.
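A minimal sketch of such a script (the folder layout is hypothetical, and "convert_tool" stands in for whatever pre-existing tool you apply to the data):

```python
"""run_analysis.py: record the whole procedure as a script,
so it can be re-run on new data or with new parameters."""
import subprocess
from pathlib import Path

RAW = Path("data/raw")          # hypothetical folder layout
OUT = Path("data/processed")
OUT.mkdir(parents=True, exist_ok=True)

for infile in sorted(RAW.glob("*.dat")):
    outfile = OUT / (infile.stem + ".csv")
    # Running the tool from a script avoids, and documents, manual steps.
    subprocess.run(["convert_tool", str(infile), "-o", str(outfile)],
                   check=True)
```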
Pipeline Tools#
Once your procedure for data processing is determined, such as filtering, visualization, and statistical tests, the sequence should be defined as a script with folders, file names, and parameters.
A classic way on Unix-like systems is a shell script, but Python is also commonly used for data-processing scripts. There are also dedicated packages for data-processing pipelines, such as the following (see the sketch after the list):
scikit-learn.pipeline: https://scikit-learn.org/stable/modules/compose.html
Luigi: https://github.com/spotify/luigi
Prefect: https://www.prefect.io
Snakemake: https://snakemake.github.io
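As a concrete example, scikit-learn's Pipeline chains preprocessing and model fitting into a single object, so the whole procedure is defined in one place (a minimal sketch, assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Each step is (name, transformer/estimator); the pipeline
# applies them in order when fitting and predicting.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=200)),
])
pipe.fit(X, y)
print(pipe.score(X, y))  # training accuracy
```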
Software/Data Licenses#
Today, increasingly more journals and funding agencies request that you make your data and programs publicly accessible for:
reproducibility of research results
meta-analysis
reuse of data and programs
You should set appropriate conditions when making your data or programs public, both to facilitate their use and to protect your (and your organization's) intellectual property. Points to consider in making your data/programs public include:
copyright
acknowledgement
revision
re-distribution
commercial use
It is also important to know the licenses of the software you use for your development, as that can limit the way you can use/distribute your programs.
Creative Commons#
Creative Commons (https://creativecommons.org) is an emerging standard that combines three aspects:
Attribution (BY): requires acknowledgement, e.g., citing a paper
NonCommercial (NC): no commercial use
ShareAlike (SA) or NoDerivs (ND): modified versions must be shared under the same license (SA), or no modified versions may be distributed (ND)
See https://creativecommons.org/licenses/?lang=en for typical combinations.
GPL, BSD, MIT, Apache, etc.#
In the open-source software community, several types of licenses are commonly used:
GNU General Public License (GPL): redistribution requires that the source code be made accessible under the same license; this is called copyleft.
BSD and MIT licenses: do not require source-code access or inheritance of the same license.
Apache License: a permissive license similar to BSD/MIT, with an explicit grant of patent rights.
Public Domain (CC0): no copyright asserted; free to use/modify/distribute.
See https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licenses for further details.
Data Management#
Most research starts with obtaining raw data, continues with a series of pre-processing, visualization, and analysis steps, and is completed with paper writing. Handling all the different files without confusion or corruption takes good planning and habits:
Keep the raw data and metadata, and back them up.
Store data in the way you would want to see them when receiving them.
Record all the steps of processing, preferably as a script.
Always Backup#
As soon as you obtain data, don’t forget to take a backup with appropriate documentation.
For small-scale data, Dropbox is an easy solution for backup and sharing.
At OIST, for storing large scale data, you can use the bucket drive. See: https://groups.oist.jp/it/research-storage
On Unix-like systems, rsync is the basic command for taking a backup of a folder. It supports incremental backup by searching for new or changed files in the folder and copying only those.
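For example, the following sketch invokes rsync from Python, for consistency with the other examples here (the paths are hypothetical; normally you would type the same rsync command directly in a terminal):

```python
import subprocess

SRC = "/home/me/project/data/"    # hypothetical source (trailing slash: copy contents)
DST = "/mnt/backup/project/data"  # hypothetical backup destination

# -a: archive mode (recurse, preserve times and permissions); -v: verbose.
# rsync transfers only files that are new or changed since the last run,
# so repeated runs give an incremental backup.
subprocess.run(["rsync", "-av", SRC, DST], check=True)
```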
Data sharing#
Data sharing is an important emerging issue in the scientific community, as science today is becoming more and more data intensive. In the good old days, each researcher did an experiment, gathered data, wrote a paper, and that was the end of the story. Nowadays, a single experiment can produce gigabytes to terabytes of data, far more than one researcher can analyze alone. We need efficient and reliable methods to share data within each lab, across collaborating labs, and with the entire research community.
Data Governance#
When making data public, especially human subject data, great care has to be taken to protect privacy. In general:
data should be anonymized so that the subjects' identities cannot be obtained or inferred, as sketched below.
prior consent must be obtained from the subjects regarding the way their data are made public.
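One common element of anonymization is replacing subject identifiers with one-way codes, e.g., salted hashes (a minimal sketch; the ID format is hypothetical, and real anonymization requires much more, such as removing indirect identifiers):

```python
import hashlib
import secrets

SALT = secrets.token_hex(16)  # store separately from the data and keep secret

def anonymize_id(subject_id: str) -> str:
    """Replace a subject ID with a one-way salted hash."""
    return hashlib.sha256((SALT + subject_id).encode()).hexdigest()[:12]

print(anonymize_id("subject-042"))  # hypothetical ID -> e.g. '3f2a91c0b7d4'
```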
Metadata#
Metadata is data about data. It usually includes:
Time and date of creation
Creator or author of the data
Method for creating the data
File format
File size
…
Different research communities have their own metadata standards, such as:
ISO-TC211 for geographic data: https://www.isotc211.org
ISA for biomedical data: https://www.isacommons.org
Following such a standard can help you use common data-processing tools, and help your data be found and utilized by more people.
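Even without a formal standard, recording basic metadata in a machine-readable sidecar file is useful. A minimal sketch (the field values and file name are hypothetical):

```python
import json
from datetime import date

metadata = {
    "created": date.today().isoformat(),
    "creator": "Your Name",
    "method": "two-photon imaging, protocol v1.2",  # hypothetical
    "format": "text/csv",
    "size_bytes": 1048576,
}
with open("session01.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```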
Data File Formats#
It is always better to save your data in a common file format so that they can be read by many data processing tools.
CSV, TSV#
Values separated by commas or tabs, in multiple lines like a table. These formats are still commonly used for their simplicity.
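In Python, the standard csv module reads and writes these formats (a minimal sketch; the file name and values are made up):

```python
import csv

# Write a small table
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["trial", "score"])  # header row
    writer.writerow([1, 0.82])
    writer.writerow([2, 0.91])

# Read it back; use csv.reader(f, delimiter="\t") for TSV files
with open("results.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)
```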
XML#
https://www.xml.org Keys and values are stored in a format similar to HTML. Often used to store metadata.
JSON#
https://www.json.org/json-en.html Common in exchanging large data with multiple components.
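Python's built-in json module converts between nested dictionaries/lists and JSON text (a minimal sketch with made-up values):

```python
import json

record = {"subject": "S01", "trials": [0.82, 0.91], "valid": True}
text = json.dumps(record)      # Python object -> JSON string
restored = json.loads(text)   # JSON string -> Python object
assert restored == record
```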
HDF5#
https://www.hdfgroup.org Hierarchical data format that can also store binary data.
Some domain-specific data formats are based on HDF5, such as Neurodata Without Borders (NWB): https://www.nwb.org
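With the h5py package (assuming it is installed, along with NumPy), datasets are stored in named groups inside one file, with metadata attached as attributes (a minimal sketch; the file and group names are hypothetical):

```python
import numpy as np
import h5py

data = np.random.randn(1000, 3)  # hypothetical recording

with h5py.File("session01.h5", "w") as f:
    grp = f.create_group("raw")
    grp.create_dataset("traces", data=data)
    grp.attrs["sampling_rate_hz"] = 100.0  # metadata as attributes

with h5py.File("session01.h5", "r") as f:
    print(f["raw/traces"].shape, f["raw"].attrs["sampling_rate_hz"])
```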