# 10. Software Management

In addition to writing a program for each computing task, knowledge and skills are needed for designing and managing the entire data analysis or simulation procedure, testing and revising the code, and sharing the data and tools among collaborators.

Unlike commercial software development, research computing often starts from a simple exploratory code created by a single researcher. However, even for a single-person project, it is beneficial to follow the standard practices in software development because:
* If your project is successful, it will be taken over by other members of your lab or the research community worldwide.
* After a few months, you will not remember where you put which file, what each file was for, or why you wrote the code the way you did.

#### Reference:
* Greg Wilson, et al. (2017). Good enough practices in scientific computing. PLOS Computational Biology, 13(6): e1005510 (https://doi.org/10.1371/journal.pcbi.1005510)

## Coding Style

In writing programs, keep in mind:
* Make them *modular* and avoid duplicated code.
* Give an explanation at the beginning of each file/function.
* Use file/function/variable names that you can comprehend a year later.
* Never write a numeric parameter directly in an equation; define it as a variable/argument.
* Add comments to significant parts/lines of code.
* Replace commenting/uncommenting of lines with `if-else` to switch between modes of operation.
* Verify your code with a simple input for which the correct output is known.
* Prepare documentation even before somebody asks you, as you yourself will need it after a few months.
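
As a minimal sketch of several of these points, the following integrates a linear ODE by the Euler method. The function names `decay` and `euler`, the step size, and the tolerance are illustrative choices, not part of any code used elsewhere in this chapter. The step size and the verbose mode are arguments rather than numbers or commented-out lines, and the result is checked against the known analytic solution.

```python
import math


def decay(y0, rate, t):
    """Analytic solution y(t) = y0 * exp(rate * t) of dy/dt = rate * y."""
    return y0 * math.exp(rate * t)


def euler(y0, rate, t_end, dt=0.01, verbose=False):
    """Integrate dy/dt = rate * y from t=0 to t_end by the Euler method.

    dt and verbose are arguments, not numbers or commented-out
    lines buried in the body.
    """
    n_steps = round(t_end / dt)
    y = y0
    for i in range(n_steps):
        y += dt * rate * y
        if verbose:  # a mode switch instead of comment/uncomment
            print(f"t={(i + 1) * dt:.3f}  y={y:.4f}")
    return y


# Verify with a simple input for which the correct output is known
y_num = euler(y0=1.0, rate=-0.1, t_end=10.0, dt=0.001)
y_exact = decay(y0=1.0, rate=-0.1, t=10.0)
assert abs(y_num - y_exact) < 1e-3
```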

In some projects, all you need is to download pre-existing tools and apply them to the data. Even in that case, it is better to record the procedure as a *script* for
* avoiding/detecting manual errors
* reproducibility of the result
* re-analysis with new data, tools, or parameters

### Scripting
On Unix-like systems, the common way is a *shell script*, which is a file containing a series of commands you would type into a terminal.

For more elaborate processing, a Python script is often preferred.
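
The following is a minimal sketch of such a Python script, under stated assumptions: the function `process` is only a stand-in for whatever external tool you would apply, and the folder names, file pattern, and threshold are hypothetical. The point is that all settings sit in one place and that the script writes a log of what was done with which parameters.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# All settings in one place (hypothetical names and values)
PARAMS = {"threshold": 0.5, "pattern": "*.txt"}


def process(text, threshold):
    """Stand-in for an external tool: keep the values above threshold."""
    values = [float(v) for v in text.split()]
    return [v for v in values if v > threshold]


def run(in_dir, out_dir, params):
    """Apply process() to every matching file and log what was done."""
    out_dir.mkdir(parents=True, exist_ok=True)
    log = {"params": params, "files": []}
    for path in sorted(in_dir.glob(params["pattern"])):
        kept = process(path.read_text(), params["threshold"])
        (out_dir / path.name).write_text(" ".join(map(str, kept)))
        log["files"].append({"name": path.name, "n_kept": len(kept)})
    (out_dir / "log.json").write_text(json.dumps(log, indent=2))
    return log


# Demo with toy data in a temporary folder
with TemporaryDirectory() as tmp:
    in_dir = Path(tmp) / "raw"
    in_dir.mkdir()
    (in_dir / "a.txt").write_text("0.1 0.7 0.9")
    (in_dir / "b.txt").write_text("0.2 0.3")
    log = run(in_dir, Path(tmp) / "processed", PARAMS)
    print(log["files"])
```

Re-running the analysis with new data or a different threshold then only requires changing `PARAMS` or the input folder, and the log records which settings produced which output.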

## Version Control System
Software development is a repetition of coding, testing, and improving. A version control system (VCS) allows
* parallel development of parts and their re-integration
* tracing back to previous versions when a problem is detected

## Git

The most popular VCS today is *Git*, created by Linus Torvalds for developing Linux. 

After creating/editing your files, you *stage* the ones to be recorded and then *commit* them as a new version.

![commit](figures/git_commit.png)

If `git` has not been installed, follow one of these instructions to install it.

Mac: 
* Install [Xcode](https://developer.apple.com/jp/xcode/) from the *App Store*
* or install *Xcode Command Line Tools* by `xcode-select --install`
* or install [Homebrew](https://brew.sh) and run `brew install git`

Windows: 
* Install [Git for Windows](https://gitforwindows.org)

Detailed documentation can be found at https://git-scm.com/docs

## odesim

As an example of version control, let us take a simple ODE simulator, *odesim*.

In [None]:
%ls odesim

In [1]:
%cd odesim

/Users/doya/Dropbox (OIST)/Python/ComputationalMethods/odesim


In [None]:
%cat odesim.py

Here is an example of its usage.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import odesim
import importlib
#importlib.reload(odesim)  # when odesim.py is updated

In [None]:
sim = odesim.odesim('first')

In [None]:
sim.run()

In [None]:
sim2 = odesim.odesim('second')

In [None]:
sim2.run()

### Starting a repository

Let us go to your working folder and start a new repository by `git init`

In [None]:
%pwd

In [2]:
!git init

Initialized empty Git repository in /Users/doya/Dropbox (OIST)/Python/ComputationalMethods/odesim/.git/


This creates a hidden folder `.git` for bookkeeping.

In [3]:
%ls -a

./         ../        .DS_Store  .git/      first.py   odesim.py  second.py


In [4]:
# The contents of .git folder
%ls .git

HEAD         description  info/        refs/
config       hooks/       objects/


You can check the status by `git status`

In [5]:
!git status

On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.DS_Store
	first.py
	odesim.py
	second.py

nothing added to commit but untracked files present (use "git add" to track)


### Staging and Committing files

Use `git add` to add files for tracking.

And then `git commit` to save a version.

![commit](figures/git_commit.png)

In [6]:
!git add *.py
!git status

On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   first.py
	new file:   odesim.py
	new file:   second.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.DS_Store



Register the current version by `git commit` with a message by `-m`.

In [7]:
!git commit -m "first version"
!git status

[main (root-commit) 9a1b45a] first version
 3 files changed, 83 insertions(+)
 create mode 100644 first.py
 create mode 100644 odesim.py
 create mode 100644 second.py
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.DS_Store

nothing added to commit but untracked files present (use "git add" to track)


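You can list files that need not be tracked in a `.gitignore` file. Here, `.*` excludes hidden files (including `.gitignore` itself), `__*` excludes names starting with two underscores such as `__pycache__`, and `*~` excludes editor backup files.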

In [8]:
!echo '.*\n__*\n*~\n' > .gitignore
!cat .gitignore
!git status

.*
__*
*~

On branch main
nothing to commit, working tree clean


### Registering changes
After editing a file, you can register a new version by `git commit`.

Try changing a parameter or initial state, e.g., in `first.py`.

In [10]:
!git status

On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   first.py

no changes added to commit (use "git add" and/or "git commit -a")


Use `git add` to stage updated files.

In [11]:
!git add first.py
!git status

On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	modified:   first.py



And then `git commit` the changes.

In [12]:
!git commit first.py -m "first.py updated"
!git status

[main 8b01220] first.py updated
 1 file changed, 1 insertion(+), 1 deletion(-)
On branch main
nothing to commit, working tree clean


You can see what was changed by `git show`.

In [13]:
!git show

commit 8b01220bd8dfe78f71297e62e962758e4977caa4 (HEAD -> main)
Author: Kenji Doya <doya@oist.jp>
Date:   Tue Dec 6 12:24:17 2022 +0900

    first.py updated

diff --git a/first.py b/first.py
index 11d4ee4..3549675 100644
--- a/first.py
+++ b/first.py
@@ -13,4 +13,4 @@ def dynamics(y, t, a):
 parameters = [-0.1, 0]
 
 # Default initial state
-initial_state = 1
+initial_state = -1


You can check the revision history by `git log`

In [14]:
!git log

commit 8b01220bd8dfe78f71297e62e962758e4977caa4 (HEAD -> main)
Author: Kenji Doya <doya@oist.jp>
Date:   Tue Dec 6 12:24:17 2022 +0900

    first.py updated

commit 9a1b45a6daceb403255b30febd3b0dc8fb3f58ee
Author: Kenji Doya <doya@oist.jp>
Date:   Tue Dec 6 12:22:25 2022 +0900

    first version


### Branch

You can create a new *branch* and *checkout* a particular branch.

In [15]:
!git branch myBranch
!git checkout myBranch

Switched to branch 'myBranch'


Make a change, e.g., editing `second.py`.

And then `git add` and `git commit`.

In [16]:
!git add second.py
!git commit -m "second.py updated"
!git status

[myBranch 9b97d69] second.py updated
 1 file changed, 1 insertion(+), 1 deletion(-)
On branch myBranch
nothing to commit, working tree clean


In [17]:
!git show

commit 9b97d69bb5a0872bf342a2cba9e53a6ec6c5f590 (HEAD -> myBranch)
Author: Kenji Doya <doya@oist.jp>
Date:   Tue Dec 6 12:24:44 2022 +0900

    second.py updated

diff --git a/second.py b/second.py
index 785645a..2c8638b 100644
--- a/second.py
+++ b/second.py
@@ -15,4 +15,4 @@ def dynamics(y, t, a):
 parameters = [-0.1, -1, 0]
 
 # Default initial state
-initial_state = [1, 0]
+initial_state = [-1, 0]


In [18]:
!git log --all --graph

* commit 9b97d69bb5a0872bf342a2cba9e53a6ec6c5f590 (HEAD -> myBranch)
| Author: Kenji Doya <doya@oist.jp>
| Date:   Tue Dec 6 12:24:44 2022 +0900
| 
|     second.py updated
| 
* commit 8b01220bd8dfe78f71297e62e962758e4977caa4 (main)
| Author: Kenji Doya <doya@oist.jp>
| Date:   Tue Dec 6 12:24:17 2022 +0900
| 
|     first.py updated
| 
* commit 9a1b45a6daceb403255b30febd3b0dc8fb3f58ee
  Author: Kenji Doya <doya@oist.jp>
  Date:   Tue Dec 6 12:22:25 2022 +0900
  
      first version


You can list the branches by `git branch`; the current branch is marked by `*`.

In [20]:
!git branch

  main
* myBranch


You can go back to a previous branch by `git checkout`.

In [21]:
!git checkout main
!git log --all --graph

Switched to branch 'main'
* commit 9b97d69bb5a0872bf342a2cba9e53a6ec6c5f590 (myBranch)
| Author: Kenji Doya <doya@oist.jp>
| Date:   Tue Dec 6 12:24:44 2022 +0900
| 
|     second.py updated
| 
* commit 8b01220bd8dfe78f71297e62e962758e4977caa4 (HEAD -> main)
| Author: Kenji Doya <doya@oist.jp>
| Date:   Tue Dec 6 12:24:17 2022 +0900
| 
|     first.py updated
| 
* commit 9a1b45a6daceb403255b30febd3b0dc8fb3f58ee
  Author: Kenji Doya <doya@oist.jp>
  Date:   Tue Dec 6 12:22:25 2022 +0900
  
      first version


You can merge another branch into the current branch by `git merge`.

In [22]:
!git merge myBranch
!git log --all --graph

Updating 8b01220..9b97d69
Fast-forward
 second.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
* commit 9b97d69bb5a0872bf342a2cba9e53a6ec6c5f590 (HEAD -> main, myBranch)
| Author: Kenji Doya <doya@oist.jp>
| Date:   Tue Dec 6 12:24:44 2022 +0900
| 
|     second.py updated
| 
* commit 8b01220bd8dfe78f71297e62e962758e4977caa4
| Author: Kenji Doya <doya@oist.jp>
| Date:   Tue Dec 6 12:24:17 2022 +0900
| 
|     first.py updated
| 
* commit 9a1b45a6daceb403255b30febd3b0dc8fb3f58ee
  Author: Kenji Doya <doya@oist.jp>
  Date:   Tue Dec 6 12:22:25 2022 +0900
  
      first version


## GitHub

*GitHub* is currently the most popular cloud service for sharing software. It is free for open-source software.

This is a good platform for sharing programs, or in some cases text data and manuscripts, among collaborators. It is also helpful for a single-person project, for succession by a future member of your lab, for open access after publication, or for yourself after some time.

These are typical steps in contributing to a project on GitHub.
* Join as a member of a repository.
* Copy the existing files and see how they work.
* Make a new *branch* and add or modify the code.
* After testing locally, *commit* the new version.
* Open a *pull request* for other members to test your revision.
* Your pull request is merged into the *main* branch.

![from Hello World](https://docs.github.com/assets/cb-23923/images/help/repository/branching.png)

See "Hello World" in GitHub Guide for details (https://guides.github.com).

### Cloning a repository

If you just want to use a copy of stable software and are not going to contribute your changes, downloading a zip file is fine.

But if you want to contribute to joint development, or keep up with updates, `git clone` is the better way.

### Cloning ComputationalMethods repository

To download a copy of the repository, run

```git clone git@github.com:oist/ComputationalMethods2022.git```

You will be asked to enter the passphrase you set when creating your SSH key.

This should create a folder `ComputationalMethods2022`.

In [None]:
%pwd

In [None]:
!git clone git@github.com:oist/ComputationalMethods2022.git

In [None]:
%ls

Move into the folder and test the `odesim.py` program.

In [None]:
%cd ComputationalMethods2022

In [None]:
%ls

From the console you can run interactively after reading the module as:

`python -i odesim.py`

`sim = odesim('first')`

`sim.run()`

In [None]:
from odesim import *

In [None]:
sim = odesim('first')

In [None]:
sim.run()

### Your branch

Now make your own branch, check it out, and add your own ODE module.

In [None]:
!git branch myname
!git checkout myname

Make a copy of a dynamics file `first.py` or `second.py`, implement your own ODE, and save with a new name, e.g. `vdp.py`.
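
As a sketch of what such a file might contain, here is a hypothetical `vdp.py` for the van der Pol oscillator. It assumes the module layout visible in the `git show` output earlier in this chapter: a function `dynamics(y, t, a)` plus module-level `parameters` and `initial_state`. It further assumes that `a` receives the whole `parameters` list; check `first.py` and `second.py` in your clone for the actual convention before relying on this.

```python
# vdp.py (hypothetical): van der Pol oscillator
#   dx/dt = v
#   dv/dt = mu*(1 - x**2)*v - x


def dynamics(y, t, a):
    """Time derivative of the state y = [x, v]; t is unused.

    Assumption: a is the list of parameters, here a = [mu].
    """
    x, v = y
    mu = a[0]
    return [v, mu * (1 - x**2) * v - x]


# Default parameters: [mu]
parameters = [1.0]

# Default initial state: [x, v]
initial_state = [1, 0]
```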

Run odesim and confirm that your ODE runs appropriately.

Then you can add and commit your change.

In [None]:
!git status

In [None]:
!git add vdp.py

In [None]:
!git commit -m "adding my model vdp.py"

In [None]:
!git log --graph --oneline --all

Now push your branch to the GitHub repository by, e.g.

`git push origin myname` 

In [None]:
!git push origin myname

Check the status on GitHub:
https://github.com/oist/ComputationalMethods2022

and make a pull request for the repository administrator to check your updates.

The administrator may reply back with a comment for revision or merge your change to the main branch.

### Pulling updates
While you are working on your local code, the code in the original repository may be updated. You may also want to check the branches other people have created.

You can use `git pull` to bring the changes on GitHub into your local repository.

You can use `git branch` to see which branches exist and `git checkout` to try the code in other branches.

In [None]:
!git pull

In [None]:
!git branch

(Optional) In addition to adding a new module, you are welcome to improve the main program `odesim.py` itself. For example,

* add other visualizations, such as a phase plot.

* fix any bugs or improve error handling.

* add documentation.

* ...

## Software/Data Licenses
Today, more and more journals and agencies request that you make the data and programs publicly accessible in order to
* ensure reproducibility of research results
* enable meta-analysis
* facilitate reuse of data and programs

You should set appropriate conditions when making your data or program public, to facilitate their use and to protect your (and your organization's) intellectual property. Points of consideration in making your data/programs public include:
* copyright
* acknowledgement
* revision
* re-distribution
* commercial use

It is also important to know the licenses of the software you use for your development, as that can limit the way you can use/distribute your programs.

### Creative Commons

Creative Commons (https://creativecommons.org) is a widely used standard that combines three aspects:

* Attribution (BY): request acknowledgement, e.g., citing a paper

* NonCommercial (NC): no commercial use

* ShareAlike (SA) or NoDerivatives (ND): SA allows modified versions to be re-distributed only under the same license; ND does not allow distribution of modified versions

See https://creativecommons.org/licenses/?lang=en for typical combinations.

### GPL, BSD, MIT, Apache, etc.
In the open-source software community, several types of licenses are commonly used:
* GNU General Public License (GPL): redistribution requires access to the source code under the same license. This is called *copyleft*.
* BSD and MIT licenses: do not require source code access or succession of the same license; only the copyright notice and license text must be retained.
* Apache License: permissive like BSD and MIT; it requires keeping the license and notices, and adds an explicit patent grant.
* Public Domain (CC0): no copyright claimed. Free to use/modify/distribute.

See https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licenses for further details.

## Data Management
Most research starts with obtaining *raw* data, continues with a series of pre-processing, visualization, and analysis steps, and completes with paper writing. Handling all the different files without confusion or corruption takes some careful thought and good habits.
* Keep the raw data and *metadata*, and make backups.
* Store data in the form you would wish to receive them in.
* Record all the steps of processing, preferably with a script.

### References:

* Hart EM, et al. (2016). Ten simple rules for digital data storage. PLoS Comput Biol, 12, e1005097. https://doi.org/10.1371/journal.pcbi.1005097

* Ouellette F, et al. (2018). A FAIR guide for data providers to maximise sharing of human genomic data. PLoS Comput Biol, 14. https://doi.org/10.1371/journal.pcbi.1005873

* Eke DO, Bernard A, Bjaalie JG, Chavarriaga R, Hanakawa T, Hannan AJ, Hill SL, Martone ME, McMahon A, Ruebel O, Crook S, Thiels E, Pestilli F (2021). International data governance for neuroscience. Neuron. https://doi.org/10.1016/j.neuron.2021.11.017

### Always Backup

As soon as you obtain data, don't forget to take a backup with appropriate documentation.

For small-scale data, *Dropbox* is an easy solution for data backup and sharing.

At OIST, for storing large scale data, you can use the *bucket* drive. See:
https://groups.oist.jp/it/research-storage

On a Unix-like system, `rsync` is the basic command for backing up a folder.
It supports incremental backup: it looks for new or changed files in the folder and copies only those.
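
To illustrate the idea of incremental backup, here is a simplified Python imitation, not a replacement for `rsync`. The function name `incremental_backup` and the folder names are hypothetical. It copies a file only if it is missing from the backup folder or older there, judged by modification time alone.

```python
import shutil
from pathlib import Path
from tempfile import TemporaryDirectory


def incremental_backup(src, dst):
    """Copy files from src to dst if missing or older there.

    Returns the relative paths of the files that were copied.
    """
    copied = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(src)
        target = dst / rel
        if not target.exists() or target.stat().st_mtime < path.stat().st_mtime:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 also copies the modification time
            copied.append(rel.as_posix())
    return copied


# Demo with toy files in a temporary folder
with TemporaryDirectory() as tmp:
    src, dst = Path(tmp) / "data", Path(tmp) / "backup"
    (src / "session1").mkdir(parents=True)
    (src / "session1" / "raw.csv").write_text("t,y\n0,1\n")
    first = incremental_backup(src, dst)   # copies the existing file
    second = incremental_backup(src, dst)  # nothing new to copy
    (src / "notes.txt").write_text("added later")
    third = incremental_backup(src, dst)   # copies only the new file
    print(first, second, third)
```

This sketch never deletes anything from the backup and does not detect files whose content changed without a newer timestamp; for real data, a tested tool is the safer choice.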

## Data sharing

Data sharing is an important emerging issue in the scientific community, as science today is becoming more and more data intensive.
In the good old days, each researcher did an experiment, gathered data, wrote a paper, and that was the end of the story.
Nowadays, each experiment can produce gigabytes to terabytes of data, which is far more than a single researcher can analyze alone.
We need efficient and reliable methods to share data within each lab, across collaborating labs, and with the entire research community.

### Data Governance

When making data public, especially human subject data, great care must be taken to protect privacy. In general:

* data should be anonymized so that the identity of subject cannot be obtained or inferred.

* prior consent must be obtained from the subject regarding the way their data are made public.
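
As a first step toward the anonymization mentioned above, the sketch below replaces subject names with random codes and drops a field that identifies the subject directly. The function name `pseudonymize`, the field names, and the toy records are all hypothetical. Strictly speaking this is pseudonymization: the remaining fields may still allow identity to be inferred, so it does not by itself satisfy the requirement above.

```python
import secrets


def pseudonymize(records, id_field="name", drop_fields=("birth_date",)):
    """Replace id_field by a random code and drop direct identifiers.

    Returns the cleaned records and the name-to-code table, which must
    be stored separately and securely, or destroyed.
    """
    codes = {}
    cleaned = []
    for rec in records:
        key = rec[id_field]
        if key not in codes:
            code = "S" + secrets.token_hex(4)
            while code in codes.values():  # make sure codes are unique
                code = "S" + secrets.token_hex(4)
            codes[key] = code
        out = {k: v for k, v in rec.items()
               if k != id_field and k not in drop_fields}
        out["subject"] = codes[key]
        cleaned.append(out)
    return cleaned, codes


# Toy records, invented for illustration
records = [
    {"name": "Alice", "birth_date": "1990-01-01", "score": 3},
    {"name": "Bob", "birth_date": "1985-05-05", "score": 5},
    {"name": "Alice", "birth_date": "1990-01-01", "score": 4},
]
cleaned, codes = pseudonymize(records)
print(cleaned)
```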

### Metadata
*Metadata* is data about data. It usually includes:

* Time and date of creation
* Creator or author of the data
* Method for creating the data
* File format
* File size
* ...

Different research communities have their own standards of metadata, such as 

* ISO-TC211 for geographic data: https://www.isotc211.org

* ISA for biomedical data: https://www.isacommons.org

Following such a standard helps you use common data processing tools, and helps your data be found and utilized by more people.
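
When no community standard applies, even a simple metadata file saved next to the data is useful. The following sketch writes the items listed above into a JSON file; the function name `write_metadata`, the `.meta.json` suffix, and the key names are illustrative assumptions, not a standard.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from tempfile import TemporaryDirectory


def write_metadata(data_path, creator, method):
    """Write <data file>.meta.json next to the data file; return the record."""
    stat = data_path.stat()
    meta = {
        "file": data_path.name,
        # the modification time is used as a proxy for the creation time
        "created": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "creator": creator,
        "method": method,
        "format": data_path.suffix.lstrip("."),
        "size_bytes": stat.st_size,
    }
    meta_path = data_path.with_name(data_path.name + ".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta


# Demo with a toy data file in a temporary folder
with TemporaryDirectory() as tmp:
    data_path = Path(tmp) / "trial01.csv"
    data_path.write_bytes(b"t,y\n0,1\n")
    meta = write_metadata(data_path, creator="A. Researcher",
                          method="odesim simulation")
    saved = json.loads((Path(tmp) / "trial01.csv.meta.json").read_text())
    print(saved)
```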


## Data File Formats
It is always better to save your data in a common file format so that they can be read by many data processing tools.

### CSV, TSV
Values separated by comma or tab, in multiple lines like a table.
These are still commonly used for simplicity.
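
A minimal sketch of writing and reading CSV with Python's standard `csv` module follows. The column names and values are made up, and an in-memory buffer stands in for a file. Passing `delimiter="\t"` to both the writer and the reader would give TSV instead.

```python
import csv
import io

rows = [{"t": 0.0, "y": 1.0}, {"t": 0.1, "y": 0.99}]

# Write a header line and one line per row
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["t", "y"])
writer.writeheader()
writer.writerows(rows)
text = buffer.getvalue()

# Read back; CSV stores everything as text, so convert to numbers
buffer.seek(0)
loaded = [{k: float(v) for k, v in row.items()}
          for row in csv.DictReader(buffer)]
print(loaded)
```

Note that CSV keeps no type information, so the reader has to know which columns are numeric.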

### XML
https://www.xml.org
Keys and values stored in a form similar to HTML.
Often used to store metadata.
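
As a small illustration, metadata can be written and parsed with Python's standard `xml.etree.ElementTree` module. The tag and attribute names below are invented for the example and do not follow any particular metadata standard.

```python
import xml.etree.ElementTree as ET

# Build a small metadata record (hypothetical tags and values)
root = ET.Element("dataset", name="odesim_run")
ET.SubElement(root, "creator").text = "A. Researcher"
ET.SubElement(root, "format").text = "CSV"
text = ET.tostring(root, encoding="unicode")
print(text)

# Parse it back and look up values by tag or attribute
parsed = ET.fromstring(text)
creator = parsed.find("creator").text
name = parsed.get("name")
```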

### JSON
https://www.json.org/json-en.html
Common in exchanging large data with multiple components.
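
A minimal sketch with Python's standard `json` module: nested dictionaries and lists map directly to JSON. The record below is illustrative; its `parameters` and `initial_state` values are borrowed from the `first.py` diff shown earlier, and the trajectory numbers are made up.

```python
import json

record = {
    "model": "first",
    "parameters": [-0.1, 0],
    "initial_state": 1,
    "trajectory": {"t": [0.0, 0.1], "y": [1.0, 0.99]},
}

text = json.dumps(record, indent=2)  # Python object -> JSON text
restored = json.loads(text)          # JSON text -> Python object
print(text)
```

Unlike CSV, JSON preserves the distinction between numbers and strings as well as the nesting of components.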

### HDF5
https://www.hdfgroup.org
Hierarchical data format that can also store binary data.

Some domain-specific data formats are based on HDF5, such as Neurodata Without Borders (NWB)  https://www.nwb.org

## Pipeline Tools
Once your procedures for data processing, such as filtering, visualization, and statistical tests, are determined, the sequence should be defined as a *script* with folders, filenames, and parameters.

A classic way on Unix-like systems is a *shell script*, but you can also use Python for data processing scripts. There are dedicated packages for data processing pipelines, such as:

* scikit-learn.pipeline: https://scikit-learn.org/stable/modules/compose.html
* Luigi: https://github.com/spotify/luigi
* Prefect: https://www.prefect.io
* Snakemake: https://snakemake.github.io
