10. Software Management#

In addition to writing a program for each computing task, knowledge and skills are needed for designing and managing the entire data analysis or simulation procedure, testing and revising the code, and sharing the data and tools among collaborators.

Unlike commercial software development, research computing often starts from a simple exploratory code created by a single researcher. However, even for a single-person project, it is beneficial to follow the standard practices in software development because:

  • If your project is successful, it will be taken over by other members of your lab or the research community worldwide.

  • After a few months, you will not remember where you put which file, what each file was for, or why you wrote a piece of code.

Coding Style#

In writing programs, keep in mind:

  • Make them modular and avoid duplicated code.

  • Give an explanation at the beginning of each file/function.

  • Use file/function/variable names that you can comprehend a year later.

  • Never write a numeric parameter in an equation; define it as a variable/argument.

  • Comment the significant parts/lines of the code.

  • Instead of commenting and uncommenting lines to switch between modes of operation, use if-else.

  • Verify your code with a simple input for which the correct output is known.

  • Prepare documentation even before somebody asks you, as you yourself will need that after a few months.
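Several of these points can be illustrated in a few lines. The following is only a minimal sketch: the function decay, its arguments, and the exponential decay example are hypothetical and chosen for illustration. It defines numeric parameters as arguments, switches modes with if-else, and verifies the function with inputs for which the correct output is known.

```python
"""Exponential decay example illustrating the coding style points above."""

import math


def decay(x0, rate, t, verbose=False):
    """Return x0*exp(-rate*t), the value of an exponentially decaying quantity.

    x0: initial value, rate: decay rate, t: elapsed time.
    verbose: if True, print the result (debug mode).
    """
    x = x0 * math.exp(-rate * t)
    # Switch modes by an argument, not by commenting lines in and out
    if verbose:
        print(f"decay: x0={x0}, rate={rate}, t={t} -> x={x}")
    return x


def test_decay():
    """Verify with simple inputs for which the correct output is known."""
    assert decay(1.0, 0.0, 5.0) == 1.0   # no decay when the rate is zero
    assert decay(2.0, 1.0, 0.0) == 2.0   # initial value at t=0
    # after one time constant (rate*t = 1), the value is x0/e
    assert math.isclose(decay(1.0, 0.5, 2.0), 1 / math.e)


if __name__ == "__main__":
    test_decay()
    decay(x0=1.0, rate=0.5, t=2.0, verbose=True)
```

Keeping the test next to the function makes it easy to re-run the check whenever the code is revised.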

In some projects, all you need is to download pre-existing tools and apply them to the data. Even in that case, it is better to record the procedure as a script for:

  • avoiding/detecting manual errors

  • reproducibility of the result

  • re-analysis with new data, tool, or parameters

Scripting#

On Unix-like systems, the common way is a shell script, which is a file containing a series of commands you would type into a terminal.

For more elaborate processing, a Python script is often preferred.

Pipeline Tools#

Once your data processing procedure is determined, such as filtering, visualization, and statistical tests, the sequence should be defined as a script that specifies the folders, file names, and parameters.
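As a rough sketch of what such a script might look like in Python, the example below runs two steps, filtering and summarizing, with all folders, file names, and parameters defined at the top. The folder names, file names, the column name value, and the threshold are hypothetical and should be replaced by those of your own project.

```python
"""Minimal data processing pipeline: filter values, then summarize them."""

import csv
import statistics
from pathlib import Path

# All folders, file names, and parameters are defined in one place (hypothetical)
DATA_DIR = Path("data")
RESULT_DIR = Path("results")
RAW_FILE = DATA_DIR / "raw.csv"
FILTERED_FILE = RESULT_DIR / "filtered.csv"
SUMMARY_FILE = RESULT_DIR / "summary.csv"
THRESHOLD = 0.0   # keep only values above this threshold


def filter_step(infile, outfile, threshold):
    """Step 1: copy rows whose 'value' column exceeds the threshold."""
    with open(infile, newline="") as fin, open(outfile, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if float(row["value"]) > threshold:
                writer.writerow(row)


def summary_step(infile, outfile):
    """Step 2: save the number of rows and the mean of the 'value' column."""
    with open(infile, newline="") as fin:
        values = [float(row["value"]) for row in csv.DictReader(fin)]
    mean = statistics.mean(values) if values else float("nan")
    with open(outfile, "w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(["n", "mean"])
        writer.writerow([len(values), mean])
    return len(values), mean


def run_pipeline(raw=RAW_FILE, filtered=FILTERED_FILE, summary=SUMMARY_FILE,
                 threshold=THRESHOLD):
    """Run all steps in order and return the summary (n, mean)."""
    Path(filtered).parent.mkdir(parents=True, exist_ok=True)
    Path(summary).parent.mkdir(parents=True, exist_ok=True)
    filter_step(raw, filtered, threshold)
    return summary_step(filtered, summary)


if __name__ == "__main__":
    if RAW_FILE.exists():
        print(run_pipeline())
    else:
        print(f"{RAW_FILE} not found; put the raw data there first.")
```

Because every intermediate result is written to a file, each step can be inspected or re-run separately when the data or the parameters change.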

A classic way on Unix-like systems is a shell script, but you can also use Python for data processing scripts. There are also dedicated packages for data processing pipelines.

Software/Data Licenses#

Today, more and more journals and agencies request that you make your data and programs publicly accessible in order to:

  • ensure reproducibility of research results

  • enable meta-analysis

  • facilitate reuse of data and programs

You should set appropriate conditions when making your data or programs public, to facilitate their use and to protect your (and your organization’s) intellectual property. Points to consider when making your data/programs public include:

  • copyright

  • acknowledgement

  • revision

  • re-distribution

  • commercial use

It is also important to know the licenses of the software you use for your development, as that can limit the way you can use/distribute your programs.

Creative Commons#

Creative Commons (https://creativecommons.org) is an emerging standard that uses combinations of three aspects:

  • Attribution (BY): request acknowledgement, e.g., citing a paper

  • NonCommercial (NC): no commercial use

  • ShareAlike (SA) or NoDerivs (ND): SA allows modification if the result is distributed under the same license; ND does not allow distribution of modified versions

See https://creativecommons.org/licenses/?lang=en for typical combinations.

GPL, BSD, MIT, Apache, etc.#

In the open software community, several types of licenses have been commonly used:

  • GNU General Public License (GPL): redistribution requires providing access to the source code under the same license. This is called copyleft.

  • BSD and MIT licenses: do not require source code access or inheritance of the same license.

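  • Apache License: permissive like the BSD and MIT licenses, but redistribution must include the license text and notices of changes, and the license includes an explicit patent grant.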

  • Public Domain (CC0): no copyright claimed. Free to use/modify/distribute.

See https://en.wikipedia.org/wiki/Comparison_of_free_and_open-source_software_licenses for further details.

Data Management#

Most research starts with obtaining raw data, continues with a series of pre-processing, visualization, and analysis steps, and completes with paper writing. Handling all the different files without confusion or corruption takes careful thought and good habits.

  • Keep the raw data and metadata, and back them up.

  • Store data in the form in which you would wish to receive them.

  • Record all the steps of processing, preferably with a script.

Always Backup#

As soon as you obtain data, don’t forget to take a backup with appropriate documentation.

For small-scale data, Dropbox is an easy solution for data backup and sharing.

At OIST, for storing large-scale data, you can use the bucket drive. See: https://groups.oist.jp/it/research-storage

On a Unix-like system, rsync is the basic command for backing up a folder. It has options for incremental backup, which searches for new files in the folder and copies only them.
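The idea of incremental backup can be sketched in Python as below. This is not rsync and not a replacement for it; it is only a minimal illustration that copies the files that are missing or older in the backup folder. The function name incremental_backup and the folder names are hypothetical.

```python
"""Incremental backup sketch: copy only new or modified files."""

import shutil
from pathlib import Path


def incremental_backup(src, dst):
    """Copy files under src to dst if they are missing or older in dst.

    Returns the list of copied files as paths relative to src.
    """
    src, dst = Path(src), Path(dst)
    copied = []
    for path in sorted(src.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(src)
        target = dst / rel
        # Copy if the backup does not exist or the source was modified later
        if not target.exists() or path.stat().st_mtime > target.stat().st_mtime:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)   # copy2 also preserves the timestamp
            copied.append(rel)
    return copied


if __name__ == "__main__":
    # 'data' and 'backup/data' are hypothetical folder names
    if Path("data").is_dir():
        for f in incremental_backup("data", "backup/data"):
            print("copied", f)
```

Running the function a second time without changing the source copies nothing, which is what makes repeated backups cheap.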

Data sharing#

Data sharing is an increasingly important issue in the scientific community, as science today is becoming more and more data intensive. In the good old days, each researcher did an experiment, gathered data, wrote a paper, and that was the end of the story. Nowadays, each experiment can produce gigabytes to terabytes of data, which is much more than a single researcher can analyze alone. We need efficient and reliable methods to share data within each lab, across collaborating labs, and with the entire research community.

Data Governance#

When making data public, especially human subject data, great care has to be taken to protect privacy. In general:

  • data should be anonymized so that the identity of subjects cannot be obtained or inferred.

  • prior consent must be obtained from the subjects regarding the way their data are made public.

Metadata#

Metadata is data about data. It usually includes:

  • Time and date of creation

  • Creator or author of the data

  • Method for creating the data

  • File format

  • File size

Different research communities have their own metadata standards.

Following such a standard helps you use common data processing tools, and helps your data to be found and utilized by more people.
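When no community standard applies, even a simple metadata file saved next to the data is helpful. The sketch below is one possible way, not a standard: the function name write_metadata, the field names, and the .meta.json suffix are assumptions made for illustration. It records the items listed above in a JSON file.

```python
"""Save basic metadata of a data file as a JSON file next to it."""

import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_metadata(datafile, creator, method):
    """Write <datafile>.meta.json and return the metadata dictionary."""
    datafile = Path(datafile)
    stat = datafile.stat()
    metadata = {
        # the last modification time is used as a stand-in for the creation time
        "created": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
        "creator": creator,
        "method": method,
        "format": datafile.suffix.lstrip("."),
        "size_bytes": stat.st_size,
    }
    metafile = datafile.with_name(datafile.name + ".meta.json")
    metafile.write_text(json.dumps(metadata, indent=2))
    return metadata


if __name__ == "__main__":
    # hypothetical example file and values, written to a temporary folder
    example = Path(tempfile.mkdtemp()) / "example.csv"
    example.write_text("t,x\n0,1.0\n1,0.5\n")
    print(write_metadata(example, creator="Your Name", method="simulation"))
```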

Data File Formats#

It is always better to save your data in a common file format so that they can be read by many data processing tools.

CSV, TSV#

Values separated by commas (CSV) or tabs (TSV), arranged in multiple lines like a table. These are still commonly used because of their simplicity.

XML#

https://www.xml.org Keys and values stored in a form similar to HTML. Often used to store metadata.

JSON#

https://www.json.org/json-en.html Common in exchanging large data with multiple components.
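Both CSV and JSON can be written and read with the Python standard library. The following is a minimal sketch with hypothetical example data and function names. It saves the same small data set in both formats and loads it back.

```python
"""Save and load the same small data set as CSV and as JSON."""

import csv
import json

# hypothetical example data: time points and measured values
data = {"t": [0.0, 0.5, 1.0], "x": [1.0, 0.6, 0.4]}


def save_csv(data, filename):
    """Save the columns as a table with a header row."""
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(data.keys())
        writer.writerows(zip(*data.values()))


def load_csv(filename):
    """Load a table with a header row into a dictionary of float columns."""
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        columns = list(zip(*reader))
    return {key: [float(v) for v in col] for key, col in zip(header, columns)}


def save_json(data, filename):
    """Save the dictionary as it is; nested structures are also allowed."""
    with open(filename, "w") as f:
        json.dump(data, f, indent=2)


def load_json(filename):
    """Load the dictionary saved by save_json."""
    with open(filename) as f:
        return json.load(f)
```

Note that CSV stores everything as text, so the loader has to convert the values back to numbers, while JSON keeps the distinction between numbers, strings, and lists.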

HDF5#

https://www.hdfgroup.org Hierarchical data format that can also store binary data.

Some domain-specific data formats are based on HDF5, such as Neurodata Without Borders (NWB) https://www.nwb.org

Version Control System#

Software development is a repetition of coding, testing, and improving. A version control system (VCS) allows:

  • parallel development of parts and their re-integration

  • tracing back to previous versions when a problem is detected

Git#

The most popular VCS today is Git, created by Linus Torvalds for developing Linux.

If Git is not installed, install it in one of the following ways.

Mac:

  • Install Xcode from the App Store

  • or install the Xcode Command Line Tools by xcode-select --install

  • or install Homebrew and run brew install git

Windows:

Detailed documentation can be found at https://git-scm.com/docs

Starting a repository#

In your working directory, you can start a new repository by git init

Here we use an example of a directory containing a Python module cell.py that we created in Chapter 4.

%mkdir cell
%cd cell
mkdir: cell: File exists
/Users/doya/OIST Dropbox/kenji doya/Python/iSciComp/cell
%%file cell.py
"""Classes for cells"""

import numpy as np
import matplotlib.pyplot as plt

class Cell:
    """Class for a cell"""

    def __init__(self, position=(0, 0), radius=0.1, color=(1, 0, 0, 0.5)):
        """Make a new cell"""
        self.position = np.array(position)
        self.radius = radius
        self.color = color

    def show(self):
        """Visualize as a circle"""
        c = plt.Circle(self.position, self.radius, color=self.color)
        plt.gca().add_patch(c)
        plt.axis('equal')

if __name__ == "__main__":
    c0 = Cell()
    c0.show()
    plt.show()
Overwriting cell.py
%pwd
'/Users/doya/OIST Dropbox/kenji doya/Python/iSciComp/cell'

Try running this code.

%run cell.py
[Figure: output of cell.py, a cell drawn as a circle]

Now we create a new repository.

!git init
Reinitialized existing Git repository in /Users/doya/OIST Dropbox/kenji doya/Python/iSciComp/cell/.git/

This creates a hidden folder .git for bookkeeping.

%ls -a
./           ../          .git/        __pycache__/ cell.py      gcell.py
# The contents of .git folder
%ls .git
COMMIT_EDITMSG  config          index           objects/
HEAD            description     info/           refs/
ORIG_HEAD       hooks/          logs/

You can check the status of the repository by git status

!git status
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	__pycache__/

nothing added to commit but untracked files present (use "git add" to track)

Staging and Committing files#

You can use git add to add files for staging.

And then git commit to register a version.

!git add cell.py
!git status
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	__pycache__/

nothing added to commit but untracked files present (use "git add" to track)

Register the current version by git commit, with a message given by the -m option.

!git commit -m "initial version"
!git status
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	__pycache__/

nothing added to commit but untracked files present (use "git add" to track)
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	__pycache__/

nothing added to commit but untracked files present (use "git add" to track)

Registering changes#

After editing a file, you can register a new version by git add and then git commit.

Please edit cell.py or add another file to the directory and check the status.

!git status
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	__pycache__/

nothing added to commit but untracked files present (use "git add" to track)

Use git add to stage updated files.

!git add cell.py
!git status
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	__pycache__/

nothing added to commit but untracked files present (use "git add" to track)

And then git commit the changes.

!git commit -m "cell.py updated"
!git status
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	__pycache__/

nothing added to commit but untracked files present (use "git add" to track)
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	__pycache__/

nothing added to commit but untracked files present (use "git add" to track)

You can see what was changed by git show.

!git show
commit ada6e011b864c93c202a176b18ab3b01db8f4801 (HEAD -> main, myBranch)
Author: Kenji Doya <doya@oist.jp>
Date:   Mon Dec 16 14:54:00 2024 +0900

    added gcell.py

diff --git a/gcell.py b/gcell.py
new file mode 100644
index 0000000..cd7b2a5
--- /dev/null
+++ b/gcell.py
@@ -0,0 +1,25 @@
+"""Classes for cells"""
+
+import numpy as np
+import matplotlib.pyplot as plt
+import cell
+
+class gCell(cell.Cell):
+    """Class of growing cell based on Cell class"""
+    
+    def grow(self, scale=2):
+        """Grow the area of the cell"""
+        self.radius *= np.sqrt(scale)
+        
+    def duplicate(self):
+        """Make a copy with a random shift"""
+        c = gCell(self.position+np.random.randn(2)*self.radius, self.radius, self.color)
+        return c
+
+if __name__ == "__main__":
+    c0 = gCell()
+    c0.show()
+    c1 = c0.duplicate()
+    c1.grow()
+    c1.show()
+    plt.show()

You can check the revision history by git log

!git log
commit ada6e011b864c93c202a176b18ab3b01db8f4801 (HEAD -> main, myBranch)
Author: Kenji Doya <doya@oist.jp>
Date:   Mon Dec 16 14:54:00 2024 +0900

    added gcell.py

commit fba656aec31be16236f9de2b429e91fac5ef7b2a
Author: Kenji Doya <doya@oist.jp>
Date:   Mon Dec 16 14:53:57 2024 +0900

    initial version
!pwd
/Users/doya/OIST Dropbox/kenji doya/Python/iSciComp/cell

Branch#

You can create a new branch to update the code while keeping the current version untouched.

After creating a branch, switch to that branch by git checkout.

!git branch myBranch
!git checkout myBranch
fatal: A branch named 'myBranch' already exists.
Switched to branch 'myBranch'

Now let us add a new module gcell.py, which defines the class gCell.

!git status
On branch myBranch
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	__pycache__/

nothing added to commit but untracked files present (use "git add" to track)
%%file gcell.py
"""Classes for cells"""

import numpy as np
import matplotlib.pyplot as plt
import cell

class gCell(cell.Cell):
    """Class of growing cell based on Cell class"""
    
    def grow(self, scale=2):
        """Grow the area of the cell"""
        self.radius *= np.sqrt(scale)
        
    def duplicate(self):
        """Make a copy with a random shift"""
        c = gCell(self.position+np.random.randn(2)*self.radius, self.radius, self.color)
        return c

if __name__ == "__main__":
    c0 = gCell()
    c0.show()
    c1 = c0.duplicate()
    c1.grow()
    c1.show()
    plt.show()
Overwriting gcell.py
!ls
__pycache__ cell.py     gcell.py
%run gcell.py
[Figure: output of gcell.py, the original cell and its shifted, grown copy drawn as circles]

Then git add and git commit.

!git add gcell.py
!git commit -m "added gcell.py"
!git status
On branch myBranch
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	__pycache__/

nothing added to commit but untracked files present (use "git add" to track)
On branch myBranch
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	__pycache__/

nothing added to commit but untracked files present (use "git add" to track)
!git show
commit ada6e011b864c93c202a176b18ab3b01db8f4801 (HEAD -> myBranch, main)
Author: Kenji Doya <doya@oist.jp>
Date:   Mon Dec 16 14:54:00 2024 +0900

    added gcell.py

diff --git a/gcell.py b/gcell.py
new file mode 100644
index 0000000..cd7b2a5
--- /dev/null
+++ b/gcell.py
@@ -0,0 +1,25 @@
+"""Classes for cells"""
+
+import numpy as np
+import matplotlib.pyplot as plt
+import cell
+
+class gCell(cell.Cell):
+    """Class of growing cell based on Cell class"""
+    
+    def grow(self, scale=2):
+        """Grow the area of the cell"""
+        self.radius *= np.sqrt(scale)
+        
+    def duplicate(self):
+        """Make a copy with a random shift"""
+        c = gCell(self.position+np.random.randn(2)*self.radius, self.radius, self.color)
+        return c
+
+if __name__ == "__main__":
+    c0 = gCell()
+    c0.show()
+    c1 = c0.duplicate()
+    c1.grow()
+    c1.show()
+    plt.show()
!git log --all --graph
* commit ada6e011b864c93c202a176b18ab3b01db8f4801 (HEAD -> myBranch, main)
| Author: Kenji Doya <doya@oist.jp>
| Date:   Mon Dec 16 14:54:00 2024 +0900
| 
|     added gcell.py
| 
* commit fba656aec31be16236f9de2b429e91fac5ef7b2a
  Author: Kenji Doya <doya@oist.jp>
  Date:   Mon Dec 16 14:53:57 2024 +0900
  
      initial version

You can go back to a previous branch by git checkout.

!git checkout main
!git log --all --graph
Switched to branch 'main'
* commit ada6e011b864c93c202a176b18ab3b01db8f4801 (HEAD -> main, myBranch)
| Author: Kenji Doya <doya@oist.jp>
| Date:   Mon Dec 16 14:54:00 2024 +0900
| 
|     added gcell.py
| 
* commit fba656aec31be16236f9de2b429e91fac5ef7b2a
  Author: Kenji Doya <doya@oist.jp>
  Date:   Mon Dec 16 14:53:57 2024 +0900
  
      initial version
%ls
__pycache__/ cell.py      gcell.py
!git branch
* main
  myBranch

You can merge another branch into the current branch by git merge.

!git merge myBranch
!git log --all --graph
Already up to date.
* commit ada6e011b864c93c202a176b18ab3b01db8f4801 (HEAD -> main, myBranch)
| Author: Kenji Doya <doya@oist.jp>
| Date:   Mon Dec 16 14:54:00 2024 +0900
| 
|     added gcell.py
| 
* commit fba656aec31be16236f9de2b429e91fac5ef7b2a
  Author: Kenji Doya <doya@oist.jp>
  Date:   Mon Dec 16 14:53:57 2024 +0900
  
      initial version
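The outputs above come from a repository that already existed, so several commands report that there was nothing new to do. To see the whole workflow from a clean state, the commands can be scripted. The following is only a sketch: it assumes that git is installed, calls it through subprocess in a fresh temporary directory, and uses placeholder file contents. The helper names git and demo, and the user name and e-mail passed by -c, are hypothetical.

```python
"""Run the basic Git workflow in a fresh temporary directory."""

import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run a git command in cwd and return its standard output."""
    # user.name and user.email are given here only so that the demo can commit
    # on a machine where Git has not been configured (hypothetical values)
    cmd = ["git", "-c", "user.name=Demo", "-c", "user.email=demo@example.com", *args]
    result = subprocess.run(cmd, cwd=cwd, check=True, capture_output=True, text=True)
    return result.stdout


def demo():
    """init, add, commit, branch, commit, and merge; return the log."""
    repo = Path(tempfile.mkdtemp())
    git("init", cwd=repo)
    # name of the initial branch: main or master, depending on the Git settings
    base = git("symbolic-ref", "--short", "HEAD", cwd=repo).strip()

    # first version on the initial branch (placeholder file content)
    (repo / "cell.py").write_text('"""Classes for cells"""\n')
    git("add", "cell.py", cwd=repo)
    git("commit", "-m", "initial version", cwd=repo)

    # new branch with an additional file
    git("branch", "myBranch", cwd=repo)
    git("checkout", "myBranch", cwd=repo)
    (repo / "gcell.py").write_text('"""Classes for growing cells"""\n')
    git("add", "gcell.py", cwd=repo)
    git("commit", "-m", "added gcell.py", cwd=repo)

    # back to the initial branch and merge the new branch into it
    git("checkout", base, cwd=repo)
    git("merge", "myBranch", cwd=repo)
    return git("log", "--oneline", cwd=repo)


if __name__ == "__main__":
    print(demo())
```

Since the initial branch has no new commits of its own, the merge here simply moves it forward to the latest commit of myBranch.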

GitHub#

GitHub is currently the most popular cloud service for sharing software. It is free for open software.

This is a good platform for sharing programs, or in some cases text data and manuscripts, among collaborators. It is also helpful for a single-person project, for handing over to a future member of your lab, for open access after publication, or for yourself after some time.

These are typical steps in contributing to a project on GitHub.

  • Join as a member of a repository.

  • Copy the existing files and see how they work.

  • Make a new branch and add or modify the code.

  • After testing locally, commit the new version.

  • Open a pull request for other members to test your revision.

  • Your pull request is merged into the main branch.

See “Hello World” in GitHub Guide for details (https://guides.github.com).

Cloning a repository#

If you just use a copy of stable software and are not going to contribute your changes, downloading a zip file is fine.

But if you would like to contribute to joint development, or keep up with updates, git clone is the better way.

Cloning odes repository#

Let us try with a simple ODE simulator odes.py on:
doya-oist/odes

To download a copy of the repository, run

git clone https://github.com/doya-oist/odes.git

This should create a folder odes.

%cd ..
/Users/doya/OIST Dropbox/kenji doya/Python/iSciComp
%pwd
'/Users/doya/OIST Dropbox/kenji doya/Python/iSciComp'
!git clone https://github.com/doya-oist/odes.git
fatal: destination path 'odes' already exists and is not an empty directory.
%ls
01_Introduction.ipynb                  09_Stochastic_Sol.ipynb
01_Introduction_Ex.ipynb               10_Management.ipynb
01_Introduction_Sol.ipynb              10_Management.ipynbのコピー
02_Visualization.ipynb                 10_Management_Ex.ipynb
02_Visualization_3D.ipynb              10_Management_Ex.ipynbのコピー
02_Visualization_Animation.ipynb       ComputationalMethods2022/
02_Visualization_Ex.ipynb              LICENSE
02_Visualization_Sol.ipynb             README.md
03_Matrix.ipynb                        References.ipynb
03_Matrix_Ex.ipynb                     SC_logo.png
03_Matrix_Exx.ipynb                    Solutions.ipynb
03_Matrix_Sol.ipynb                    VdP.pdf
04_Function.ipynb                      __pycache__/
04_Function_Ex.ipynb                   _build/
04_Function_Sol.ipynb                  _config.yml
05_Iteration.ipynb                     _toc.yml
05_Iteration_Ex.ipynb                  cell/
05_Iteration_Sol.ipynb                 cell1/
06_ODE.ipynb                           cell2/
06_ODE_Ex.ipynb                        data/
06_ODE_Exx.ipynb                       figures/
06_ODE_Sol.ipynb                       hello.py
07_PDE.ipynb                           iSciComp.bib
07_PDE_Ex.ipynb                        iSciComp.ipynb
07_PDE_Sol.ipynb                       index.html
07_PDE_Turing.ipynb                    lpnorm.py
08_Optimization.ipynb                  odes/
08_Optimization_Ex.ipynb               odes1/
08_Optimization_Exx.ipynb              odes2/
08_Optimization_Sol.ipynb              odesim/
09_Stochastic.ipynb                    pend.gif
09_Stochastic_Ex.ipynb

Move into the folder and test the odes.py program.

%cd odes
/Users/doya/OIST Dropbox/kenji doya/Python/iSciComp/odes
%ls
LICENSE      __pycache__/ odes.py      vdp.py
README.md    first.py     second.py

From the console, you can load the module and run it interactively as:

python -i odes.py

sim = odes('first')

sim.run()

from odes import *
sim = odes('first')
Importing ODE: first
sim.run()
t= 10.0 ; state= [-0.36787947]
[Figure: plot produced by sim.run() for the ODE 'first']

Your branch#

Now make your own branch, check it out, and add your own ODE module.

!git branch myname
!git checkout myname
fatal: A branch named 'myname' already exists.
Already on 'myname'

Make a copy of a dynamics file first.py or second.py, implement your own ODE, and save it under a new name, e.g., vdp.py.

%%file vdp.py
# vdp.py
# van der Pol oscillator
# Dec. 2018 by Kenji Doya

import numpy as np

# Right-hand-side function of the ODE
def dynamics(y, t, mu=1.):
    """van der Pol oscillator:
        d2y/dt2 = mu*(1 - y**2)*dy/dt - y"""
    y1, y2 = y
    return np.array([y2, mu*(1 - y1**2)*y2 - y1])

# Default parameters
parameters = 1.

# Default initial state
initial_state = [1, 0]
Overwriting vdp.py
%pwd
'/Users/doya/OIST Dropbox/kenji doya/Python/iSciComp/odes'

Run odes and confirm that your ODE runs appropriately.

sim = odes('vdp')
Importing ODE: vdp
sim.run()
t= 10.0 ; state= [-1.58203155  0.73418353]
[Figure: plot produced by sim.run() for the ODE 'vdp']
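Independently of the simulator, the dynamics function itself can be checked with inputs for which the correct output is known. The sketch below copies the dynamics function of vdp.py so that it runs on its own. The chosen test points are only examples, not part of the odes repository.

```python
"""Sanity checks of the van der Pol dynamics with known outputs."""

import numpy as np


def dynamics(y, t, mu=1.):
    """van der Pol oscillator (same as in vdp.py above)"""
    y1, y2 = y
    return np.array([y2, mu*(1 - y1**2)*y2 - y1])


# The origin is a fixed point: both derivatives are zero
assert np.allclose(dynamics([0, 0], 0), [0, 0])

# At the default initial state [1, 0]: dy1/dt = 0, dy2/dt = -1
assert np.allclose(dynamics([1, 0], 0), [0, -1])

# With mu=0 the equation reduces to a harmonic oscillator d2y/dt2 = -y
assert np.allclose(dynamics([0.5, 2.0], 0, mu=0), [2.0, -0.5])
```

If any of the assertions fails, the error is in the right-hand-side function rather than in the simulator, which narrows down where to look.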

Then you can add and commit your change.

!git status
On branch myname
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	__pycache__/

nothing added to commit but untracked files present (use "git add" to track)
!git add vdp.py
!git commit -m "adding my model vdp.py"
On branch myname
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	__pycache__/

nothing added to commit but untracked files present (use "git add" to track)
!git log --graph --all
* commit c76d3ef44edd3033e1411f3f609747ddd2362eb1 (HEAD -> myname)
| Author: Kenji Doya <doya@oist.jp>
| Date:   Mon Dec 16 14:54:04 2024 +0900
| 
|     adding my model vdp.py
| 
* commit 670bad23cf98271d2017e8d9034ae3337bad3122 (origin/main, origin/HEAD, main)
| Author: Kenji Doya <doya@oist.jp>
| Date:   Mon Dec 16 14:28:15 2024 +0900
| 
|     first set of files
| 
* commit 0c1c8b1334ffe99caf36f502b5f400a1eeb84121
  Author: Kenji Doya <doya@oist.jp>
  Date:   Mon Dec 16 12:42:22 2024 +0900
  
      Initial commit

Now push your branch to the GitHub repository by, e.g.,

git push origin myname

!git push origin myname
Username for 'https://github.com': 
^C

Check the status on GitHub: oist/ComputationalMethods2022

and make a pull request for the repository administrator to check your updates.

The administrator may reply with a request for revision or merge your change into the main branch.

Pulling updates#

While you are working on your local code, the code in the original repository may be updated. You may also want to check the branches other people have created.

You can use git pull to bring the changes on GitHub into your local repository.

You can use git branch to see what branches exist and git checkout to try the code in other branches.

!git pull
!git branch

(Optional) In addition to adding a new module, you are welcome to improve the main program odes.py itself. For example,

  • add other visualizations such as a phase plot.

  • fix any bugs or improve error handling.

  • add documentation.