On the use of command line interfaces

Dear Young Tim,

You will want your analyses and overall projects to be reproducible.
Reproducibility will necessitate description.
You will need to describe what your analysis does and how to do it.
One language to describe your analysis and project will be a directed acyclic graph (DAG).

DAGs are everywhere.
Analysis is a DAG.”
Workflows are DAGs.
DAGs have nodes, by definition.

These nodes are:

These things, i.e, these nodes:

  • take inputs (possibly None),
  • act in predefined ways,
  • and return outputs (possibly None).

Managing our nodes’ inputs and outputs (I/O) will enable our project’s reproducibility. Indeed, we cannot reproduce what we did not record. We must record the inputs that led to specific outputs, to later reproduce those outputs.

Ideally, we should manage our I/O with plain text configuration files (to ease tracking). These configuration files should specify, for each node:

  • all inputs or input data locations, and
  • the locations of all outputs to be created.

Next, we need to ingest these configuration files to use them. A useful ingestion pattern is to write the nodes / functions as command line interfaces (CLIs) that take the configuration as input. See tools such as Hydra for enabling such CLIs. Configuration-driven CLIs increase reproducibility by facilitating use of program call patterns that can be easily tracked for later replay.

Finally, we can use a version-control system such as git to track and save our CLIs, our configuration files, and our workflows (ideally). This set up would also pair nicely with project templates such as Pyscaffoldext-dsproject. We could use:

  • ./scripts/ to store our CLIs,
  • ./configs/ to store our configuration files,
  • ./Makefile to store our workflow definitions
    (or use whatever common file format we desire),
  • ./src/{PROJECT_PACKAGE} to store the reusable portion of our project that is used in our CLIs and packaged / shared.

If we put all these lessons together we arrive at:

  • version control of our project using a tool like git
  • a project template such as pyscaffoldext-dsproject
  • entire project workflow defined as a DAG in plain-text specification
    (e.g. Makefile, JSON, or YAML)
  • workflow I/O managed with plain text configuration files (e.g. YAML)
  • each step of project defined as a function (i.e., a CLI) that takes the configuration as an input.

These steps alone will take us far towards making our project reproducible.