Chapter 9 Tips and Resources for Reproducible Workflow

9.1 Reproducible workflows

Below is a summary of Chapter 11 of Christensen, Freese, and Miguel (2019). If you find another book or chapter particularly helpful, please write a brief summary and submit a contribution.

Folder organization

Basic file organization is a critical component of a reproducible workflow. The following structure is recommended, but it can be adapted to accommodate different reproducers or types of research. The name of the master folder should be easy to read and meaningful to all collaborators on the project.

  • Create a master folder with a descriptive name for the project. It should contain:
    • Separate folders for programming script files, raw data, edited data, output, and final paper or article text
    • A README file that describes the contents of each folder and gives installation and operating instructions for reproducers
  • Keep raw data intact: Any edits or datasets generated using raw data should be stored in a “data” folder separate from the “raw data” folder.
  • When naming a directory or file, stick to lowercase letters with underscores (instead of spaces) to avoid cross-operating-system issues.
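The layout described above can be set up with a short script. The following is a minimal sketch, not a prescribed tool: the subfolder names, the README filename, and the helper names `safe_name` and `create_project` are all assumptions chosen for illustration, and should be adapted to the project.

```python
from pathlib import Path

# Hypothetical subfolder names, one per category listed above.
SUBFOLDERS = ["scripts", "raw_data", "data", "output", "paper"]


def safe_name(name: str) -> str:
    """Lowercase a name and replace runs of spaces with single underscores."""
    return "_".join(name.lower().split())


def create_project(parent: Path, project_name: str) -> Path:
    """Create the master folder, its subfolders, and a README stub."""
    root = parent / safe_name(project_name)
    for sub in SUBFOLDERS:
        (root / sub).mkdir(parents=True, exist_ok=True)

    # README.md is an assumed filename; the stub is only written once,
    # so rerunning the script never overwrites an edited README.
    readme = root / "README.md"
    if not readme.exists():
        lines = [f"# {project_name}", "", "Folder contents:"]
        lines += [f"- {sub}/: TODO describe" for sub in SUBFOLDERS]
        lines += ["", "Installation and operating instructions: TODO"]
        readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return root
```

Calling `create_project(Path("."), "Minimum Wage Study")` would create a folder named `minimum_wage_study` in the current directory. Because existing folders and an existing README are left alone, the script can be rerun safely.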

Efficient and readable programming

The core of programming for reproducibility is to write code wherever possible. Writing scripts leaves a record of any changes to the data, which allows other researchers to reproduce the work exactly. It is also helpful to comment your code, both to explain the reasoning behind changes and to document any steps that had to be done with point-and-click methods.

  • Leave a record of any changes to the data: Write code in the programming environment, instead of modifying data by hand in a spreadsheet or relying on point-and-click options.
  • Include comments in code to explain changes, and save intermediate datasets used in analysis.
  • Give variables names that will be informative to reproducers.
  • Use relative directory paths, not absolute paths, so the work can be more easily reproduced from different computers.
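These points can be illustrated with a small cleaning script. This is a sketch under stated assumptions: the file names (`wages.csv`, `wages_clean.csv`), the column name `hourly_wage`, and the function `clean_wages` are hypothetical and do not come from the book.

```python
import csv
from pathlib import Path

# Paths are relative to the project root, so the script runs unchanged
# on any computer. Both file names are hypothetical.
RAW_FILE = Path("raw_data") / "wages.csv"
CLEAN_FILE = Path("data") / "wages_clean.csv"


def clean_wages(project_root: Path = Path(".")) -> Path:
    """Read the raw data, apply documented edits, save an intermediate dataset."""
    with (project_root / RAW_FILE).open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    # Drop respondents with a missing hourly wage: they cannot enter the
    # wage analysis. The raw file itself is never modified.
    rows_with_wage = [row for row in rows if row["hourly_wage"].strip() != ""]

    # Save the edited data in the separate "data" folder.
    out_path = project_root / CLEAN_FILE
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows_with_wage)
    return out_path
```

Run from the project's master folder, `clean_wages()` reads `raw_data/wages.csv` and writes `data/wages_clean.csv`. The comment on the filtering step records why rows were dropped, which a spreadsheet edit would not.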

Version control

Version control software is used to keep a record of changes to project files. Although it is possible to manually track changes in a central research log or as notes in individual script files, many social scientists recognize the benefits of a distributed version control system. Because each collaborator is able to have a local copy of the project’s entire work history, these systems are particularly suited to collaborative projects. Below are methods to manually track changes and a brief explanation of Git, a popular distributed version control system.

  • Maintain a written record of work.
    • In a central research log: Log activities in a single central file every time work is done on the project (keep track of “which team member writes what code, produces what output, edits which files, and when”).
    • In individual script files: Record “who edited which part of which file when, and why.”
    • With a version control system, such as Git: Git records changes made to files, by whom, and when.
  • A brief explanation of Git: Users add changed files to the staging area, then commit those changes to the project folder, or repository. Each commit records the new version of every staged file under its existing filename, so earlier versions remain retrievable.
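For teams that track changes manually, the central research log described above can be kept consistent with a small helper. The sketch below is one possible format, not one the book prescribes: the function name `log_activity`, the pipe-separated layout, and the UTC timestamp are all assumptions.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_activity(log_file: Path, member: str, action: str, files: list[str]) -> str:
    """Append one timestamped entry to the central research log and return it."""
    # Record when (UTC, so collaborators in different time zones agree),
    # who, what was done, and which files were touched.
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = f"{timestamp} | {member} | {action} | {', '.join(files)}"

    # Open in append mode so earlier entries are never overwritten.
    with log_file.open("a", encoding="utf-8") as f:
        f.write(entry + "\n")
    return entry
```

A call such as `log_activity(Path("research_log.txt"), "A. Researcher", "dropped rows with missing wages", ["scripts/clean_wages.py"])` adds one line to the log. A version control system such as Git records the same information automatically with each commit, which is why it scales better for larger teams.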

References

Christensen, Garret, Jeremy Freese, and Edward Miguel. 2019. Transparent and Reproducible Social Science Research: How to Do Open Science. University of California Press.