Chapter 9 Examples of Reproduction Trees
A reproduction tree is a diagram, generated by the Diagram Builder, that represents all the available data and code behind a specific display item. The tree is meant to represent the entire computational workflow behind a result from the paper. It allows reproducers to trace a display item to its primary sources. It can also be used to guide users of the reproduction package and/or to identify missing components for a complete reproduction.
A reproduction tree is complete when it is possible to connect its output (a given display item) with all of its inputs down to the raw data. A reproduction tree is incomplete when it is not possible to connect all the inputs to the resulting display item. Paraphrasing the author Leo Tolstoy, complete workflows are all alike; every incomplete computational workflow is incomplete in its own way. This chapter presents a few examples of reproduction trees, focusing particularly on the many possible ways in which a tree could be incomplete. If you have a reproduction tree that contains an instructive example, please contribute it to this chapter (via a pull request or by emailing your reproduction tree to ACRE@berkeley.edu).
9.1 Stylized examples
9.1.1 Complete reproduction tree
Below is an example of output from the Diagram Builder for a display item that can be fully constructed using the files contained in the reproduction package. The diagram displays files as outputs and inputs to code scripts all the way down to the raw data.
table 1
└───[code] formatting_table1.R
    ├───output1_part1.txt
    |   └───[code] output_table1.do
    |       └───[data] analysis_data01.csv
    |           └───[code] data_cleaning01.R
    |               └───[data] survey_01raw.csv
    └───output1_part2.txt
        └───[code] output_table2.do
            └───[data] analysis_data02.csv
                └───[code] data_cleaning02.R
                    └───[data] admin_01raw.csv
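The way such a diagram follows from script-level records can be sketched in a few lines of Python. This is a minimal illustration, not the Diagram Builder's actual implementation: the `SCRIPTS` records and the `render_tree` helper are hypothetical, and the sketch assumes the workflow contains no circular dependencies.

```python
# Hypothetical records: each code script with the files it reads and writes.
SCRIPTS = [
    {"file_name": "formatting_table1.R",
     "inputs": ["output1_part1.txt", "output1_part2.txt"], "outputs": ["table 1"]},
    {"file_name": "output_table1.do",
     "inputs": ["analysis_data01.csv"], "outputs": ["output1_part1.txt"]},
    {"file_name": "data_cleaning01.R",
     "inputs": ["survey_01raw.csv"], "outputs": ["analysis_data01.csv"]},
    {"file_name": "output_table2.do",
     "inputs": ["analysis_data02.csv"], "outputs": ["output1_part2.txt"]},
    {"file_name": "data_cleaning02.R",
     "inputs": ["admin_01raw.csv"], "outputs": ["analysis_data02.csv"]},
]


def render_tree(target, scripts, depth=0):
    """Return indented lines tracing `target` back through the scripts that produce it."""
    lines = ["    " * depth + target]
    for script in scripts:
        if target in script["outputs"]:
            lines.append("    " * (depth + 1) + "[code] " + script["file_name"])
            for item in script["inputs"]:
                lines.extend(render_tree(item, scripts, depth + 2))
    return lines


tree_lines = render_tree("table 1", SCRIPTS)
print("\n".join(tree_lines))
```

Each branch ends at a file that no listed script produces. In a complete tree, those end points are all raw data files.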
9.1.2 Incomplete reproduction tree
9.1.2.1 Raw data and analytic data are available, but cleaning code is missing.
Below is an example of output from the Diagram Builder for a display item that is missing some of the code needed to generate it from the raw data. There are two reasons to suspect that this workflow is incomplete: (i) there is no clear data cleaning step (only analysis that generates output, and formatting), and (ii) there are unused files that are likely to be raw data. Neither of these reasons can confirm unequivocally that the tree is incomplete, but a reproducer familiar with the paper and its data sources could use the tree to certify its (in)completeness and request missing files.
table 1
└───[code] formatting_table1.R
    ├───output1_part1.txt
    |   └───[code] output_table1.do
    |       └───[data] analysis_data01.csv
    └───output1_part2.txt
        └───[code] output_table2.do
            └───[data] analysis_data02.csv
Unused files:
- survey_01raw.csv
- admin_01raw.csv
Reproducers are asked to speculate on where the missing files might go, and hence to propose what a complete tree might look like (where possible). For this example, we have assumed there are missing code scripts that at some point take in survey_01raw.csv and admin_01raw.csv, and eventually output analysis_data01.csv and analysis_data02.csv, though this requires the reproducer’s discretion.
table 1
└───[code] formatting_table1.R
    ├───output1_part1.txt
    |   └───[code] output_table1.do
    |       └───[data] analysis_data01.csv
    |           └───[code] MISSING FILE(S)
    |               └───[data] survey_01raw.csv
    └───output1_part2.txt
        └───[code] output_table2.do
            └───[data] analysis_data02.csv
                └───[code] MISSING FILE(S)
                    └───[data] admin_01raw.csv
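The two warning signs discussed in this section can be checked mechanically before the reproducer applies any judgement. The sketch below is an illustration under stated assumptions rather than a feature of the Diagram Builder: the `SCRIPTS` records, the `RAW_DATA` set, and the `leaves` helper are hypothetical. It collects the files at which the tree stops and compares them with the files believed to be raw data.

```python
# Hypothetical records for the incomplete package: the cleaning scripts are absent.
SCRIPTS = [
    {"file_name": "formatting_table1.R",
     "inputs": ["output1_part1.txt", "output1_part2.txt"], "outputs": ["table 1"]},
    {"file_name": "output_table1.do",
     "inputs": ["analysis_data01.csv"], "outputs": ["output1_part1.txt"]},
    {"file_name": "output_table2.do",
     "inputs": ["analysis_data02.csv"], "outputs": ["output1_part2.txt"]},
]
# Assumption: the reproducer has identified these files as raw data.
RAW_DATA = {"survey_01raw.csv", "admin_01raw.csv"}


def leaves(target, scripts):
    """Return the files reachable from `target` that no listed script produces."""
    producers = [s for s in scripts if target in s["outputs"]]
    if not producers:
        return {target}
    found = set()
    for script in producers:
        for item in script["inputs"]:
            found |= leaves(item, scripts)
    return found


tree_leaves = leaves("table 1", SCRIPTS)
dead_ends = tree_leaves - RAW_DATA      # the tree stops here, but not at raw data
unreached_raw = RAW_DATA - tree_leaves  # raw data the tree never reaches

print(sorted(dead_ends))
print(sorted(unreached_raw))
```

A non-empty `dead_ends` alongside a non-empty `unreached_raw` suggests, but does not prove, that code connecting the two is missing.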
9.1.3 Unused data sources
It is possible that not all data included in a reproduction package are actually used by the code scripts in that package. This would be the case if, for example, the raw data and analysis data are included, but not the script that generates the analysis data. As a concrete example, consider what the original diagram above would look like if the only code included in the reproduction package were analysis.R:
table1.tex
|___[code] analysis.R
    |___analysis_data.dta
Unused data sources:
raw_1.dta
raw_2.dta
raw_3.dta
raw_4.dta
Unused analysis data:
cleaned_1.dta
cleaned_2.dta
cleaned_3.dta
cleaned_4.dta
merged_1_2.dta
merged_3_4.dta
cleaned_1_2.dta
cleaned_3_4.dta
In this case, many of the data files listed in the raw data and analytic data spreadsheets are not used by any code script in the reproduction package.
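The unused-file lists amount to a set difference between the data files that are declared and the files that some script reads. The following sketch makes that explicit. The variable names are hypothetical, and the contents of `ANALYSIS_DATA` are assumed from the listing above plus analysis_data.dta.

```python
# Hypothetical inputs: the only code script in the package and the files it reads.
CODE_INPUTS = {"analysis.R": ["analysis_data.dta"]}

# Files declared in the raw data and analytic data spreadsheets (assumed contents).
RAW_DATA = ["raw_1.dta", "raw_2.dta", "raw_3.dta", "raw_4.dta"]
ANALYSIS_DATA = [
    "cleaned_1.dta", "cleaned_2.dta", "cleaned_3.dta", "cleaned_4.dta",
    "merged_1_2.dta", "merged_3_4.dta", "cleaned_1_2.dta", "cleaned_3_4.dta",
    "analysis_data.dta",
]

# Every file that at least one script lists as an input.
used = {name for inputs in CODE_INPUTS.values() for name in inputs}

unused_raw = [name for name in RAW_DATA if name not in used]
unused_analysis = [name for name in ANALYSIS_DATA if name not in used]

print("Unused data sources:", unused_raw)
print("Unused analysis data:", unused_analysis)
```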
9.2 Examples from real reproduction attempts
9.2.1 Possibly missing code for producing a display item.
This reproduction diagram fragment likely shows a missing piece of code. In a complete reproduction package there are no unused files and all final outputs are display items. Here, cps2018.dta and cps_march2017.dta are included as inputs in the reproduction package but are never used to make a display item, i.e., no code script lists them as inputs. Likewise, PublicSalary.dta is listed as a final output, meaning that it, too, is not used to make a display item, since it is not listed as an input for any code script. Perhaps Paper2_PoExitDataset.dta is made by a missing code script that takes PublicSalary.dta, cps2018.dta, and cps_march2017.dta as inputs? It may be that the only way to find out for sure is to contact the study author(s), in which case such a diagram can help them identify any missing files.
PublicSalary.dta
|___MakeData2.do
    |___Paper2_ProviderDataset.dta
    |___Paper2_SSPDataset_wIRT_final.dta
    |___PublicFacilitySurvey_clean.dta
Table1.xml
|___Table1.do
    |___ProviderData.dta
    |   |___MakeData4.do
    |       |___Paper2_PoExitDataset.dta
    |       |___Paper2_ProviderDataset.dta
    |       |___Paper2_SSPDataset_wIRT_final.dta
    |___VillageDataset.dta
    |   |___MakeData8.do
    |       |___Paper2_HouseholdDataset1.dta
    |       |___Paper2_VillageDataset.dta
    |___HouseholdDataset.dta
        |___MakeData8.do
            |___Paper2_HouseholdDataset1.dta
            |___Paper2_VillageDataset.dta
Unused data sources:
cps2018.dta
cps_march2017.dta
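The rule that all final outputs should be display items can be checked with a short script. The sketch below is an illustration only: the `SCRIPTS` mapping is an assumed transcription of the diagram above (each script with its inputs and outputs), and `DISPLAY_ITEMS` is the set of outputs the reproducer treats as display items.

```python
# Assumed transcription of the diagram: script -> (inputs, outputs).
SCRIPTS = {
    "MakeData2.do": (
        ["Paper2_ProviderDataset.dta", "Paper2_SSPDataset_wIRT_final.dta",
         "PublicFacilitySurvey_clean.dta"],
        ["PublicSalary.dta"],
    ),
    "MakeData4.do": (
        ["Paper2_PoExitDataset.dta", "Paper2_ProviderDataset.dta",
         "Paper2_SSPDataset_wIRT_final.dta"],
        ["ProviderData.dta"],
    ),
    "MakeData8.do": (
        ["Paper2_HouseholdDataset1.dta", "Paper2_VillageDataset.dta"],
        ["VillageDataset.dta", "HouseholdDataset.dta"],
    ),
    "Table1.do": (
        ["ProviderData.dta", "VillageDataset.dta", "HouseholdDataset.dta"],
        ["Table1.xml"],
    ),
}
DISPLAY_ITEMS = {"Table1.xml"}

all_inputs = {f for inputs, _ in SCRIPTS.values() for f in inputs}
all_outputs = {f for _, outputs in SCRIPTS.values() for f in outputs}

# A final output is produced by some script but never read by another one.
final_outputs = all_outputs - all_inputs
suspicious_outputs = final_outputs - DISPLAY_ITEMS

print(sorted(suspicious_outputs))
```

Under these assumptions the only flagged file is PublicSalary.dta, which matches the reading of the diagram given above.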
9.2.2 Long, complicated tree.
This diagram shows that the production of a given display item may be very complicated, highlighting the usefulness of the Diagram Builder as a visualization tool. In reproducing such a complicated display item, it can be useful to have such a diagram to determine, for example, the order in which code scripts should be run, what files might depend on a faulty code script, or which files are necessary to keep if the goal is to only produce the specific display item.
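Two of these uses, finding a valid run order and listing the files that depend on a faulty script, can be sketched with Python's standard library (`graphlib`, available from Python 3.9). The example reuses the file names from the stylized complete tree in Section 9.1.1 rather than the real diagram. The `SCRIPTS` mapping and the `affected_files` helper are hypothetical and are not part of the Diagram Builder.

```python
from graphlib import TopologicalSorter

# Hypothetical records: script -> (inputs, outputs).
SCRIPTS = {
    "formatting_table1.R": (["output1_part1.txt", "output1_part2.txt"], ["table 1"]),
    "output_table1.do": (["analysis_data01.csv"], ["output1_part1.txt"]),
    "data_cleaning01.R": (["survey_01raw.csv"], ["analysis_data01.csv"]),
    "output_table2.do": (["analysis_data02.csv"], ["output1_part2.txt"]),
    "data_cleaning02.R": (["admin_01raw.csv"], ["analysis_data02.csv"]),
}

# Map every produced file to the script that writes it.
producer = {out: name for name, (_, outs) in SCRIPTS.items() for out in outs}

# A script depends on every script that produces one of its inputs.
depends_on = {
    name: {producer[f] for f in ins if f in producer}
    for name, (ins, _) in SCRIPTS.items()
}
run_order = list(TopologicalSorter(depends_on).static_order())


def affected_files(faulty, scripts):
    """Return the files produced, directly or indirectly, from the outputs of `faulty`."""
    tainted = set(scripts[faulty][1])
    changed = True
    while changed:
        changed = False
        for ins, outs in scripts.values():
            if tainted & set(ins) and not set(outs) <= tainted:
                tainted |= set(outs)
                changed = True
    return tainted


print(run_order)
print(sorted(affected_files("data_cleaning01.R", SCRIPTS)))
```

Several run orders may be valid. The sorter only guarantees that each script appears after the scripts whose outputs it reads.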
9.2.3 Example of completing reproduction tree
In many cases, some of the components of the workflow will not be easily identifiable (or will be missing) in the reproduction package. Here we present a more complex example than the one presented in the Assessment chapter. In such cases, the Diagram Builder will return a partial reproduction tree diagram. For example, if the files merge_1_2.do, merge_3_4.do, and final_merge.do are missing from the previous diagram, the Diagram Builder will produce the following diagram:
cleaned_3.dta
└──[code] clean_raw_3.py
    └──raw_3.dta
table1.tex
└──[code] analysis.R
    └──analysis_data.dta
cleaned_3_4.dta
└──[code] clean_merged_3_4.do
    └──merged_3_4.dta
cleaned_1.dta
└──[code] clean_raw_1.py
    └──raw_1.dta
cleaned_2.dta
└──[code] clean_raw_2.py
    └──raw_2.dta
cleaned_4.dta
└──[code] clean_raw_4.py
    └──raw_4.dta
cleaned_1_2.dta
└──[code] clean_merged_1_2.do
    └──merged_1_2.dta
Unused data sources: None.
In this case, you can still manually combine this partial information with your knowledge of the paper and your own judgement to produce a “candidate” tree diagram (which means that different reproducers might arrive at different diagrams). This may look like the following:
table1.tex
└──[code] analysis.R
    └──analysis_data.dta
        └──MISSING_CODE_FILE_3
            └──cleaned_3_4.dta
            |   └──[code] clean_merged_3_4.do
            |       └──merged_3_4.dta
            |           └──MISSING_CODE_FILE_2
            |               └──cleaned_3.dta
            |               |   └──[code] clean_raw_3.py
            |               |       └──raw_3.dta
            |               └──cleaned_4.dta
            |                   └──[code] clean_raw_4.py
            |                       └──raw_4.dta
            └──cleaned_1_2.dta
                └──[code] clean_merged_1_2.do
                    └──merged_1_2.dta
                        └──MISSING_CODE_FILE_1
                            └──cleaned_1.dta
                            |   └──[code] clean_raw_1.py
                            |       └──raw_1.dta
                            |
                            └──cleaned_2.dta
                                └──[code] clean_raw_2.py
                                    └──raw_2.dta
To leave a record of the reconstructed diagrams, you will have to amend the input spreadsheets using placeholders for the missing components. In the example above, you should add the following entries to the code description spreadsheet:
file_name | location | inputs | outputs | description | primary_type |
---|---|---|---|---|---|
… | … | … | … | … | … |
missing_file1 | unknown | cleaned_1.dta, cleaned_2.dta | merged_1_2.dta | missing code | unknown |
missing_file2 | unknown | cleaned_3.dta, cleaned_4.dta | merged_3_4.dta | missing code | unknown |
missing_file3 | unknown | cleaned_3_4.dta, cleaned_1_2.dta | analysis_data.dta | missing code | unknown |
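The effect of adding placeholder rows can be checked by tracing the display item back to its sources before and after the amendment. The sketch below is a minimal illustration, not the Diagram Builder itself. The `parse_row` and `trace_to_sources` helpers are hypothetical, rows keep only the file_name, inputs, and outputs columns, and the inputs of the third placeholder are taken from the candidate tree above (cleaned_3_4.dta and cleaned_1_2.dta).

```python
def parse_row(row):
    """Turn one (file_name, inputs, outputs) spreadsheet row into a record."""
    name, inputs, outputs = row
    return {
        "file_name": name,
        "inputs": [f.strip() for f in inputs.split(",")],
        "outputs": [f.strip() for f in outputs.split(",")],
    }


# Scripts actually present in the reproduction package.
EXISTING = [
    ("clean_raw_1.py", "raw_1.dta", "cleaned_1.dta"),
    ("clean_raw_2.py", "raw_2.dta", "cleaned_2.dta"),
    ("clean_raw_3.py", "raw_3.dta", "cleaned_3.dta"),
    ("clean_raw_4.py", "raw_4.dta", "cleaned_4.dta"),
    ("clean_merged_1_2.do", "merged_1_2.dta", "cleaned_1_2.dta"),
    ("clean_merged_3_4.do", "merged_3_4.dta", "cleaned_3_4.dta"),
    ("analysis.R", "analysis_data.dta", "table1.tex"),
]
# Placeholder rows standing in for the missing code files.
PLACEHOLDERS = [
    ("missing_file1", "cleaned_1.dta, cleaned_2.dta", "merged_1_2.dta"),
    ("missing_file2", "cleaned_3.dta, cleaned_4.dta", "merged_3_4.dta"),
    ("missing_file3", "cleaned_3_4.dta, cleaned_1_2.dta", "analysis_data.dta"),
]


def trace_to_sources(target, scripts):
    """Return the files reachable from `target` that no listed script produces."""
    producers = [s for s in scripts if target in s["outputs"]]
    if not producers:
        return {target}
    found = set()
    for script in producers:
        for item in script["inputs"]:
            found |= trace_to_sources(item, scripts)
    return found


before = trace_to_sources("table1.tex", [parse_row(r) for r in EXISTING])
after = trace_to_sources("table1.tex", [parse_row(r) for r in EXISTING + PLACEHOLDERS])

print(sorted(before))  # ['analysis_data.dta']
print(sorted(after))   # ['raw_1.dta', 'raw_2.dta', 'raw_3.dta', 'raw_4.dta']
```

With the placeholders in place, the trace from table1.tex ends at the four raw files, which is what a complete candidate tree should show.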