Chapter 9 Examples of Reproduction Trees

A reproduction tree is a diagram, generated by the Diagram Builder, that represents all the available data and code behind a specific display item. The tree is meant to represent the entire computational workflow behind a result from the paper. It allows reproducers to trace a display item to its primary sources. It can also be used to guide users of the reproduction package and/or to identify missing components for a complete reproduction.

A reproduction tree is complete when it is possible to connect its output (a given display item) with all of its inputs down to the raw data. A reproduction tree is incomplete when it is not possible to connect all the inputs to the resulting display item. To paraphrase Leo Tolstoy, complete computational workflows are all alike; every incomplete computational workflow is incomplete in its own way. This chapter presents a few examples of reproduction trees, focusing particularly on the many possible ways in which a tree could be incomplete. If you have a reproduction tree that contains an instructive example, please contribute it to this chapter (via a pull request or by emailing your reproduction tree to ACRE@berkeley.edu).

9.1 Stylized examples

9.1.1 Complete reproduction tree

Below is an example of output from the Diagram Builder for a display item that can be fully constructed using the files contained in the reproduction package. The diagram displays files as outputs and inputs to code scripts all the way down to the raw data.

          table 1
            └───[code] formatting_table1.R
                ├───output1_part1.txt  
                |   └───[code] output_table1.do           
                |       └───[data] analysis_data01.csv
                |          └───[code] data_cleaning01.R
                |             └───[data] survey_01raw.csv
                └───output1_part2.txt  
                    └───[code] output_table2.do           
                        └───[data] analysis_data02.csv
                           └───[code] data_cleaning02.R
                              └───[data] admin_01raw.csv
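
The structure of the diagram above can be sketched in a few lines of Python. This is an illustration only, not the Diagram Builder's actual implementation: the `SCRIPTS` records, their field names, and the helper functions `render_tree` and `leaves` are assumptions made for this example, and the sketch assumes the workflow contains no circular dependencies.

```python
# Hypothetical records, one per code script, listing what each script reads and writes.
SCRIPTS = [
    {"file_name": "formatting_table1.R",
     "inputs": ["output1_part1.txt", "output1_part2.txt"],
     "outputs": ["table 1"]},
    {"file_name": "output_table1.do",
     "inputs": ["analysis_data01.csv"],
     "outputs": ["output1_part1.txt"]},
    {"file_name": "output_table2.do",
     "inputs": ["analysis_data02.csv"],
     "outputs": ["output1_part2.txt"]},
    {"file_name": "data_cleaning01.R",
     "inputs": ["survey_01raw.csv"],
     "outputs": ["analysis_data01.csv"]},
    {"file_name": "data_cleaning02.R",
     "inputs": ["admin_01raw.csv"],
     "outputs": ["analysis_data02.csv"]},
]


def render_tree(target, scripts, depth=0):
    """Return indented lines tracing `target` back to files that no script produces."""
    lines = ["    " * depth + target]
    for script in scripts:
        if target in script["outputs"]:
            lines.append("    " * (depth + 1) + "[code] " + script["file_name"])
            for source in script["inputs"]:
                lines.extend(render_tree(source, scripts, depth + 2))
    return lines


def leaves(target, scripts):
    """Return the files at the bottom of the tree (no script lists them as an output)."""
    producers = [s for s in scripts if target in s["outputs"]]
    if not producers:
        return [target]
    found = []
    for script in producers:
        for source in script["inputs"]:
            found.extend(leaves(source, scripts))
    return found


tree_lines = render_tree("table 1", SCRIPTS)
print("\n".join(tree_lines))
print(leaves("table 1", SCRIPTS))
```

In this sketch the tree counts as complete when every leaf is a raw data file, which is the case here: the two leaves are survey_01raw.csv and admin_01raw.csv.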

9.1.2 Incomplete reproduction tree

9.1.2.1 Raw data and analytic data are available, but cleaning code is missing.

Below is an example of output from the Diagram Builder for a display item that is missing some of the code needed to generate it from the raw data. There are two reasons to suspect that this workflow is incomplete: (i) there is no clear data cleaning step (only analysis that generates output and formatting), and (ii) there are unused files that are likely to be raw data. Neither reason confirms unequivocally that the tree is incomplete, but a reproducer familiar with the paper and its data sources could use the tree to certify its (in)completeness and request missing files.

          table 1
            └───[code] formatting_table1.R
                ├───output1_part1.txt  
                |   └───[code] output_table1.do           
                |       └───[data] analysis_data01.csv
                └───output1_part2.txt  
                    └───[code] output_table2.do           
                        └───[data] analysis_data02.csv

           Unused files: 
           - survey_01raw.csv
           - admin_01raw.csv

Reproducers are asked to speculate on where the missing files might go, and hence propose what a complete tree might look like (where possible). For this example, we have assumed there are missing code scripts that at some point take in survey_01raw.csv and admin_01raw.csv, and eventually output analysis_data01.csv and analysis_data02.csv, though this requires the reproducer’s discretion.

          table 1
            └───[code] formatting_table1.R
                ├───output1_part1.txt  
                |   └───[code] output_table1.do           
                |       └───[data] analysis_data01.csv
                |          └───[code] MISSING FILE(S)
                |             └───[data] survey_01raw.csv
                └───output1_part2.txt  
                    └───[code] output_table2.do           
                        └───[data] analysis_data02.csv
                           └───[code] MISSING FILE(S)
                              └───[data] admin_01raw.csv
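
The two warning signs discussed in this section can be checked mechanically before a reproducer applies their own judgement. The sketch below is a hedged illustration, not part of the Diagram Builder: the record layout, the `RAW_DATA` and `ANALYTIC_DATA` lists, and both function names are assumptions introduced for this example.

```python
# Hypothetical records for the incomplete package in this section.
SCRIPTS = [
    {"file_name": "formatting_table1.R",
     "inputs": ["output1_part1.txt", "output1_part2.txt"],
     "outputs": ["table 1"]},
    {"file_name": "output_table1.do",
     "inputs": ["analysis_data01.csv"],
     "outputs": ["output1_part1.txt"]},
    {"file_name": "output_table2.do",
     "inputs": ["analysis_data02.csv"],
     "outputs": ["output1_part2.txt"]},
]
# Assumed contents of the raw data and analytic data lists.
RAW_DATA = ["survey_01raw.csv", "admin_01raw.csv"]
ANALYTIC_DATA = ["analysis_data01.csv", "analysis_data02.csv"]


def unused_files(scripts, data_files):
    """Data files that no script lists as an input or an output."""
    used = set()
    for script in scripts:
        used.update(script["inputs"])
        used.update(script["outputs"])
    return [f for f in data_files if f not in used]


def unproduced_analytic_data(scripts, analytic_data):
    """Analytic data files that no script in the package creates."""
    produced = {out for script in scripts for out in script["outputs"]}
    return [f for f in analytic_data if f not in produced]


unused = unused_files(SCRIPTS, RAW_DATA + ANALYTIC_DATA)
orphans = unproduced_analytic_data(SCRIPTS, ANALYTIC_DATA)
print("Unused files:", unused)
print("Analytic data with no generating script:", orphans)
```

Both lists being non-empty suggests, but does not prove, that cleaning scripts are missing; deciding where the placeholders go still requires the reproducer's discretion.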

9.1.3 Unused data sources

It is possible that not all data included in a reproduction package are actually used by the code scripts in that package. This would be the case if, for example, the raw data and analysis data are included, but not the script that generates the analysis data. As a concrete example, consider what the original diagram above would look like if the only code included in the reproduction package were analysis.R:

        table1.tex
            |___[code] analysis.R
                |___analysis_data.dta

        Unused data sources:
        raw_1.dta
        raw_2.dta
        raw_3.dta
        raw_4.dta

        Unused analysis data:
        cleaned_1.dta
        cleaned_2.dta
        cleaned_3.dta
        cleaned_4.dta
        merged_1_2.dta
        merged_3_4.dta
        cleaned_1_2.dta
        cleaned_3_4.dta

In this case, many data files that were listed in the raw data and analytic data spreadsheets are not used by any code script in the reproduction package.

9.1.4 Final output is not a display item

9.2 Examples from real reproduction attempts

9.2.1 Possibly missing code for producing a display item².

This reproduction diagram fragment likely shows a missing piece of code. In a complete reproduction package there are no unused files and all final outputs are display items. cps2018.dta and cps_march2017.dta are included as inputs in the reproduction package, but never used to make a display item, i.e., there is no piece of code that is listed as using them as inputs. Likewise, PublicSalary.dta is listed as a final output, meaning it, too, is not used to make a display item since it is not listed as an input for any code script. Perhaps Paper2_PoExitDataset.dta is made by a missing code script that takes PublicSalary.dta, cps2018.dta, and cps_march2017.dta as inputs? It may be the case that the only way to find out for sure is to contact the study author(s), in which case such a diagram can be used to help them identify any missing files.

        PublicSalary.dta
        |___MakeData2.do
            |___Paper2_ProviderDataset.dta
            |___Paper2_SSPDataset_wIRT_final.dta
            |___PublicFacilitySurvey_clean.dta

        Table1.xml
        |___Table1.do
            |___ProviderData.dta
            |   |___MakeData4.do
            |       |___Paper2_PoExitDataset.dta
            |       |___Paper2_ProviderDataset.dta
            |       |___Paper2_SSPDataset_wIRT_final.dta
            |___VillageDataset.dta
            |   |___MakeData8.do
            |       |___Paper2_HouseholdDataset1.dta
            |       |___Paper2_VillageDataset.dta
            |___HouseholdDataset.dta
                |___MakeData8.do
                    |___Paper2_HouseholdDataset1.dta
                    |___Paper2_VillageDataset.dta

        Unused data sources:
        cps2018.dta
        cps_march2017.dta
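
The rule that all final outputs should be display items can be expressed as a simple check. The sketch below is illustrative only: the records are read off the fragment above, while the `DISPLAY_ITEMS` list and the function name `final_outputs` are assumptions made for this example rather than features of the Diagram Builder.

```python
# Records read off the diagram fragment above (inputs and outputs per script).
SCRIPTS = [
    {"file_name": "MakeData2.do",
     "inputs": ["Paper2_ProviderDataset.dta", "Paper2_SSPDataset_wIRT_final.dta",
                "PublicFacilitySurvey_clean.dta"],
     "outputs": ["PublicSalary.dta"]},
    {"file_name": "MakeData4.do",
     "inputs": ["Paper2_PoExitDataset.dta", "Paper2_ProviderDataset.dta",
                "Paper2_SSPDataset_wIRT_final.dta"],
     "outputs": ["ProviderData.dta"]},
    {"file_name": "MakeData8.do",
     "inputs": ["Paper2_HouseholdDataset1.dta", "Paper2_VillageDataset.dta"],
     "outputs": ["VillageDataset.dta", "HouseholdDataset.dta"]},
    {"file_name": "Table1.do",
     "inputs": ["ProviderData.dta", "VillageDataset.dta", "HouseholdDataset.dta"],
     "outputs": ["Table1.xml"]},
]
# Assumption: Table1.xml is the only display item in this fragment.
DISPLAY_ITEMS = ["Table1.xml"]


def final_outputs(scripts):
    """Files that some script writes and that no script reads."""
    consumed = {f for script in scripts for f in script["inputs"]}
    result = []
    for script in scripts:
        for out in script["outputs"]:
            if out not in consumed and out not in result:
                result.append(out)
    return result


stray = [f for f in final_outputs(SCRIPTS) if f not in DISPLAY_ITEMS]
print("Final outputs that are not display items:", stray)
```

A non-empty list, here containing PublicSalary.dta, is a prompt to look for a missing script or to contact the authors, not proof that something is missing.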

9.2.2 Long, complicated tree³.

This diagram shows that the production of a given display item may be very complicated, highlighting the usefulness of the Diagram Builder as a visualization tool. In reproducing such a complicated display item, it can be useful to have such a diagram to determine, for example, the order in which code scripts should be run, what files might depend on a faulty code script, or which files are necessary to keep if the goal is to only produce the specific display item.
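
Two of the uses mentioned above, finding a valid run order and finding the files that depend on a faulty script, can be sketched with Python's standard library. Since the tree for this example is too large to reproduce here, the sketch reuses the stylized package from Section 9.1.1; the record layout and the function names `run_order` and `affected_files` are assumptions for illustration, not the Diagram Builder's interface.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical records: script name -> files it reads and writes.
SCRIPTS = {
    "formatting_table1.R": {"inputs": ["output1_part1.txt", "output1_part2.txt"],
                            "outputs": ["table 1"]},
    "output_table1.do": {"inputs": ["analysis_data01.csv"],
                         "outputs": ["output1_part1.txt"]},
    "output_table2.do": {"inputs": ["analysis_data02.csv"],
                         "outputs": ["output1_part2.txt"]},
    "data_cleaning01.R": {"inputs": ["survey_01raw.csv"],
                          "outputs": ["analysis_data01.csv"]},
    "data_cleaning02.R": {"inputs": ["admin_01raw.csv"],
                          "outputs": ["analysis_data02.csv"]},
}


def run_order(scripts):
    """Order scripts so that each runs after the scripts that create its inputs."""
    producer = {out: name for name, s in scripts.items() for out in s["outputs"]}
    # Map each script to the set of scripts it depends on.
    graph = {
        name: {producer[f] for f in s["inputs"] if f in producer}
        for name, s in scripts.items()
    }
    return list(TopologicalSorter(graph).static_order())


def affected_files(faulty, scripts):
    """Files that directly or indirectly depend on the output of `faulty`."""
    affected = set(scripts[faulty]["outputs"])
    changed = True
    while changed:
        changed = False
        for s in scripts.values():
            if affected & set(s["inputs"]) and not set(s["outputs"]) <= affected:
                affected |= set(s["outputs"])
                changed = True
    return affected


order = run_order(SCRIPTS)
print(order)
print(affected_files("data_cleaning01.R", SCRIPTS))
```

The run order is generally not unique; any order that places each script after the scripts producing its inputs is acceptable.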

9.2.3 Example of completing reproduction tree

In many cases, some of the components of the workflow will be missing from the reproduction package or will not be easily identifiable. Here we present a more complex example than the one presented in the Assessment chapter. In such cases, the Diagram Builder will return a partial reproduction tree diagram. For example, if the files merge_1_2.do, merge_3_4.do, and final_merge.do are missing from the previous diagram, the Diagram Builder will produce the following diagram:

        cleaned_3.dta
            └──[code] clean_raw_3.py
                └──raw_3.dta

        table1.tex
            └──[code] analysis.R
                └──analysis_data.dta

        cleaned_3_4.dta
            └──[code] clean_merged_3_4.do
                └──merged_3_4.dta

        cleaned_1.dta
            └──[code] clean_raw_1.py
                └──raw_1.dta

        cleaned_2.dta
            └──[code] clean_raw_2.py
                └──raw_2.dta

        cleaned_4.dta
            └──[code] clean_raw_4.py
                └──raw_4.dta

        cleaned_1_2.dta
            └──[code] clean_merged_1_2.do
                └──merged_1_2.dta
        Unused data sources: None.

In this case, you can still manually combine this partial information with your knowledge of the paper and your own judgement to produce a “candidate” tree diagram (which means that different reproducers might arrive at different diagrams). This may look like the following:

        table1.tex
            └──[code] analysis.R
                └──analysis_data.dta
                    └──MISSING_CODE_FILE_3
                        └──cleaned_3_4.dta
                        |       └──[code] clean_merged_3_4.do
                        |           └──merged_3_4.dta
                        |               └──MISSING_CODE_FILE_2
                        |                   └──cleaned_3.dta
                        |                   |       └──[code] clean_raw_3.py
                        |                   |           └──raw_3.dta    
                        |                   └──cleaned_4.dta
                        |                           └──[code] clean_raw_4.py
                        |                               └──raw_4.dta
                        └──cleaned_1_2.dta
                                └──[code] clean_merged_1_2.do
                                    └──merged_1_2.dta
                                        └──MISSING_CODE_FILE_1
                                            └──cleaned_1.dta
                                            |       └──[code] clean_raw_1.py
                                            |           └──raw_1.dta
                                            |   
                                            └──cleaned_2.dta
                                                    └──[code] clean_raw_2.py
                                                        └──raw_2.dta

To leave a record of the reconstructed diagrams, you will have to amend the input spreadsheets using placeholders for the missing components. In the example above, you should add the following entries to the code description spreadsheet:

Table 9.1: Adding rows to code spreadsheet
file_name	location	inputs	outputs	description	primary_type
…	…	…	…	…	…
missing_file1	unknown	cleaned_1.dta, cleaned_2.dta	merged_1_2.dta	missing code	unknown
missing_file2	unknown	cleaned_3.dta, cleaned_4.dta	merged_3_4.dta	missing code	unknown
missing_file3	unknown	merged_3_4.dta, merged_1_2.dta	analysis_data.dta	missing code	unknown
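
If the code spreadsheet is kept as a delimited text file, the placeholder rows above could be generated programmatically. The following is a minimal sketch under that assumption: the comma-separated format and the helper name `placeholder_row` are hypothetical, while the column names and values are taken from Table 9.1.

```python
import csv
import io

# Column names as they appear in Table 9.1.
COLUMNS = ["file_name", "location", "inputs", "outputs", "description", "primary_type"]


def placeholder_row(number, inputs, outputs):
    """Build one spreadsheet row describing a missing code file."""
    return {
        "file_name": f"missing_file{number}",
        "location": "unknown",
        "inputs": ", ".join(inputs),
        "outputs": ", ".join(outputs),
        "description": "missing code",
        "primary_type": "unknown",
    }


rows = [
    placeholder_row(1, ["cleaned_1.dta", "cleaned_2.dta"], ["merged_1_2.dta"]),
    placeholder_row(2, ["cleaned_3.dta", "cleaned_4.dta"], ["merged_3_4.dta"]),
    placeholder_row(3, ["merged_3_4.dta", "merged_1_2.dta"], ["analysis_data.dta"]),
]

# Write to an in-memory buffer; a real file opened with newline="" would work the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Fields that list several files contain commas, so the `csv` module quotes them automatically when writing.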

² This is from a reproduction attempt conducted as part of a UC Berkeley Development Economics course.
³ This is from a reproduction attempt conducted as part of a UC Berkeley Development Economics course.