Chapter 5 Checking for Robustness

Once you have assessed and, where possible, improved the computational reproducibility of the display items for a claim within a paper, you can determine the robustness of those results by modifying some analytic choices and reporting the resulting effects on the estimates of interest, i.e., by conducting robustness checks. The universe of robustness checks can be very large (potentially infinite!) and spans both data cleaning and data analysis. The SSRP distinguishes between reasonable and feasible robustness checks.

Reasonable robustness checks (Simonsohn et al., 2018) are (i) sensible tests of the research question, (ii) expected to be statistically valid, and (iii) not redundant with other specifications in the set. The set of feasible robustness checks comprises all the specifications that can be computationally reproduced. We assume that the specifications already published in the paper are part of the set of reasonable specifications.

Figure 5.1: Universe of robustness tests and its elements

The size of the set of feasible robustness checks, and the likelihood that it contains reasonable specifications, will depend on the current level of reproducibility of the results that support a claim. The idea is illustrated in Figure 5.1. At Levels 1-2, it is impossible to perform additional robustness checks because there are no data to work with. It may be possible to perform additional robustness checks for claims supported by display items reproducible at Levels 3-4, but not using the estimates declared in Scoping (because the display items are not computationally reproducible from analysis data). It is possible to conduct additional robustness checks to validate the core conclusions of a claim based on a display item reproducible at Level 5. Finally, claims associated with display items reproducible at Level 6 or higher allow for robustness checks that involve variable definitions and other types of analytical choices.

The number of feasible robustness checks grows exponentially with improved reproducibility. For example, once variable definitions can be modified, you could test alternative definitions of a key variable on their own and, for each alternative definition, also vary the main estimating specification, multiplying the number of possible specifications.
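
As a rough illustration of this combinatorial growth (using hypothetical choice names, not drawn from any particular paper), the set of feasible specifications is the cross-product of the options available for each modifiable analytical choice:

    # Minimal sketch in R: the feasible set is the cross-product of the options
    # available for each analytical choice (all names below are hypothetical).
    choice_options <- list(
      outcome_definition = c("income_total", "income_wages_only"),
      sample             = c("full", "males_only", "exclude_outliers"),
      controls           = c("baseline", "baseline_plus_region"),
      standard_errors    = c("classical", "clustered")
    )

    # Enumerate every combination: 2 x 3 x 2 x 2 = 24 specifications
    # from just four analytical choices.
    specifications <- expand.grid(choice_options, stringsAsFactors = FALSE)
    nrow(specifications)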

Robustness is assessed at the claim level (see the diagram of a typical paper’s components in 0.3). There may be several specifications presented in the paper for a given claim, one of which the authors have chosen as the main or preferred specification (or, if the authors did not indicate one, you choose it). Identify which display item contains this specification and refer to the reproduction tree to identify the code files in which you can modify a computational choice. Using the example tree discussed in the Assessment stage, we obtain the following (data files removed for simplicity). This simplified tree lists the files in which you could test different specifications:

        table1.tex (contains preferred specification of a given claim)
            |___[code] analysis.R
                    |___[code] final_merge.do
                            |___[code] clean_merged_1_2.do
                            |       |___[code] merge_1_2.do
                            |               |___[code] clean_raw_1.py
                            |               |___[code] clean_raw_2.py
                            |___[code] clean_merged_3_4.do
                                    |___[code] merge_3_4.do
                                            |___[code] clean_raw_3.py
                                            |___[code] clean_raw_4.py

Here we suggest two types of contributions to robustness checks: (1) increasing the number of feasible robustness checks by identifying key analytical choices in code scripts and (2) justifying and testing reasonable specifications within the set of feasible checks. Both contributions should be recorded on the SSRP Platform and be linked to specific files in the reproduction package.

5.1 Feasible robustness checks: increasing the number of feasible specifications

Increasing the number of feasible robustness checks requires identifying the specific line(s) in the code scripts that execute an analytical choice. An advantage of this type of contribution is that you do not need in-depth knowledge of the paper and its methodology to contribute. This allows you to potentially map several code files, achieve a broader understanding of the paper, and build on top of others’ work. The disadvantage is that you are not expected to test and justify the reasonableness of alternative specifications.

Analytical choices arise in both data cleaning and data analysis. Below are some proposed types for each category; each list is followed by a brief illustrative sketch.

Analytical choices in data cleaning code

  • Variable definition
  • Data sub-setting
  • Data re-shaping (merge, append, long/gather, wide/spread)
  • Others (specify as “processing - other”)
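
To make these categories concrete, here is a minimal R sketch of a cleaning step (all data and variable names are hypothetical; in the example reproduction package above, the cleaning scripts are written in Stata and Python, but the choices are analogous):

    library(dplyr)
    library(tidyr)

    # Toy data standing in for a raw input file (all names are hypothetical).
    raw_data <- data.frame(
      id      = 1:4,
      gender  = c(1, 1, 2, 2),
      wages   = c(30, 45, 50, 28),
      score_1 = c(0.2, 0.5, 0.1, 0.9),
      score_2 = c(0.3, 0.4, 0.6, 0.8)
    )

    cleaned <- raw_data %>%
      # Variable definition: income could instead add capital gains or gifts.
      mutate(income = wages) %>%
      # Data sub-setting: the sample could instead keep all respondents.
      filter(gender == 1) %>%
      # Data re-shaping: long vs. wide format is itself an analytical choice.
      pivot_longer(cols = starts_with("score_"),
                   names_to = "wave", values_to = "score")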

Analytical choices in analysis code

  • Regression function (link function)
  • Key parameters (tuning, tolerance parameters, and others)
  • Controls
  • Adjustment of standard errors
  • Choice of weights
  • Treatment of missing values
  • Imputations
  • Other (specify as “methods - other”)
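
A corresponding minimal R sketch illustrates several of these choices in analysis code (hypothetical data and variable names; the published specification could differ on any of these lines):

    library(sandwich)  # cluster-robust variance estimators
    library(lmtest)    # coeftest()

    set.seed(123)
    df <- data.frame(
      y      = rnorm(200),
      x      = rnorm(200),
      age    = sample(18:65, 200, replace = TRUE),
      income = rexp(200, rate = 0.01),
      state  = sample(letters[1:10], 200, replace = TRUE),
      w      = runif(200)
    )

    # Regression function: OLS here; a logit or Poisson link is an alternative.
    # Controls: age and income here; education or region could be added.
    # Choice of weights: weighted by w here; unweighted is an alternative.
    fit <- lm(y ~ x + age + income, data = df, weights = w)

    # Adjustment of standard errors: cluster by state rather than classical SEs.
    coeftest(fit, vcov = vcovCL(fit, cluster = df$state))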

To record a specific analytical choice on the SSRP, please follow these steps:

  1. Review a specific code file (e.g., clean_merged_1_2.do) and identify an analytical choice (e.g., regress y x if gender == 1).

  2. Record the file name, line number, reproduction package (original or the name of a revised version), choice type, and choice value. For the source field, type “original” whenever the analytical choice is identified for the first time, and the line number each time the same analytical choice is applied thereafter (for example, if an analytical choice is identified for the first time in line #103 and for the second time in line #122, their respective values for the source field should be “original” and “103”). For each analytical choice recorded, add the specific choice used in the paper and, optionally, describe what alternatives could have been used. The resulting database would look like this:

entry_id | file_name  | line_number | choice_type         | choice_value                   | choice_range                   | source
1        | code_01.do | 73          | data sub-setting    | males                          | males, females                 | original
2        | code_01.do | 122         | variable definition | income = wages + capital gains | wages, capital gains, gifts    | code_01.do-L103
3        | code_05.R  | 143         | controls            | age, income, education         | age, income, education, region | original
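
Outside the platform, a lightweight way to keep such a log is a plain CSV. A minimal R sketch (the field names mirror the table above; the file names and line numbers are illustrative, not taken from a real reproduction package) might look like this:

    # Record analytical choices in a data frame and save them as a CSV log.
    choices_log <- data.frame(
      entry_id     = c(1, 2, 3),
      file_name    = c("code_01.do", "code_01.do", "code_05.R"),
      line_number  = c(73, 122, 143),
      choice_type  = c("data sub-setting", "variable definition", "controls"),
      choice_value = c("males",
                       "income = wages + capital gains",
                       "age, income, education"),
      choice_range = c("males, females",
                       "wages, capital gains, gifts",
                       "age, income, education, region"),
      source       = c("original", "code_01.do-L103", "original"),
      stringsAsFactors = FALSE
    )

    write.csv(choices_log, "analytical_choices.csv", row.names = FALSE)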

5.2 Justifying and testing reasonable robustness checks

Justifying and testing a specific analytical choice involves identifying a feasible analytical choice, conducting a variation on it, and justifying the reasonableness of that variation. The advantage of this approach is that it allows for an in-depth inspection of a specific section of the paper. Its main limitation is that justifying sensibility and validity (and, to an extent, non-redundancy) requires a deeper understanding of the paper’s topic and methods. As a result, undergraduate or graduate students with only a surface-level interest in the paper (or limited time) may find it challenging to conduct this part of the reproduction.

When performing a specific robustness check, follow these steps (a worked sketch follows the list):

  1. Eventually, you will be able to search the SSRP database of feasible robustness checks (discussed above) and record the identifier(s) corresponding to the analytical choice(s) of interest (entry_id); for now, you can skip this step and create an entry ID by simply numbering the feasible specifications in the order in which you record them.

  2. Propose a specific variation to this analytical choice.

  3. Discuss whether you think this variation is sensible, specifically in the context of the claim tested (e.g., does it make sense to include or exclude low-income Hispanic people from the sample when assessing the impact of a large wave of new immigrants?).

  4. Discuss how this variation could affect the validity of the results (e.g., likely effects on omitted variable bias, measurement error, or the Local Average Treatment Effect for the underlying population).

  5. Confirm that the test is not redundant with other tests in the paper or in the robustness exercise.

  6. Report the result of the robustness check (new estimate, standard error, and units).
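
As a worked sketch of steps 2 and 6 (proposing a variation and reporting its result), with entirely hypothetical data and variable names standing in for a claim’s preferred specification, a single robustness check could look like this in R:

    # Vary a single analytical choice -- here, a data sub-setting rule -- and
    # report the new estimate and standard error alongside the original.
    set.seed(42)
    df <- data.frame(
      y          = rnorm(500),
      x          = rnorm(500),
      low_income = rbinom(500, 1, 0.3)
    )

    original  <- lm(y ~ x, data = df)                          # as published
    variation <- lm(y ~ x, data = subset(df, low_income == 0)) # alternative sample

    report <- data.frame(
      specification = c("original", "exclude low-income households"),
      estimate      = c(coef(original)["x"], coef(variation)["x"]),
      std_error     = c(summary(original)$coefficients["x", "Std. Error"],
                        summary(variation)$coefficients["x", "Std. Error"]),
      units         = "outcome units per one-unit change in x"
    )
    print(report)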