Chapter 10 Definitions

10.1 Concepts in reproducibility

  • Analytic data – Data used as the final input in a workflow in order to produce a statistic displayed in the paper (including appendices).
  • Claim (concept) – A major hypothesis in a paper, whose results are presented in one or more display items. [ALEKS/FERNANDO]
    • Causal claim – An assertion that invokes causal relationships between variables. A paper may estimate the effect of X on Y for population P, using method F. Example: “This paper investigates the impact of bicycle provision on secondary school enrollment among young women in Bihar/India, using a Difference in Difference approach.”
    • Descriptive/predictive claim – An assertion that estimates or predicts the value of Y for population P along dimensions X using method M. Example: “Drawing on a unique Swiss data set (population P) and exploiting systematic anomalies in countries’ portfolio investment positions (method M), I find that around 8% of the global financial wealth of households is held in tax havens (value of Y).”
  • Coding error – A coding error occurs when a section of the code in the reproduction package executes a procedure that directly contradicts the intended procedure described in the documentation (the paper or comments in the code). For example, an error occurs if the paper specifies that the analysis is performed on the population of males, but the code restricts the analysis to females only. Please follow the ACRE procedure to report coding errors. [ALEKS/FERNANDO]
  • Data availability statement – A description, normally included in the paper, of the terms of use for data used in the paper, as well as the procedure to obtain the data (especially important for restricted-access data). Data availability statements expand on and complement data citations. Find guidance on data availability statements for reproducibility here.
  • Data citation – The practice of citing a dataset, rather than just the paper in which a dataset was used. This helps other researchers find data, and rewards researchers who share data. Find guidance on data citation here.
  • Data sharing – Making the data used in an analysis widely available to others, ideally through a trusted public repository/archive.
  • Disclosure – In addition to publicly declaring all potential conflicts of interest, researchers should detail all the ways in which they test a hypothesis, e.g., by including the outcomes of all regression specifications tested. This can be presented in appendices or supplementary material if room is limited in the body of the text.
  • Intermediate data – Data not directly used as final input for analyses presented in the final paper (including appendices). Intermediate data should not contain direct identifiers.
  • Literate programming – Writing code to be read and easily understood by a human. This best practice can make a researcher’s code more easily reproducible.
  • Pre-specification – The act of detailing the method of analysis before actually beginning data analysis.
  • Processed data – Raw data that have gone through any transformation other than the removal of PII.
  • Raw data – Unmodified data files obtained by the authors from the sources cited in the paper. Data from which personally identifiable information (PII) has been removed are still considered raw. All other modifications to raw data make it processed.
  • (Trial) registry – A database of registered studies or trials, for example the AEA RCT Registry or clinicaltrials.gov. Some of the largest registries only accept randomized trials, hence the frequent discussion of ‘trial registries’. Registration is the act of publicly declaring that a hypothesis is being, has been, or will be tested, regardless of publication status. Registrations are time-stamped.
  • Replication – Conducting an existing research project again. The taxonomy is subtle and contested, as explained in Hamermesh, 2007 and Clemens, 2015. Pure Replication, Reproduction, or Verification entails re-running existing code, with error-checking, on the original dataset to check if the published results are obtained. Scientific Replication entails attempting to reproduce the published results with a new sample, either with the same code or with slight variations on the original analysis.
  • Reproducibility – A research paper, or a specific display item (an estimate, a table, or a graph) included in a research paper, is reproducible if it is possible to reproduce it within a reasonable margin of error (generally 10%) using the data, code, and materials made available by the author. Computational reproducibility is assessed through the process of reproduction.
  • Reproduction package – A collection of all the materials associated with the reproduction of a paper. A reproduction package may contain data, code, and documentation. When the materials are provided with the original publication, they are labeled the ‘original reproduction package’; when they are provided by a previous reproducer, they are referred to as ‘reproducer X’s reproduction package’. At this point you are only assessing the existence of one (or more) reproduction packages, not the quality of their contents.
  • Researcher degrees of freedom – The flexibility a researcher has in data analysis, whether consciously abused or not. This can take a number of forms, including specification searching, covariate adjustment, or selective reporting.
  • Robustness check – Any possible change in a computational choice, in either data analysis or data cleaning, and its subsequent effect on the main estimates of interest. In the context of ACRE, the focus should be on the set of reasonable specifications (Simonsohn et al., 2018), defined as (1) sensible tests of the research question, (2) expected to be statistically valid, and (3) not redundant with other specifications in the dataset.
  • Reasonable specification – [ALEKS/FERNANDO]
  • Specification – [ALEKS/FERNANDO]
  • Specification searching – Searching blindly or repeatedly through data to find statistically significant relationships. While not inherently wrong, if done without a plan or without adjusting for multiple hypothesis testing, test statistics no longer hold their traditional meaning; the practice can produce false positives and thus impede replicability.
  • Trusted digital repository – An online platform where data can be stored such that it cannot easily be manipulated and will remain available for the foreseeable future. Storing data in such a repository is superior to simply posting it on a personal website, since the data are more easily accessed, less easily altered, and more permanent.
  • Version control – The act of tracking every change made to a computer file. This is quite useful for empirical researchers who may edit their programming code often.
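Several of the definitions above (reproducibility, coding error) turn on whether a reproduced estimate matches the published one within a reasonable margin of error, generally 10%. A minimal sketch of such a check in Python follows; the relative-error formula, the function name, and the zero-value fallback are illustrative assumptions, not part of the ACRE procedure:

```python
def within_margin(published, reproduced, margin=0.10):
    """Return True if the reproduced estimate falls within the given
    relative margin of error of the published estimate.

    Relative error here is |reproduced - published| / |published|;
    this formula is an illustrative choice -- the text above only
    specifies 'a reasonable margin of error (generally 10%)'.
    """
    if published == 0:
        # Fall back to absolute error when the published value is zero.
        return abs(reproduced) <= margin
    return abs(reproduced - published) / abs(published) <= margin

# Example: a published coefficient of 0.42 reproduced as 0.45.
print(within_margin(0.42, 0.45))  # relative error ~7.1% -> True
print(within_margin(0.42, 0.50))  # relative error ~19%  -> False
```

A reproducer might run such a check per display item; in practice the margin and formula should follow whatever the reproduction protocol specifies.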

10.2 Concepts in the ACRE exercise and the platform

  • Analysis code – A script associated primarily with analysis. Most of its content is dedicated to actions like running regressions, running hypothesis tests, computing standard errors, and imputing missing values.
  • Candidate paper – A paper that has been considered for reproduction, but the reproducer decided not to move forward with the analysis due to failure to locate a reproduction package. Learn more here.
  • Cleaning code – A script associated primarily with data cleaning. Most of its content is dedicated to actions like deleting variables or observations, merging data sets, removing outliers, or reshaping the structure of the data (from long to wide, or vice versa).
  • Declared paper – The paper that the reproducer analyzes throughout the exercise.
  • Display item – A display item is a figure or table that presents results described in the paper. Each display item contains several specifications. [ALEKS/FERNANDO]
  • Reproduction tree/diagram – A diagram generated by the ACRE Diagram Builder that represents all the available data and code behind a specific display item. The tree is meant to represent the entire computational workflow behind a result from the paper. It allows reproducers to trace a display item to its primary sources. It can also be used to guide users of the reproduction package and/or to identify missing components for a complete reproduction.
  • Revised reproduction package – [ALEKS/FERNANDO]
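The distinction drawn above between cleaning code and analysis code can be sketched as two separate steps in a workflow. The records, variable names, and the split into `clean` and `analyze` functions below are illustrative assumptions in Python, not taken from any actual reproduction package:

```python
import statistics

# Illustrative raw survey records; 'enroll' is a hypothetical outcome.
raw = [
    {"id": 1, "group": "treated", "enroll": 0.6},
    {"id": 2, "group": "control", "enroll": 0.4},
    {"id": 3, "group": "treated", "enroll": None},  # missing outcome
    {"id": 4, "group": "control", "enroll": 0.5},
    {"id": 5, "group": "treated", "enroll": 0.8},
]

def clean(records):
    """Cleaning code: drop observations with a missing outcome."""
    return [r for r in records if r["enroll"] is not None]

def analyze(records):
    """Analysis code: difference in mean outcomes between groups."""
    treated = [r["enroll"] for r in records if r["group"] == "treated"]
    control = [r["enroll"] for r in records if r["group"] == "control"]
    return statistics.mean(treated) - statistics.mean(control)

analytic_data = clean(raw)  # the 'analytic data' in the sense of 10.1
print(round(analyze(analytic_data), 2))  # prints 0.25
```

Keeping these steps in separate, clearly labeled scripts makes the reproduction tree easier to draw, since each display item can be traced through the analysis code back to the cleaning code and raw data.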