Reproducible Research

Why Programming or Coding?

There are many fancy answers to this question. But the key idea is that you want to be lazy about repetitive tasks (MBAs call it being “productive”).

Many tasks, such as data cleaning, wrangling, visualization, and statistical analysis, must be done many times. Moreover, you want to be able to reproduce and replicate your thinking about all of these tasks on many different datasets, and sometimes on the same dataset after some time has passed.

Coding is about formalizing your thinking about how you treat the data and automating the formalization task to be done repetitively. It improves efficiency, enhances reproducibility, and boosts creativity when it comes to finding new patterns in your data.

Benchmarks for reproducible data and statistical analyses:1

  1. Accuracy: Write code that reduces the chances of making an error and lets you catch one if it occurs.
  2. Efficiency: If you are doing something twice, recognize the pattern in your decision-making and formalize it in your code. This is the key difference between point-and-click tools like Excel and coding.
  3. Replicate-and-Reproduce: The ability to repeat the computational process that reflects your thinking and the decisions you took along the way. This improves transparency and forces you to be deliberate and responsible about choices during analyses.
  4. Human Interpretability: Writing code is not just about analyzing, but about allowing yourself, and then others, to understand your analytic choices.
  5. Public Good: Research is a public good, and code allows your research to be truly accessible. This means you write code that anyone else who understands the language can read, reuse, and recreate without you being present. We ensure that by writing readable and, ideally, publicly accessible code.

Guidelines

The article “Ten Simple Rules for Reproducible Computational Research” by Sandve et al. (2013) provides guidelines to ensure that computational research is reproducible, transparent, and robust. Here’s a summary of the key points:

| Rule | Description | Notes |
|---|---|---|
| Documentation | Track how results are produced, including all steps in the analysis workflow. | Keep short notes on results |
| Automation | Minimize manual data manipulation by using scripts and documenting any manual changes. | Make changes to raw data in your scripts |
| Version Control | Use version control systems for all custom scripts to track changes and maintain reproducibility. | Use GitHub |
| Comprehensive Records | Archive all versions of external programs used, all intermediate results, and exact observation conditions. | Keep notes about data in comments |
| Accessibility | Make raw data, scripts, and results publicly accessible to enhance transparency and replication. | Maintain a good workflow |

Annotating Code

A comment should explain the purpose of a command, not merely describe what it does.

In R, comments are designated by a # (pound) sign. Compare the terse, descriptive comments in the first block below with the purpose-oriented comments in the second.

```{r}
x <- rnorm(100)  # generating data
y <- x + rnorm(100, mean=50, sd=0.1)  # creating y
plot(x, y)  # plotting x against y
m <- lm(y ~ x)  # linear model
summary(m)  # summary of model
```
```{r}
# Generate a sample of 100 random numbers from a standard normal distribution
x <- rnorm(100)

# Create a dependent variable 'y' with a strong linear relationship plus small random noise
y <- x + rnorm(100, mean=50, sd=0.1)

# Plot 'x' against 'y' to visualize the relationship
plot(x, y, main="Scatterplot of Y against X", xlab="X variable", ylab="Y variable")

# Fit a linear regression model to predict 'y' based on 'x'
model <- lm(y ~ x)

# Display a detailed summary of the regression model
summary(model)

```


In Stata, comments may be added to programs in three ways:

* begin the line with `*`;
* begin the comment with `//`;
* place the comment between `/*` and `*/` delimiters.

```{stata}
sysuse auto
gen z = price + weight
regress z mpg

```
```{stata}
* Load the built-in 'auto' dataset shipped with Stata
sysuse auto, clear

/* Generate a new variable 'z' representing
   the sum of 'price' and 'weight' */
gen z = price + weight

// Regress 'z' on 'mpg' to examine how mileage relates to the new variable
regress z mpg
```


File Management and Workflow

Understanding Absolute and Relative Paths

When working with files in any programming environment, paths specify the location of files and folders. These paths can be absolute or relative, and the choice between them significantly impacts reproducibility, portability, and ease of collaboration.

Absolute Paths

An absolute path provides the complete address of a file or folder, starting from the root directory of the file system. It tells the software exactly where to find a file, regardless of where the script is run.

Example: C:/Users/YourName/Documents/Project/Data/raw_data.dta

Relative Paths

A relative path specifies the location of a file or folder relative to a “base directory” (e.g., the project’s working directory). It does not start from the root directory but instead is resolved based on the location of the script.

Suppose your working directory is set to: C:/Users/YourName/Documents/Project

Then, a relative path might look like: Data/raw_data.dta

Practical Analogy

Think of absolute and relative paths like giving directions to a house:

* Absolute Path: “Go to the main city square, then take the highway north, turn right at the first traffic light, and find the house at 123 Main Street.” Works for people starting anywhere, but requires detailed instructions specific to the city.
* Relative Path: “From the library, walk two blocks north, then turn left. The house is the second one on the right.” Simpler and context-aware, but assumes everyone starts from the library.

Key Differences Between Absolute and Relative Paths

| Feature | Absolute Path | Relative Path |
|---|---|---|
| Starting Point | Starts from the root directory of the file system. | Starts from the current working directory. |
| Portability | Not portable: specific to the user’s system. | Highly portable: adapts to different systems. |
| Ease of Sharing | Harder to share; others must update paths. | Easier to share; no changes needed if the folder structure is consistent. |
| Use Case | Best for fixed environments or one-off scripts. | Ideal for collaborative and reproducible projects. |
| Flexibility | Breaks if the file is moved or the system changes. | Adapts as long as the folder structure remains consistent. |
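A minimal R sketch of the contrast (the project path is hypothetical; `file.path()` joins path components portably):

```{r}
# Hypothetical project root on one specific machine
project_root <- "C:/Users/YourName/Documents/Project"

# Absolute path: complete address, tied to this machine
absolute_path <- file.path(project_root, "Data", "raw_data.dta")

# Relative path: resolved against the current working directory,
# so it works on any machine with the same folder structure
relative_path <- file.path("Data", "raw_data.dta")

absolute_path
relative_path
```

Inside an RStudio project (an .RProj file), the working directory defaults to the project root, so a relative path like Data/raw_data.dta resolves correctly for every collaborator.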

Standardized Folder and File Structure

An efficient file and folder management system becomes crucial as we move into serious projects. Both R and Stata make it easy to store all the files associated with a project in a comprehensible folder system. Ideally, you want to create your own folder-management template that you follow across projects. For starters, the folder structure below is the one created for your research project in this course.

You can use the point-and-click functionality on your computer to create this structure, or you can do it programmatically using the scripts in the sub-chapters for Stata and R.

📦 cpe-4980-dataessay
├─ cpe-4980-dataessay.RProj
├─ 000-setup.R
├─ 001-eda.qmd
├─ 002-analysis.qmd
├─ 003-manuscript.qmd
├─ Data
│  ├─ Raw
│  │  ├─ Dataset1
│  │  │  ├─ dataset1.csv
│  │  │  └─ codebook-dataset1.pdf
│  │  └─ Dataset2
│  │     ├─ ...dta
│  │     └─ codebook-dataset2.pdf
│  └─ Clean
│     └─ Merged-df1-df2.csv
├─ Scripts
│  ├─ R-scripts
│  │  ├─ plotting-some-variable.R
│  │  └─ exploring-different-models.R
│  ├─ Stata-Scripts
│  │  └─ seeing-variable-labels.do
│  └─ Python-Scripts
│     └─ scraping-data-from-website.py
└─ Outputs
   ├─ Plots
   │  ├─ ...jpeg
   │  └─ ...png
   ├─ Tables
   │  └─ .csv
   └─ Text
      └─ ...txt
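The folder tree above can also be built programmatically. A minimal R sketch (folder names taken from this course's template; run it from the project root) might look like:

```{r}
# Sub-folders of the project template. recursive = TRUE builds
# intermediate directories, and showWarnings = FALSE makes the
# script safe to re-run.
dirs <- c(
  "Data/Raw/Dataset1", "Data/Raw/Dataset2", "Data/Clean",
  "Scripts/R-scripts", "Scripts/Stata-Scripts", "Scripts/Python-Scripts",
  "Outputs/Plots", "Outputs/Tables", "Outputs/Text"
)
invisible(lapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE))
```

Run this after opening the .RProj file so the working directory is the project root and the folders are created in the right place.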

Key Takeaways for Reproducibility
  1. Any modifications to the raw dataset, such as rescaling a variable, generating new variables, or removing values, should be done in scripts as far as possible. When done externally, note it with comments in the code.

  2. Write well-commented code that explains, in the context of your data analysis, what the commands do and why.

  3. Use relative paths.

  4. Use a standardized folder system.


  1. Inspired by the summary provided in Prof. Aaron Williams’ course on Data Analysis offered at the McCourt School, which is strongly recommended for learning good coding practices in R.↩︎