Day 5 - File Management and Workflow

Software Session

  1. Using R Projects

  2. here package

  3. Folder Structure

  4. Takeaways

R Projects

Until now, we have used setwd() to tell R where to find the files we need. As your work expands, a project will have multiple datasets to load, several subsidiary scripts to run, and multiple outputs to save.

A first-order problem for both file management and reproducibility of code is the use of file paths. Absolute paths, like ~/User/MyName/Documents/....., are cumbersome and hinder reproducibility: every time someone else runs the script, they have to change every file path in the R scripts or .qmd files to locate the datasets and other objects. Saving objects to new locations raises the same issue. The partial fix we have used so far is setwd(), which points R to a working directory so that the paths that follow can be written relative to it (relative paths). But the setwd() call itself still contains an absolute path that has to be edited on every machine.
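To make the difference concrete, here is a minimal sketch in base R. The folder my-project and the file toy.csv are hypothetical and are created inside a temporary directory, so nothing on your system is touched. The same file is read once through an absolute path and once through a relative path.

```r
# Hypothetical project folder and toy file, created in a temporary directory
project_dir <- file.path(tempdir(), "my-project")
dir.create(file.path(project_dir, "Data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(project_dir, "Data", "toy.csv"),
          row.names = FALSE)

# Absolute path: spells out the full location, so it is only valid on this machine
abs_path <- file.path(project_dir, "Data", "toy.csv")
toy_abs <- read.csv(abs_path)

# Relative path: resolved against the current working directory
old_wd <- setwd(project_dir)           # setwd() returns the previous directory
rel_path <- file.path("Data", "toy.csv")
toy_rel <- read.csv(rel_path)
setwd(old_wd)                          # restore the original working directory

# Both routes read the same file
identical(toy_abs, toy_rel)
```

Only the relative path would survive a move to another computer unchanged, provided the working directory is set to the project folder there.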

R Projects is a built-in mechanism in RStudio for seamless file management and usage of relative paths.

Let’s start by creating a new project. Click File > New Project. Name the new project govt-8001-dataessay.

Figure 1.1: To create a new project: (top) first click New Directory, then (middle) click New Project, then (bottom) fill in the directory (project) name, choose a good subdirectory for its home, and click Create Project. source

Exercise

  1. Do this process again, this time creating the project in an existing directory. The existing directory should be the folder where you have been saving the R scripts and .qmd files associated with math camp 2024.

  2. Go to that folder on your system and double-click the .Rproj file.

  3. Start a new .qmd file like we did yesterday. Delete the existing code except for the YAML header. Run getwd() in the console and see how the working directory differs from before.

  4. Start a new R code chunk (cmd + option + I) and load the vdem dataset. Notice the change in behavior when you press TAB inside the readRDS() function.
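If you are unsure whether the project opened correctly, one quick check is to look for an .Rproj file in the working directory. This is only a sketch for use in the console; the variable name rproj_files is made up for illustration.

```r
# List any RStudio project files in the current working directory
rproj_files <- list.files(getwd(), pattern = "\\.Rproj$", ignore.case = TRUE)

if (length(rproj_files) > 0) {
  message("Working directory is a project root: ", getwd())
} else {
  message("No .Rproj file found in: ", getwd())
}
```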

here package

An efficient file and folder management system is going to be crucial as we move into working with serious projects. As stressed earlier, R Projects make it easier to keep and use all the files associated with a project in a comprehensible folder system. Ideally, you would create your own template for folder management and follow it across projects. For starters, the folder structure below is the one created for your data essay assignment in Govt 8001, or Quant 1.

You can use the point-and-click functionality on your computer to create this structure. Later today, we will briefly go through an R script that does this programmatically.

📦 govt-8001-dataessay
├─ govt-8001-dataessay.Rproj
├─ 000-setup.R
├─ 001-eda.qmd
├─ 002-analysis.qmd
├─ 003-manuscript.qmd
├─ Data
│  ├─ Raw
│  │  ├─ Dataset1
│  │  │  ├─ dataset1.csv
│  │  │  └─ codebook-dataset1.pdf
│  │  └─ Dataset2
│  │     ├─ ...dta
│  │     └─ codebook-dataset2.pdf
│  └─ Clean
│     └─ Merged-df1-df2.csv
├─ Scripts
│  ├─ R-scripts
│  │  ├─ plotting-some-variable.R
│  │  └─ exploring-different-models.R
│  ├─ Stata-Scripts
│  │  └─ seeing-variable-labels.do
│  └─ Python-Scripts
│     └─ scraping-data-from-website.py
└─ Outputs
   ├─ Plots
   │  ├─ ...jpeg
   │  └─ ...png
   ├─ Tables
   │  └─ .csv
   └─ Text
      └─ ...txt

Suggested folder structure for a Quant-1 project

We have learnt how to create an .Rproj file or associate one with a folder. Combining it with the here() function from the here package makes things smoother still. Let’s try it in the following exercise.

Exercise

  1. Go to the RStudio window with the mathcamp2024 project. Check the extreme upper left corner to see if you are in the right window.

  2. In the qmd file we were working in, add an R chunk.

  3. Load the here library with the following code. Run the code line by line.

library(here)

# See the output for each of the following lines
here()

here("Datasets-mathcamp", "V-Dem-CY-Full+Others-v12.rds")

# Syntax:
# here("first subfolder from the root folder", "second subfolder", ..., "file")

vdem_new <- readRDS(here("Datasets-mathcamp", "V-Dem-CY-Full+Others-v12.rds"))

This syntax is cleaner and, when coupled with R Projects, saves time typing file paths and avoids issues when the project is run on another computer.

Note: here() always builds the path starting from the main folder, or root directory, where your .Rproj file is located.
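The same pattern works for saving files. The sketch below assumes the here package is installed. The folder Outputs/Tables, the file toy-summary.csv, and the toy data frame are all hypothetical, used only for illustration. Running it creates that folder and file under your project root.

```r
library(here)

# Hypothetical output folder and file name, both built from the project root
out_dir  <- here("Outputs", "Tables")
out_file <- here("Outputs", "Tables", "toy-summary.csv")

# Create the folder if it does not exist yet
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Toy data standing in for a real summary table
toy_summary <- data.frame(country = c("A", "B"), mean_score = c(0.4, 0.7))

# Save and then reload through the same here() path
write.csv(toy_summary, out_file, row.names = FALSE)
toy_back <- read.csv(out_file)
```

Because both the write and the read go through here(), neither line needs editing when the project folder moves to another computer.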

Save the files and close the mathcamp2024 project window.

Make it a habit to use R Projects and the here() function in your scripts to write portable code.

You can read this quick and informative blogpost on using these two here.

Folder Structure

Let’s look at the other open RStudio window. This is the one associated with govt-8001-dataessay.

We ideally want a folder structure that is easily understandable to us and others.

📦 govt-8001-dataessay
├─ govt-8001-dataessay.Rproj
├─ 000-setup.R
├─ 001-eda.qmd
├─ 002-analysis.qmd
├─ 003-manuscript.qmd
├─ Data
│  ├─ Raw
│  │  ├─ Dataset1
│  │  │  ├─ dataset1.csv
│  │  │  └─ codebook-dataset1.pdf
│  │  └─ Dataset2
│  │     ├─ ...dta
│  │     └─ codebook-dataset2.pdf
│  └─ Clean
│     └─ Merged-df1-df2.csv
├─ Scripts
│  ├─ R-scripts
│  │  ├─ plotting-some-variable.R
│  │  └─ exploring-different-models.R
│  ├─ Stata-Scripts
│  │  └─ seeing-variable-labels.do
│  └─ Python-Scripts
│     └─ scraping-data-from-website.py
└─ Outputs
   ├─ Plots
   │  ├─ ...jpeg
   │  └─ ...png
   ├─ Tables
   │  └─ .csv
   └─ Text
      └─ ...txt

We can create this structure by pointing and clicking on our laptops. But since we might want to reuse the same folder structure, it makes sense to be lazy and do it programmatically.

Exercise

  1. Download 000-setup.R from here

  2. Place it in the govt-8001-dataessay folder.

  3. Open it in the open RStudio window.

```{r}
# Name: 000-setup.R
# Author: Parushya
# Purpose: Creates main folders, subfolders in the main project directory
# Will also ensure that you have basic packages required to run the repository
# Date Created: 2020/10/07



# Checking if packages are installed and installing


# check.packages function: install and load multiple R packages.
# Found this function here: https://gist.github.com/smithdanielle/9913897 on 2019/06/17
# Check to see if packages are installed. Install them if they are not, then load them into the R session.

check.packages <- function(pkg) {
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg)) {
    install.packages(new.pkg, dependencies = TRUE)
  }
  sapply(pkg, require, character.only = TRUE)
}

# Check if packages are installed and loaded:
packages <- c("janitor",  "tidyverse", "utils", "here")
check.packages(packages)


# Setting Directories and creating subfolders


# Creating Sub Folders
# here() already returns a complete path, so it can be passed to dir.create() directly

## Data
dir.create(here("Data")) # Data folder
dir.create(here("Data", "Raw")) # Raw Data sub-folder
dir.create(here("Data", "Clean")) # Clean Data sub-folder


## Scripts
dir.create(here("Scripts")) # Scripts folder
dir.create(here("Scripts", "RScripts")) # RScripts sub-folder
dir.create(here("Scripts", "Stata-Scripts")) # Stata Scripts sub-folder
dir.create(here("Scripts", "Python-Scripts")) # Python Scripts sub-folder


## Outputs
dir.create(here("Outputs")) # Outputs folder
dir.create(here("Outputs", "figures")) # Figures sub-folder
dir.create(here("Outputs", "tables")) # Tables sub-folder
dir.create(here("Outputs", "text")) # Text sub-folder

```
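  4. Run the file line-by-line. See the folder structure created in your main folder. Note that the script names a few folders slightly differently from the diagram above: RScripts instead of R-scripts, and figures, tables, and text instead of Plots, Tables, and Text.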
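The same folders can also be created from a vector of paths in a loop. The sketch below is one possible variant, not the course script. It uses base R only and a hypothetical project_root inside a temporary directory; in a real project, here() would supply the root. The folder names follow 000-setup.R.

```r
# Hypothetical stand-in for the project root; in a real project use here()
project_root <- file.path(tempdir(), "govt-8001-dataessay")

# Sub-folders to create, written relative to the project root
subfolders <- c(
  file.path("Data", c("Raw", "Clean")),
  file.path("Scripts", c("RScripts", "Stata-Scripts", "Python-Scripts")),
  file.path("Outputs", c("figures", "tables", "text"))
)

for (folder in file.path(project_root, subfolders)) {
  # recursive = TRUE also creates missing parent folders;
  # folders that already exist are skipped, so the loop is safe to rerun
  if (!dir.exists(folder)) {
    dir.create(folder, recursive = TRUE)
  }
}

# Inspect the result
created <- list.dirs(project_root, full.names = FALSE, recursive = TRUE)
print(created)
```

Keeping the folder names in one vector means that adapting the template to a new project only requires editing that vector.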

Takeaways

Here’s a quick workflow for starting a new project or assignment or paper.

  1. Make a new folder on your computer with an apt name, ideally govt-<coursecode>-<project>.

  2. Start RStudio.

  3. Create a new RStudio Project by clicking File > New Project. Name it govt-<coursecode>-<project>.

  4. Check that your RStudio window now shows the project name in the top right corner. If not, go to the folder and double-click the .Rproj file.

  5. Paste the 000-setup.R file into the main project folder. Open it in the same RStudio window as the project and run the complete file. Your folder structure is created.

  6. Copy your raw data into the Data/Raw folder and, similarly, your scripts into the Scripts/RScripts folder.

  7. Start your new .qmd file and save it in the main folder.

  8. Remember to use the here() function extensively in both scripts and Quarto files when loading or saving data.

  9. You can always zip the whole project folder for sharing. The recipient just needs to unzip it, open the associated .Rproj file, and run the code, without changing any file paths on their computer.