Day 3 - R for Analysis

Math Session

Calculus I: Derivatives

Differentiation

Links

Lecture slides here
Problem set here

Software Session

Factors
Understanding Dataframes
Loading Packages
Importing and Exporting different types of Dataframes

Factors

Factors are used for categorical data, both nominal and ordinal.

Factors are treated as a separate datatype in R. Technically, a factor is stored as a vector of integer codes, together with a set of character labels (the levels) that those codes stand for.

You can define a factor using the factor() command.

vec_1 <- c("yes", "no", "yes")

fct_1 <- factor(c("yes", "no", "yes"))

# Notice the difference in outputs

vec_1

[1] "yes" "no"  "yes"

fct_1

[1] yes no  yes
Levels: no yes

# Btw, the table() command can be used in R for cross-tabulations
# with both vector and factor datatypes.

table(vec_1)

vec_1
 no yes 
  1   2

table(fct_1)

fct_1
 no yes 
  1   2
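To make the integer storage described earlier visible, here is a minimal sketch using only base R functions. It rebuilds fct_1 so that it can be run on its own.

```r
fct_1 <- factor(c("yes", "no", "yes"))

levels(fct_1)      # the character labels, sorted alphabetically: "no" "yes"
as.integer(fct_1)  # the integer codes that index into the levels: 2 1 2
nlevels(fct_1)     # number of distinct levels: 2

# Converting back to a plain character vector
as.character(fct_1)
```

The codes 2 1 2 say that the first and third elements point at the second level ("yes") and the second element points at the first level ("no").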

Ordering Factors

Sometimes it is essential to specify the order of your factor levels. This matters particularly when modelling with binary or categorical variables, because most modelling functions, such as lm() (the linear regression command in R), treat the first level of a factor as the baseline category.

For example, we have a variable measuring dose of vaccine administered (placebo, medium, high). Here specifying order becomes important as all measurements of the treatment efficacy will have to be with respect to the baseline category.

We use the levels argument within the factor() command to do this.

fct_2 <- factor(c("High", "High", "Medium", "Medium", "High", "High","Placebo"))
fct_2

[1] High    High    Medium  Medium  High    High    Placebo
Levels: High Medium Placebo

# (By default, the levels are sorted alphabetically) (H-M-P)

# Ordering it
fct_2 <- factor(c("Placebo", "High", "Medium", "Placebo", "Medium",
                  "Medium", "High", "High"),
                levels = c("Placebo", "Medium", "High"))

fct_2

[1] Placebo High    Medium  Placebo Medium  Medium  High    High   
Levels: Placebo Medium High
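The effect of level order on a model's baseline can be seen in a small sketch. The response values below are made up purely for illustration and the variable names (dose, response, fit, dose_high_ref) are hypothetical. Only base R is used.

```r
# Hypothetical toy data: the 'response' values are invented for illustration only
dose <- factor(c("Placebo", "High", "Medium", "Placebo", "Medium", "High"),
               levels = c("Placebo", "Medium", "High"))
response <- c(1.0, 3.1, 2.2, 0.8, 1.9, 2.9)

fit <- lm(response ~ dose)
names(coef(fit))   # "(Intercept)" "doseMedium" "doseHigh"

# relevel() changes the baseline without retyping all the levels
dose_high_ref <- relevel(dose, ref = "High")
levels(dose_high_ref)   # "High" "Placebo" "Medium"
```

There is no coefficient for Placebo because it is the first level: it is absorbed into the intercept, and the Medium and High coefficients are differences from it.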

Dataframes

In R, dataframes are data structures that store data in a tabular format.

We create dataframes using the data.frame() command.

df_1 <- data.frame(
    Foo= 15:18, 
    Bar= c(TRUE, FALSE, TRUE, TRUE), # TRUE/FALSE are safer than T/F, which can be overwritten
    Name= c("Penny", "Sheldon", "Rajesh", "Leonard")
)

Exploring the contents and structure of the dataframe

# Viewing dataframe
df_1 # In Console
View(df_1)  # In Viewer

# structure of dataframe
str(df_1)

# Names of columns/variables
names(df_1)

# Dimensions of dataframe
nrow(df_1)      
ncol(df_1)
dim(df_1)

# Summary of dataframe
summary(df_1) # See the output closely | very useful for understanding the dataset

Accessing the objects inside dataframe

# Access Items using [] (returns a one-column dataframe)
df_1[1]

# Access Items using [[]]
df_1[['Bar']]

[1]  TRUE FALSE  TRUE  TRUE

# Access Items using $
df_1$Bar

[1]  TRUE FALSE  TRUE  TRUE

# Access particular data point
df_1$Foo[3]

[1] 17

# what will be the output?
df_1[1,3]

[1] "Penny"
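The difference between the access operators above is easiest to see by checking what each one returns. The following is a minimal sketch in base R. It rebuilds df_1 so that it runs on its own, and it assumes R 4.0 or later, where character columns are not converted to factors by default.

```r
df_1 <- data.frame(
  Foo  = 15:18,
  Bar  = c(TRUE, FALSE, TRUE, TRUE),
  Name = c("Penny", "Sheldon", "Rajesh", "Leonard")
)

class(df_1["Bar"])     # "data.frame": single brackets keep the table structure
class(df_1[["Bar"]])   # "logical": double brackets extract the column itself

# [row, column] indexing; leaving the column slot empty keeps every column
df_1[2, ]                # the second row
df_1[df_1$Bar, "Name"]   # names in the rows where Bar is TRUE
```

The pattern df_1[rows, columns] also explains df_1[1,3]: first row, third column.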

The tidyverse package has a very efficient framework for working with datasets. Check the tidyverse book from day 2 for details.

Packages

Packages in R are containers for functions. A lot of packages are already installed when you install R.

# List the packages installed on your system
library()

You can install packages from the Comprehensive R Archive Network (CRAN), an online repository of documented packages that have passed CRAN's submission checks.

The command for installing a package is install.packages(). Once a package is installed, library() loads it.

# installing package. Eg, tidyverse

install.packages("tidyverse") # You have to run this only once per system

library(tidyverse) # Once installed library(<packagename>) command loads all the functions associated with the package in the current session for use
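Because install.packages() only needs to run once, a common pattern is to check whether a package is already installed before installing it. The sketch below uses requireNamespace() from base R. The names pkgs, is_installed and missing_pkgs are hypothetical, and the two package names are placeholders that ship with R so the example runs without downloading anything. Substitute the packages you actually need.

```r
# Placeholder package names: both ship with R, so nothing is downloaded here
pkgs <- c("stats", "utils")

# requireNamespace() returns TRUE if the package is installed, FALSE otherwise
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

# Install only what is missing
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

This keeps a script safe to re-run: packages already on the system are not reinstalled each time.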

Exercise 1

Install and load the following four packages/libraries, which we will use later: janitor, here, readstata13, and tinytex.

install.packages("<package name>")
library(<package name>)

Importing and Exporting Datasets

R has a range of functions for using different types of data. But before loading datasets let’s understand the concept of working directories.

A working directory is sort of the “office” that you operate from. It tells R where to look for files and where to save them when you give a file name without a full path.

Working directories are specified using a file path i.e. the address in your computer where your script will be stored, or where your dataset is kept.

# Commands:
getwd() # Gets the present directory or pathway where you are operating from
     
setwd("<press tab here>") # Setting new directory as working directory

list.files()  #list the files in the working directory
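How relative file names depend on the working directory can be seen in a short sketch. It uses tempdir(), a scratch folder that R creates for every session, so nothing in your own folders is touched. The file name note.txt and the variable names are hypothetical.

```r
old_wd <- getwd()          # remember where we started
setwd(tempdir())           # move to R's scratch folder for this session

writeLines("hello", "note.txt")   # relative path: created inside the working directory
file.exists("note.txt")           # TRUE

full_path <- file.path(getwd(), "note.txt")  # the same file, as an absolute path
file.exists(full_path)            # TRUE

setwd(old_wd)              # go back to the original working directory
file.exists(full_path)     # still TRUE: absolute paths do not depend on the working directory
```

file.path() builds paths with the correct separator for your operating system, which avoids typing slashes by hand.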

Below is a limited list of commands for loading/importing most commonly used dataset types.

read.csv("FileName") # base R: reads CSV files / press tab inside the quotes
read_csv("Pathname/filename.csv")  # readr (part of tidyverse): also reads CSV files

# The part before :: in the following code refers to the package that the
# function comes from. Those packages must be installed first; when you use ::,
# you do not need to load them with library().

readxl::read_excel() # reads Excel files (.xls and .xlsx)
readxl::read_xlsx() # reads .xlsx files specifically
haven::read_dta()   # reads Stata .dta files

# example: 
    
dataframe1 <- read.csv("<file path>")
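As a self-contained illustration of read.csv(), the sketch below first writes a tiny CSV file to a temporary location and then reads it back. The file contents and the names csv_path and small_df are hypothetical and exist only for this example.

```r
csv_path <- tempfile(fileext = ".csv")   # hypothetical scratch file
writeLines(c("name,age,voted",
             "Penny,31,TRUE",
             "Sheldon,35,FALSE"),
           csv_path)

small_df <- read.csv(csv_path)   # the first line is treated as column names by default
str(small_df)                    # check that each column got the type you expect
```

Running str() right after importing is a quick way to catch columns that were read with an unexpected type, such as numbers stored as text.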

Exercise 2

Download the folder Datasets-mathcamp from the link
Load datasets using the functions referred above
Explore the contents of datasets using the functions we learned in the previous section.
Save these datasets with different names at a different location.

Code 2

Download the folder Datasets-mathcamp from the link
Load datasets using the functions referred above.

ANES dataset | American National Election Study (“2016 Time Series Study,” n.d.)

anes_df <- read.csv("Datasets-mathcamp/anes_specialstudy_2020-2022_socialmedia_csv_20230705/anes_specialstudy_2020-2022_socialmedia_csv_20230705.csv") # base R

anes_df_2 <- read_csv("Datasets-mathcamp/anes_specialstudy_2020-2022_socialmedia_csv_20230705/anes_specialstudy_2020-2022_socialmedia_csv_20230705.csv")  # tidyverse

World Political Cleavages and Inequality Database

wid_df <- readxl::read_excel("Datasets-mathcamp/World Inequality Database/gmp-macro-final-party.xlsx")

Database on Political Institutions

dpi20_df <- read_dta("Datasets-mathcamp/DPI/DPI2020_stata13.dta") # Why error?
dpi20_df_2 <- read.dta13("Datasets-mathcamp/DPI/DPI2020_stata13.dta")

VDem dataset | Varieties of democracy

vdem_df <- readRDS("Datasets-mathcamp/V-Dem-CY-Full+Others-v12.rds")

# RDS and RData are native R file storage formats

Explore the contents of datasets using the functions we learned in the previous section.

#Hint: summary, str

Save these datasets with a different name at a different location.

# Hint
# write.csv() and equivalents

# saveRDS() and save() for native R data structure types
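To illustrate the difference between the two kinds of export, here is a minimal sketch on a made-up dataframe rather than the exercise datasets. The names toy_df, out_dir, csv_file and rds_file are hypothetical, and the files are written to R's temporary folder.

```r
# Hypothetical toy dataframe with a factor column in a custom level order
toy_df <- data.frame(
  id   = 1:3,
  dose = factor(c("Placebo", "High", "Medium"),
                levels = c("Placebo", "Medium", "High"))
)

out_dir  <- tempdir()
csv_file <- file.path(out_dir, "toy_copy.csv")
rds_file <- file.path(out_dir, "toy_copy.rds")

write.csv(toy_df, csv_file, row.names = FALSE)  # plain text; factor information is lost
saveRDS(toy_df, rds_file)                       # R-native; keeps types and level order

csv_back <- read.csv(csv_file, stringsAsFactors = FALSE)
rds_back <- readRDS(rds_file)

class(csv_back$dose)    # "character"
levels(rds_back$dose)   # "Placebo" "Medium" "High"
```

CSV is the better choice for sharing data with other software, while RDS is convenient when the file will only be read back into R and column types should survive the round trip.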

The datasets we just practiced with are very commonly used across various subfields. The documentation is also included in the folder that we just downloaded.