Day 2 - Basics

Math Session

Linear Algebra

  1. Vector and Matrices

  2. System of equations

Links

Software Session

Why Programming or Coding?

There are a lot of fancy answers to it. But the key idea is that you want to be lazy about repetitive tasks (MBAs call it being “productive”).

Many tasks - data cleaning, wrangling, visualization, and statistical analysis - require you to do them many times. Moreover, you would want to be able to reproduce and replicate your thinking about all of the tasks mentioned above on many different datasets and sometimes even on the same dataset after some time.

Coding is about formalizing your thinking about how you treat the data and automating the formalization task to be done repetitively. It improves efficiency, enhances reproducibility, and boosts creativity when it comes to finding new patterns in your data.

Guidelines for data and statistical analyses [1]:

  1. Accuracy: Write code that reduces the chance of making an error and lets you catch one if it occurs.
  2. Efficiency: If you are doing something twice, identify the pattern in your decision-making and formalize it in your code. This is a key difference between working in Excel and coding.
  3. Replicate-and-Reproduce: The ability to repeat the computational process, which reflects your thinking and the decisions you took along the way. This improves transparency and forces you to be deliberate and responsible about choices during analyses.
  4. Human Interpretability: Writing code is not just about analyzing data but about allowing yourself, and then others, to understand your analytic choices.
  5. Public Good: Research is a public good, and code makes your research truly accessible. This means writing code that anyone who understands the language can read, reuse, and recreate without you being present. We ensure that by writing readable and, ideally, publicly accessible code.

R and RStudio

R is a free, open-source statistical programming language. We generally use R through RStudio, which is an integrated development environment (IDE). Essentially, it is the graphical user interface that lets us use R efficiently. It also has point-and-click functionality (which we will not use much).

RStudio Screen


R Scripts: This is where you write your code. A script is saved with a .R extension. An R script is a plain text file, so you can also read it in any text editor. We use RStudio to run the code in a way that the computer understands.
Console: Output from your code appears here. You can also write code directly here, but it does not get saved. Also, by default, it shows only a limited number of previous steps (commands + outputs). Coding here is not good practice.
Environment: All the objects, datasets, lists, etc. that you have created or loaded appear here, along with any custom functions you create.
File Browser/Help/Plot: The internal file navigator and the help documentation for packages and functions appear here. Any plot you make is also shown here.
Comments: R reads the lines of a script as commands and runs them, except for anything preceded by a #. A # signals to R that the rest of the line is to be ignored. We use comments to write explanatory notes about the code. A comment should explain the purpose of a command, not just describe what it does.
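
As a small sketch of the difference between a comment that only describes a command and one that explains its purpose (the variable names and numbers below are made up for illustration):

```r
# Descriptive only (adds little): divide total by count
# Purpose (more useful): average spending per household, needed for the summary table
total_spending <- 4500
n_households <- 30
avg_spending <- total_spending / n_households

avg_spending  # text after a # on the same line is ignored as well
```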

Basics

R uses <- as the assignment operator. To its left is an object (a sort of box) that stores the value or data written to the right of the operator.

Syntax: object <- value/data
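
A minimal sketch of how assignment behaves, using made-up object names: assigning stores a value silently, typing the object's name prints it, and an object can be reused or overwritten.

```r
# Assignment stores the value without printing it
height_cm <- 172

# Typing the object's name prints what it stores
height_cm

# Objects can be used in further calculations and overwritten later
height_m <- height_cm / 100
height_cm <- 180
```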

  1. Create a new .R script. Name it and save it on your system.

  2. R does all the functions of a calculator. Write some code in the script that

  • Adds two numbers

  • Multiplies three numbers

  • Prints your name

  3. Run each command separately by using cmd + Enter / ctrl + Enter.

  4. Assign the outputs from 2 to different objects.

  5. Print the objects with some description using paste().

  6. Run the whole file.

You can start a new script in several ways:

  1. ctrl + shift + n

  2. Click on the small white page button with a green + sign in the upper left corner of the screen

  3. Click on File > New File > R Script

Saving a script:

  1. Ctrl + S

  2. Enter the name of the script, and add .R as a suffix. For example: xyzbasic.R

#2. 
2 + 7

56 * 9 * 33

print("Parushya")

The output is displayed in the console.

#4


sum_2 <- 2 + 7

prod_3 <- 56 * 9 * 33

name <- "parushya"
#5

paste("Sum of 2 and 7 is", sum_2)

paste("Product of 56, 9 and 33 is", prod_3)

paste("This very fancy R code was written by", name)


Objects, Datatypes, and Data Structures

Exercise 2

Run the following code in the same script that we created

class(sum_2)

class(prod_3)

class(name)

Everything in R is called an “object”

“Objects” contain “data”.

The three variables we created - sum_2, prod_3, and name - were all basic objects.

R has 5 basic or “atomic” classes/datatypes of objects.

  1. Character - (abc)

  2. Numeric - (real numbers) - (1,7.5,etc)

  3. Integer - (1,2,0,-896)

  4. Logical - (TRUE/FALSE)

  5. Complex - (1+0i, 0+1i)
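
The five classes above can be checked directly with class(). This is a small sketch with made-up values; note that a whole number typed without the L suffix is stored as numeric, not integer.

```r
class("abc")     # character
class(7.5)       # numeric
class(1)         # numeric: a number typed without L is stored as a double
class(1L)        # integer: the L suffix requests an integer
class(TRUE)      # logical
class(1 + 0i)    # complex
```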

Data structures are bigger containers that hold many objects.

Two basic or “atomic” data structures in R are:

  1. Vectors: can hold objects of same datatype

  2. Lists: can hold objects with different datatypes
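
A short sketch of what "same datatype" means in practice, using made-up values: when mixed inputs are combined into a vector, R converts them to a common type, whereas a list keeps each element as it is.

```r
# A vector holds a single datatype, so mixed inputs are converted to a common one
mixed_vec <- c(1, "a", TRUE)
mixed_vec          # all three elements are now character strings
class(mixed_vec)

# A list keeps each element's own datatype
mixed_list <- list(1, "a", TRUE)
str(mixed_list)
```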

Understanding Vectors

We can create a vector using the “c()” command.

a_num <- c(0,0.7,9,2,3,4,-1)            # numeric or double

b_logical <- c(TRUE,FALSE,TRUE,TRUE,TRUE) # logical

c_logical <- c(T,F,T,T,T) # also logical, but avoid T and F: unlike TRUE and FALSE, they are ordinary variables that can be reassigned

d_char <- c("Sheila", "Nila", "Camilla")  # character

e_int <- 1:20 # integer

f_int <- c(1,2,3,4,5)  # numeric (double), not integer; write c(1L,2L,3L,4L,5L) to get integers

g_int <- c(1+0i,2+4i) # complex numbers
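
Two details about vector types are easy to miss, so here is a hedged sketch with made-up objects: T and F can be overwritten like any other variable, and whole numbers are only stored as integers when written with the L suffix or generated with the : operator.

```r
# TRUE and FALSE are reserved words; T and F are ordinary variables
# that merely start out equal to TRUE and FALSE.
T <- 0                     # legal, and silently changes what T means
flags <- c(T, TRUE)
class(flags)               # numeric, no longer logical
rm(T)                      # remove our copy so T refers to TRUE again

# Numbers typed without the L suffix are doubles, even when they are whole
class(c(1, 2, 3))          # numeric
class(c(1L, 2L, 3L))       # integer
class(1:3)                 # integer
```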

Basic vectors are one-dimensional. A two-dimensional vector is called a matrix.

Working with matrices

# Creating Blank Matrix
m_1 <- matrix(nrow=3,ncol=4)
m_1      
     [,1] [,2] [,3] [,4]
[1,]   NA   NA   NA   NA
[2,]   NA   NA   NA   NA
[3,]   NA   NA   NA   NA
dim(m_1)
[1] 3 4
?matrix # Help documentation

# Creating Matrix with elements

m_2 <- matrix(1:10, nrow = 3, ncol = 4) # Why the warning?
m_2

# With correct number of elements
m_3 <- matrix(1:18, nrow=9, ncol=2)
m_3
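
On the warning: 10 values do not fit evenly into a 3 x 4 matrix, which has 12 cells. The sketch below, with made-up object names, shows that R still builds the matrix by reusing values from the start of the data; suppressWarnings() is used here only to keep the example quiet.

```r
# 10 values cannot fill a 3 x 4 matrix (12 cells) evenly, so R warns
# and recycles the data from the start to fill the remaining cells.
m_short <- suppressWarnings(matrix(1:10, nrow = 3, ncol = 4))
m_short

# 12 values fit exactly, so there is no warning
m_full <- matrix(1:12, nrow = 3, ncol = 4)
dim(m_full)
length(m_full) == 3 * 4
```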

Logic of matrices

Matrices are filled column-wise by default: R starts at the upper-left corner, runs down the first column, and then moves on to the next column.
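
A small sketch of the fill order, with made-up object names. The byrow argument of matrix() switches from the default column-wise fill to a row-wise fill.

```r
# Default: values run down the first column, then the second, and so on
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_col

# byrow = TRUE fills across the first row, then the second
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_row
```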

Indexing in matrices

# Rows & Columns ----    
# The syntax is m[row, column]; for example, m[2, 3] is the element in row 2, column 3
# m[1,] - 1st row
# m[2,] - 2nd row
# m[,3] - 3rd column  
# m[,5] - 5th column
# m[,7] - 7th column
# What if you already have a vector?
# Example: You have received a list of students who have skipped school today.
# You know which section they are in, and want to create a matrix.
k <- c("Hashem", "John", "Cecillia", "Minha", "Parushya", "Keeheon")
k
[1] "Hashem"   "John"     "Cecillia" "Minha"    "Parushya" "Keeheon" 
dim(k) <- c(3,2)
k
     [,1]       [,2]      
[1,] "Hashem"   "Minha"   
[2,] "John"     "Parushya"
[3,] "Cecillia" "Keeheon" 
colnames(k) <- c("Section A", "Section B")

k
     Section A  Section B 
[1,] "Hashem"   "Minha"   
[2,] "John"     "Parushya"
[3,] "Cecillia" "Keeheon" 
rownames(k) <- c("Student 1", "Student 2", "Student 3")

k
          Section A  Section B 
Student 1 "Hashem"   "Minha"   
Student 2 "John"     "Parushya"
Student 3 "Cecillia" "Keeheon" 
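
The indexing rules can be tried on a small matrix. This is a sketch on a made-up matrix m, not one of the objects created earlier.

```r
m <- matrix(1:12, nrow = 3, ncol = 4)

m[2, 3]          # single element: row 2, column 3
m[1, ]           # entire first row
m[, 3]           # entire third column
m[2:3, c(1, 4)]  # rows 2 and 3 of columns 1 and 4

# With row or column names, labels can be used in place of positions
colnames(m) <- c("a", "b", "c", "d")
m[, "c"]
```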

Binding vectors together to make a matrix

# Binding
x <- 1:3
y <- 4:6
z <- c("Camilla","Nila","Duflo","Akbar")

x
[1] 1 2 3
y
[1] 4 5 6
z
[1] "Camilla" "Nila"    "Duflo"   "Akbar"  
rbind(x,y) # Binds by row: each vector becomes a row, stacked one below the other
  [,1] [,2] [,3]
x    1    2    3
y    4    5    6
cbind(x,y) # Binds by column: each vector becomes a column, placed side by side
     x y
[1,] 1 4
[2,] 2 5
[3,] 3 6
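
A quick sketch of how binding affects shape and type. The vector level_names is made up for illustration; because a matrix holds a single datatype, binding numbers with text converts everything to character.

```r
x <- 1:3
y <- 4:6

dim(rbind(x, y))   # 2 rows, 3 columns: each vector becomes a row
dim(cbind(x, y))   # 3 rows, 2 columns: each vector becomes a column

# Binding numbers with text converts the whole matrix to character
level_names <- c("low", "mid", "high")
mixed <- cbind(x, level_names)
class(mixed[, "x"])
```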

Lists

If we want to create something that stores objects of different classes together, we use another data structure called a list.

A list can contain two or more classes of objects with different lengths.

Creating lists

list_1 <- list("a" = 2.5, "b" = TRUE, "c" = 1:3)

list_1
$a
[1] 2.5

$b
[1] TRUE

$c
[1] 1 2 3

We created a list with objects of three different types - numeric, logical, and integer vector.

# Structure of the list
str(list_1)
List of 3
 $ a: num 2.5
 $ b: logi TRUE
 $ c: int [1:3] 1 2 3

We can also create a list with existing vectors.

# A new vector
name_vec <- c("Camilla","Nila","Duflo","Akbar")

# And then let's use the vectors we already have in the environment
list_2 <- list(name_vec, c_logical, d_char, f_int, e_int, a_num)
list_2
[[1]]
[1] "Camilla" "Nila"    "Duflo"   "Akbar"  

[[2]]
[1]  TRUE FALSE  TRUE  TRUE  TRUE

[[3]]
[1] "Sheila"  "Nila"    "Camilla"

[[4]]
[1] 1 2 3 4 5

[[5]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

[[6]]
[1]  0.0  0.7  9.0  2.0  3.0  4.0 -1.0
# let's check the classes of objects
class(list_2[[2]])
class(list_2[[3]])

# And their lengths
length(list_2[[2]])
length(list_2[[3]])

Accessing elements in a List

By indices in a list

# Lists are printed differently from vectors: each element of a list is labelled with double brackets, [[ ]].

list_2
[[1]]
[1] "Camilla" "Nila"    "Duflo"   "Akbar"  

[[2]]
[1]  TRUE FALSE  TRUE  TRUE  TRUE

[[3]]
[1] "Sheila"  "Nila"    "Camilla"

[[4]]
[1] 1 2 3 4 5

[[5]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

[[6]]
[1]  0.0  0.7  9.0  2.0  3.0  4.0 -1.0
## Accessing elements | run each of the following lines and see the output
list_2[[2]] 
list_2[2]
list_2[1][2]
list_2[[1]][1]
list_2[[1]][[1]]
list_2[[1]][2]
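
The difference between single and double brackets is easier to see on a small list. A sketch with a made-up list called items: single brackets return a smaller list, while double brackets return the element itself.

```r
items <- list(c("cat", "dog"), c(TRUE, FALSE, TRUE))

class(items[1])      # "list": single brackets return a smaller list
class(items[[1]])    # "character": double brackets return the element itself

items[[1]][2]        # second value inside the first element
items[1][2]          # a one-element list has no second slot, so the result holds NULL
```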

By using names or tags

list_3 <- list(name = "John", age = 19, speaks = c("English", "French"))

# access elements by name
list_3$name
list_3$age
list_3$speaks

# access elements by integer index
list_3[c(1, 2)]
list_3[-2]

# access elements by logical index
list_3[c(TRUE, FALSE, FALSE)]

# access elements by character index
list_3[c("age", "speaks")]
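
Names also work inside double brackets. The sketch below uses a made-up list called person and shows one practical difference from $: a name stored in a variable can be used with [[ ]].

```r
person <- list(name = "John", age = 19)

names(person)        # the tags available for access by name
person[["age"]]      # same result as person$age
person["age"]        # single brackets keep the list wrapper

# A name stored in a variable works with [[ ]] but not with $
field <- "age"
person[[field]]
```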

Modifying lists

Adding components in a list

list_4 <- list(name = "Clair", age = 19, speaks = c("English", "French"))

# assign a new element to the list using double brackets [[]]
list_4[["married"]] <- FALSE

# print the updated list
list_4
$name
[1] "Clair"

$age
[1] 19

$speaks
[1] "English" "French" 

$married
[1] FALSE
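
Double brackets are not the only way to add or change components. A sketch with a made-up list called profile (the values are hypothetical), showing $ notation, overwriting an existing element, and appending with c():

```r
profile <- list(name = "Clair", age = 19)

profile$city <- "Chicago"                    # $ notation also adds a new element
profile[["age"]] <- 20                       # assigning to an existing name overwrites it
profile <- c(profile, list(student = TRUE))  # c() appends another list

names(profile)
```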

Deleting components in a list

list_5 <- list(name = "Clair", age = 19, speaks = c("English", "French"))

# remove an element from the list using double brackets [[]]
list_5[["age"]] <- NULL

# print the structure of the updated list
str(list_5)
List of 2
 $ name  : chr "Clair"
 $ speaks: chr [1:2] "English" "French"
# $ notation removes elements the same way; list_5 has no element named married, so this line leaves the list unchanged
list_5$married <- NULL

# print the structure of the updated list
str(list_5)
List of 2
 $ name  : chr "Clair"
 $ speaks: chr [1:2] "English" "French"
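
A few related behaviours, sketched on a made-up list called record: assigning NULL to a name that does not exist changes nothing, a negative index drops an element by position, and wrapping NULL in list() stores NULL in a slot instead of deleting it.

```r
record <- list(name = "Clair", age = 19, speaks = c("English", "French"))

record$nickname <- NULL          # no such element: the list is unchanged
length(record)

record <- record[-2]             # a negative index drops the 2nd element
names(record)

record["name"] <- list(NULL)     # keeps the slot but stores NULL in it
length(record)
```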

[1] Inspired by the summary provided in Prof Aaron Williams’ course on Data Analysis offered at McCourt School. Strongly recommended for learning good coding practices in R.