Data structures
Data types
Numeric/Double (e.g. 2.5, 1/5, 1.0, …)
Integer (e.g. 1L, 2L, 3L, …; a bare 1 is stored as a double)
Complex (e.g. 1 + 2i, …)
Logical (e.g. TRUE, FALSE or NA)
Character (e.g. “a”, “paper”, “2 plus 2 = 5”, “TRUE”, …)
Factor/Categorical (e.g. “male”, “female”, …)
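As a quick check of the types listed above, the sketch below asks R for the storage type of a few literals with typeof() and class(). The values and the variable name sex are illustrative only.

```r
# Illustrative values only; typeof() reports the internal storage type
typeof(2.5)        # "double"
typeof(1)          # "double": a bare number is not an integer
typeof(1L)         # "integer": the L suffix requests an integer
typeof(1 + 2i)     # "complex"
typeof(TRUE)       # "logical"
typeof("TRUE")     # "character": the quotes make it text

# A factor stores categories as integer codes plus a set of labels
sex <- factor(c("male", "female", "male"))
class(sex)         # "factor"
levels(sex)        # "female" "male" (sorted alphabetically by default)
```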
Vectors (I)
You can create a vector using the command c()
c(1, 3, 5, 10)
## [1] 1 3 5 10
Vectors must contain elements of the same data type. If you mix types, R converts all elements to a common type (here, character).
c(1, "intro", TRUE)
## [1] "1" "intro" "TRUE"
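The conversion behind this result can be sketched as follows. When types are mixed, R silently coerces every element to the most flexible type present, roughly in the order logical, integer, double, character. The vectors mixed_num and mixed_chr are made-up examples.

```r
# Made-up vectors illustrating implicit coercion in c()
mixed_num <- c(1L, 2.5, TRUE)      # integer and logical become double
mixed_num                          # 1.0 2.5 1.0
typeof(mixed_num)                  # "double"

mixed_chr <- c(1, "intro", TRUE)   # everything becomes character
typeof(mixed_chr)                  # "character"

# Converting back must be requested explicitly; text that is not a
# number turns into NA (with a warning, silenced here)
back_to_num <- suppressWarnings(as.numeric(mixed_chr))
back_to_num                        # 1 NA NA
```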
You can measure the length of a vector using the command length()
length(c(1, 3, 5, 10))
## [1] 4
Vectors (II)
It is also possible to easily create sequences
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
seq(from = 1, to = 2, by = 0.1)
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
rep("A", times = 5)
## [1] "A" "A" "A" "A" "A"
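A few further variations on sequence creation, as a small sketch. The variable names are illustrative; seq(), seq_len() and rep() are base R functions.

```r
# Ask for a number of points instead of a step size
quarters <- seq(from = 0, to = 1, length.out = 5)
quarters                              # 0.00 0.25 0.50 0.75 1.00

# seq_len(n) is a safe way to write 1:n (it returns an empty vector for n = 0)
first_four <- seq_len(4)
first_four                            # 1 2 3 4

# times can be a vector: repeat "A" twice and "B" three times
ab <- rep(c("A", "B"), times = c(2, 3))
ab                                    # "A" "A" "B" "B" "B"
```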
Vectors (III)
You can combine different vectors
x <- 1:3 # from 1 to 3
y <- c(10, 15) # 10 and 15
z <- c(x,y) # x first and then y
z
## [1] 1 2 3 10 15
And you can repeat vectors (or their elements)
z <- rep(y, each=3) # repeat each element 3 times
z
## [1] 10 10 10 15 15 15
z <- rep(y, times=3) # repeat the whole vector 3 times
z
## [1] 10 15 10 15 10 15
Subsetting Vectors
x <- c(1, 5, 10, 7)
x < 6 # is the element lower than 6?
## [1] TRUE TRUE FALSE FALSE
x == 10 # is the element equal to 10?
## [1] FALSE FALSE TRUE FALSE
x[2] # which element is in the second position?
## [1] 5
x[1:2] # which elements are in the first 2 positions?
## [1] 1 5
x[c(1,3,4)] # which elements are in positions 1, 3 and 4?
## [1] 1 10 7
Subsetting Vectors
n <- c(1, 4, 5, 6, 7, 2, 3, 4, 5, 6) # creates a vector, stores it in n
length(n) # number of elements
## [1] 10
n[3] # extract 3rd element in n
## [1] 5
n[-2] # extract all of n but 2nd element
## [1] 1 5 6 7 2 3 4 5 6
n[c(1,3,4)] # extract first, third, and fourth element of n
## [1] 1 5 6
n[n < 4] # extract all elements in n smaller than 4
## [1] 1 2 3
n[n < 4 & n != 1] # extract elements smaller than 4 AND different from 1
## [1] 2 3
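Logical subsetting works in two steps: the comparison produces a logical vector, and the brackets keep the positions that are TRUE. A small sketch with a hypothetical vector scores makes the intermediate step visible.

```r
scores <- c(12, 7, 19, 3, 15)     # hypothetical values

keep <- scores > 10               # TRUE FALSE TRUE FALSE TRUE
scores[keep]                      # 12 19 15: values where keep is TRUE
which(keep)                       # 1 3 5: positions where keep is TRUE
sum(keep)                         # 3: TRUE counts as 1, so this counts matches

# %in% tests membership in a set of values
scores[scores %in% c(3, 7)]       # 7 3
```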
Exercise break
n <- c(1, 4, 5, 6, 7, 2, 3, 4, 5, 6) # creates a vector, stores it in n
- number of elements
- extract the first three elements of n
- extract all of n but the 3rd element
- extract the first, third, and fourth elements of n
- extract the elements smaller than 4 OR equal to 5
Exercise break
We can create and index character vectors as well. A cafe is using R to create their menu.
items <- c("apples", "oranges", "eggs", "tomatoes", "bananas")
- What does items[-3] produce?
- Based on what you find, use indexing to create a version of items without “bananas”.
- Use indexing to create a vector containing apples, eggs, tomatoes, bananas, and bananas.
- Add a new item, “lemons”, to items.
- Make items[3] “berries”.
Vector Operations
x <- c(1,5,10,7)
x+2 # adds a scalar to all elements
## [1] 3 7 12 9
x^2 # what's the square of all elements?
## [1] 1 25 100 49
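Arithmetic between two vectors is also element by element, and a shorter vector is recycled to match the longer one. The sketch below reuses x and introduces a hypothetical second vector w.

```r
x <- c(1, 5, 10, 7)
w <- c(2, 2, 3, 3)       # hypothetical second vector of the same length

x + w                    # 3 7 13 10: first with first, second with second, ...
x * w                    # 2 10 30 21

# c(1, 10) is recycled to 1 10 1 10 before multiplying
x * c(1, 10)             # 1 50 10 70

# Summary functions collapse the vector to a single number
sum(x)                   # 23
mean(x)                  # 5.75
```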
Matrices (I)
You can create a matrix using the command matrix()
X <- matrix(1:9, nrow = 3, ncol = 3)
X
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Matrices (II)
By default R fills a matrix column by column, but we can ask it to fill row by row
X <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
X
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
You don’t even have to name the arguments, as long as they are given in the right order
X <- matrix(1:8, 2, 4, TRUE)
X
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
Matrices (III)
Matrices can also be created by combining vectors
X <- cbind(1:4, 6:9) # binds them as columns
X
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
X <- rbind(1:4, 6:9) # binds them as rows
X
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 6 7 8 9
Subsetting Matrices
X>5 # elements larger than 5
## [,1] [,2] [,3] [,4]
## [1,] FALSE FALSE FALSE FALSE
## [2,] TRUE TRUE TRUE TRUE
X[1,4] # element of first row, fourth column?
## [1] 4
X[1,] # elements in the first row?
## [1] 1 2 3 4
X[,2] # elements in the second column?
## [1] 2 7
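Note that extracting a single row or column returns a plain vector rather than a matrix. The following sketch, which rebuilds the same 2 x 4 matrix, shows how drop = FALSE keeps the matrix shape and what a logical condition inside the brackets returns.

```r
X <- rbind(1:4, 6:9)           # same 2 x 4 matrix as above

col2 <- X[, 2]                 # plain vector: 2 7
dim(col2)                      # NULL: a vector has no dimensions

col2_mat <- X[, 2, drop = FALSE]
dim(col2_mat)                  # 2 1: still a matrix with one column

X[X > 5]                       # 6 7 8 9: matching elements, read column by column
X[2, c(1, 4)]                  # 6 9: second row, columns 1 and 4
```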
Lists
A list is a one-dimensional heterogeneous data structure.
It is indexed like a vector, with a single integer value (or a name), but each element can hold an object of any data type, including another vector or list.
x <- 1:4
y <- c("a", "b", "c")
L <- list(numbers = x, letters = y)
L
## $numbers
## [1] 1 2 3 4
##
## $letters
## [1] "a" "b" "c"
Subsetting Lists
L[[1]] # extract the first element
## [1] 1 2 3 4
L$numbers # extract the element called numbers
## [1] 1 2 3 4
L$letters # extract the element called letters
## [1] "a" "b" "c"
You can even “work” with the subsetted element:
L$letters == "c" # is each letter equal to "c"?
## [1] FALSE FALSE TRUE
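One detail worth seeing in code is the difference between single and double brackets on a list. The sketch below rebuilds the same list; the element name flag is a hypothetical addition.

```r
L <- list(numbers = 1:4, letters = c("a", "b", "c"))

class(L["numbers"])      # "list": single brackets return a smaller list
class(L[["numbers"]])    # "integer": double brackets return the element itself

L[["numbers"]][2]        # 2: second value of the extracted vector
mean(L$numbers)          # 2.5

L$flag <- TRUE           # add a new element by name (hypothetical name)
names(L)                 # "numbers" "letters" "flag"
length(L)                # 3
```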
Exercise break
letters
LETTERS
my_letters <- cbind(LETTERS, letters)
class(my_letters)
dim(my_letters)
my_letters <- cbind(my_letters, seq_along(letters))
What number is the letter F in the English alphabet?
Exercise
# vector ... etc.
my_dna <- "AACGAATGAGTAAATGAGTAAATGAAGGAATGATTATTCCTTGCTTTAGAACTTCTGGAATTAGAGGACA
ATATTAATAATACCATCGCACAGTGTTTCTTTGTTGTTAATGCTACAACATACAAAGAGGAAGCATGCAG"
my_dna
length(my_dna)
class(my_dna)
str(my_dna)
nchar(my_dna)
# approach 1
my_dna_comma <- sapply(strsplit(
x = my_dna, split = "", fixed = TRUE),
function(x) paste(x, collapse = "_"))
length(my_dna_comma)
str(my_dna_comma)
my_dna_list <- strsplit(x = my_dna, split = "", fixed = TRUE)
length(my_dna_list)
class(my_dna_list)
my_dna_vector <- unlist(my_dna_list)
length(my_dna_list[[1]])
str(my_dna_vector)
length(my_dna_vector)
# first nucleotide
my_dna_vector[1]
# indexing 1:nchar(my_dna)
my_dna_vector[1:50]
# unique characters
unique(my_dna_vector)
# number of As
(my_dna_vector == "A")
length(my_dna_vector[my_dna_vector == "A"])
# remove \n
remove_nuc <- match("\n", my_dna_vector)
which(my_dna_vector %in% c("\n", "X"))
my_dna_vector[remove_nuc] <- "A"
my_dna_vector_2 <- my_dna_vector[-remove_nuc] # drop position 71, where the newline was
unique(my_dna_vector_2)
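A more compact way to reach the same kind of result is sketched below on a short, made-up sequence (dna is a hypothetical stand-in for my_dna). It removes the line break with gsub() before splitting and counts every nucleotide at once with table().

```r
dna <- "AACGT\nTTA"                    # hypothetical short sequence with a line break
nchar(dna)                             # 9: the line break counts as one character

clean <- gsub("\n", "", dna, fixed = TRUE)   # delete the line break
nchar(clean)                           # 8

nuc <- strsplit(clean, split = "", fixed = TRUE)[[1]]  # one character per element
counts <- table(nuc)                   # A: 3, C: 1, G: 1, T: 3
counts[["A"]]                          # 3
sum(nuc == "A")                        # 3: same count via a logical vector
```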
Data Frames (I)
A data.frame is similar to a typical spreadsheet in Excel.
There are rows, and there are columns.
A row is typically thought of as an observation.
A column is a certain variable, characteristic or feature of that observation.
Data Frames (II)
A data frame is a list of column vectors where:
- each column has a name
- each column contains a single data type, but different columns can store different data types
- all columns have the same length
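These properties can be checked directly. The sketch below uses a small hypothetical data frame called people; the names and values are made up.

```r
people <- data.frame(id = 1:3,
                     name = c("Ana", "Bo", "Cy"),
                     stringsAsFactors = FALSE)

is.list(people)          # TRUE: a data frame is a list underneath
names(people)            # "id" "name": every column has a name
sapply(people, class)    # id: "integer", name: "character"
length(people)           # 2: the list has one element per column
nrow(people)             # 3: every column has three values

# Columns of incompatible lengths are rejected with an error
res <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
inherits(res, "try-error")   # TRUE
```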
Data Frames (III)
set.seed(1)
df <- data.frame(id = 1:5,
name = c("Diego", "Samuel", "Marco", "Javier", "Leonardo"),
surname = c("Milito", "Eto'o", "Materazzi", "Zanetti", "Bonucci"),
wage = rnorm(n=5, mean = 10^5, sd = 10^3), # normal random sample
origin = c("Argentina", "Cameroon", "Italy", "Argentina", "Italy"),
treble_winner = c(TRUE, TRUE, TRUE, TRUE, FALSE)
)
df
## id name surname wage origin treble_winner
## 1 1 Diego Milito 99373.55 Argentina TRUE
## 2 2 Samuel Eto'o 100183.64 Cameroon TRUE
## 3 3 Marco Materazzi 99164.37 Italy TRUE
## 4 4 Javier Zanetti 101595.28 Argentina TRUE
## 5 5 Leonardo Bonucci 100329.51 Italy FALSE
You can verify the size of the data.frame using the command dim()
You can get the data type info using the command str()
Subsetting Data Frames (I)
df$name # subset a column
## [1] "Diego" "Samuel" "Marco" "Javier" "Leonardo"
df[,c(2,5)] # can also subset like a matrix
## name origin
## 1 Diego Argentina
## 2 Samuel Cameroon
## 3 Marco Italy
## 4 Javier Argentina
## 5 Leonardo Italy
Subsetting Data Frames (II)
head(df, n=3) # first n observations
## id name surname wage origin treble_winner
## 1 1 Diego Milito 99373.55 Argentina TRUE
## 2 2 Samuel Eto'o 100183.64 Cameroon TRUE
## 3 3 Marco Materazzi 99164.37 Italy TRUE
tail(df, n=3) # last n observations
## id name surname wage origin treble_winner
## 3 3 Marco Materazzi 99164.37 Italy TRUE
## 4 4 Javier Zanetti 101595.28 Argentina TRUE
## 5 5 Leonardo Bonucci 100329.51 Italy FALSE
Inspecting data frames (I)
R comes with many built-in datasets. These can be used for learning R.
One of the most famous is the one called mtcars.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
tail(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
dim(mtcars)
## [1] 32 11
Inspecting data frames (II)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
colnames(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
Subsetting data frames (III)
We are interested in the cylinders and the weights of inefficient cars (fewer than 15 miles per gallon).
poll_cars <- mtcars[mtcars$mpg<15, c("cyl", "wt")]
poll_cars
## cyl wt
## Duster 360 8 3.570
## Cadillac Fleetwood 8 5.250
## Lincoln Continental 8 5.424
## Chrysler Imperial 8 5.345
## Camaro Z28 8 3.840
Subsetting data frames (IV)
Alternatively:
poll_cars <- subset(mtcars, subset = mpg<15, select = c("cyl", "wt"))
poll_cars
## cyl wt
## Duster 360 8 3.570
## Cadillac Fleetwood 8 5.250
## Lincoln Continental 8 5.424
## Chrysler Imperial 8 5.345
## Camaro Z28 8 3.840
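The two approaches give the same result on mtcars, but they treat missing values differently. The sketch below first compares them, then introduces an NA into a small hypothetical copy (few) to show the difference: brackets return a row of NAs where the condition is NA, while subset() drops that row.

```r
by_brackets <- mtcars[mtcars$mpg < 15, c("cyl", "wt")]
by_subset   <- subset(mtcars, subset = mpg < 15, select = c("cyl", "wt"))
all.equal(by_brackets, by_subset)          # TRUE

# Hypothetical copy of the first four cars with one missing mpg value
few <- mtcars[1:4, c("mpg", "cyl")]
few$mpg[2] <- NA

n_brackets <- nrow(few[few$mpg > 21, ])    # 3: includes an all-NA row
n_subset   <- nrow(subset(few, mpg > 21))  # 2: the NA row is dropped
```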
Importing downloaded data frames (.csv)
You can import csv data that you have downloaded from any external source using:
setwd("data")
nyc_ab <- read.csv("AB_NYC_2020.csv")
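where setwd() sets the working directory to the folder that holds the file, and read.csv() reads the file into a data frame.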
You can similarly import data files stored in most other formats (.xls, .txt, .dta, .Rdata, .mat, …), in some cases with the help of add-on packages.
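To try the import step without downloading anything, the sketch below writes a tiny, made-up table to a temporary .csv file and reads it back. The data frame listings and its columns are hypothetical and are not the contents of AB_NYC_2020.csv. It also shows file.path() as an alternative to changing the working directory with setwd().

```r
# Hypothetical data, written to a temporary file
listings <- data.frame(id = 1:3,
                       price = c(80, 120, 95),
                       borough = c("Bronx", "Queens", "Bronx"))
tmp <- tempfile(fileext = ".csv")
write.csv(listings, tmp, row.names = FALSE)

back <- read.csv(tmp)     # the first line of the file becomes the column names
dim(back)                 # 3 3
colnames(back)            # "id" "price" "borough"
unlink(tmp)               # delete the temporary file

# Build a path relative to the current working directory instead of using setwd()
path <- file.path("data", "AB_NYC_2020.csv")
path                      # "data/AB_NYC_2020.csv"
```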
Importing downloaded data frames (.txt)
Interferon regulatory factor 6 mouse
setwd("data")
irf6 <- read.table("irf6.txt", header = TRUE, row.names = 1)
# explore
head(irf6)
ncol(irf6); nrow(irf6)
dim(irf6)
Importing downloaded data frames (.txt)
irf6 <- read.table("data/irf6.txt", header = TRUE, row.names = 1)
class(irf6)
str(irf6)
colnames(irf6)
head(rownames(irf6))
Importing downloaded data frames (.txt)
irf6 <- read.table("data/irf6.txt", header = TRUE, row.names = 1)
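head(irf6["E17.5KO1"]) # retrieve only the E17.5KO1 column (as a data frame)
head(irf6[, 1]) # the same column, selected by position (as a vector)
head(irf6$E17.5KO1) # the same column, selected by name (as a vector)
dim(irf6[, -1]); colnames(irf6[, -1]) # exclude the first column (E17.5KO1)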
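Because irf6.txt may not be at hand, the sketch below reproduces these selections on a small hypothetical table, expr. Only the column name E17.5KO1 comes from the code above; the other column names, the row names and all values are made up. It also shows how to exclude a column by name, which keeps working if the column order changes.

```r
expr <- data.frame(E17.5KO1 = c(5.1, 6.3),
                   E17.5KO2 = c(5.0, 6.1),
                   E17.5WT1 = c(7.2, 8.0),
                   row.names = c("geneA", "geneB"))

class(expr["E17.5KO1"])     # "data.frame": single brackets keep the table shape
class(expr[["E17.5KO1"]])   # "numeric": the column as a plain vector
class(expr[, 1])            # "numeric": same column, selected by position

# Exclude a column by name rather than by position
kept <- expr[, setdiff(colnames(expr), "E17.5KO1")]
colnames(kept)              # "E17.5KO2" "E17.5WT1"
dim(kept)                   # 2 2
```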
