Data structures, mathematical operations, importation and exportation in Base R #
Learning outcomes #
By the end of this topic, you should be able to
- recognise and create different data structures in R
- demonstrate the advantages of different data structures
- perform mathematical operations
- import external data of different formats into R
- export R data into a variety of formats
- perform set operations
Warning: the material here uses so-called “Base R”. There are easier (and often better) ways of doing things with the “tidyverse” ecosystem (see next section).
Note that page numbers refer to the book The R Software.
Data structure - Vectors - page 51 #
- The basic data structure in R is the vector (a sequence of data points), which we have encountered before
# Create a vector called 'myVector'
(myVector <- c(1,2,3))
## [1] 1 2 3
myVector * 3
## [1] 3 6 9
- We now look into some more operations we can perform on vectors
Vector operations - Some basic functions - Page 87 #
- length(): returns the length of a vector.
- sort(): sorts the elements of a vector, in increasing or decreasing order.
- rev(): rearranges the elements of a vector in reverse order.
- rank(): returns the vector of ranks of the elements.
- head(): returns the first few elements of a vector.
- tail(): returns the last few elements of a vector.
Vector operations - Examples - Page 87-88 #
x <- c(1,3,6,2,7,4,8,1,0)
length(x)
## [1] 9
sort(x)
## [1] 0 1 1 2 3 4 6 7 8
sort(x, decreasing=TRUE)
## [1] 8 7 6 4 3 2 1 1 0
rev(x)
## [1] 0 1 8 4 7 2 6 3 1
rank(x)
## [1] 2.5 5.0 7.0 4.0 8.0 6.0 9.0 2.5 1.0
head(x, 3)
## [1] 1 3 6
tail(x, 2)
## [1] 1 0
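The value 2.5 in the output of rank(x) above arises because the two 1s are tied for ranks 2 and 3, and rank() gives tied elements their average rank by default. The short sketch below shows how the ties.method argument of rank() changes this behaviour.

```r
x <- c(1,3,6,2,7,4,8,1,0)
# Default ("average"): tied values share the average of the ranks they occupy
rank(x)
## [1] 2.5 5.0 7.0 4.0 8.0 6.0 9.0 2.5 1.0
# "min": tied values all receive the smallest of those ranks
rank(x, ties.method = "min")
## [1] 2 5 7 4 8 6 9 2 1
# "first": ties are broken by order of appearance in the vector
rank(x, ties.method = "first")
## [1] 2 5 7 4 8 6 9 3 1
```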
Data structure - Matrices and arrays - page 51 #
- Matrices and arrays are generalisations of vectors
- A matrix has two dimensions (hence you need two indices to access a data point)
- An array allows for even more dimensions (hence you need multiple indices)
Data structure - Matrices and arrays - page 52 #
(X <- matrix(1:12, nrow=4, ncol=3, byrow=TRUE))
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
(X <- matrix(1:12, nrow=4, ncol=3, byrow=FALSE))
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
class(X)
## [1] "matrix" "array"
(X <- array(1:6, dim=c(2,1,3)))
## , , 1
##
## [,1]
## [1,] 1
## [2,] 2
##
## , , 2
##
## [,1]
## [1,] 3
## [2,] 4
##
## , , 3
##
## [,1]
## [1,] 5
## [2,] 6
class(X)
## [1] "array"
Note: we used function class() to identify the structure of X.
Data structure - Matrices and arrays - page 53 #
Take a few minutes to think about how to interpret a three-dimensional array. One way is to picture it as a stack of matrices, one for each value of the third index.
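As a possible aid, the sketch below builds a small three-dimensional array and accesses it in two ways. The name arr and the dimensions are chosen purely for illustration.

```r
# A 2 x 3 x 4 array: think of it as 4 "layers", each a 2 x 3 matrix
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)
## [1] 2 3 4
# Fixing the third index extracts one layer, which is an ordinary matrix
arr[, , 1]
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
# Three indices identify a single element: row 1, column 2, layer 3
arr[1, 2, 3]
## [1] 15
```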
Recycling - Pages 86-87 #
Given an operation on two vectors/matrices/arrays of different lengths, R will complete the shorter data structure by repeating its elements from the beginning. We call this behaviour ‘recycling’:
x <- c(1,2,3,4,5,6)
y <- c(1,2,3)
x + y
## [1] 2 4 6 5 7 9
Another example is below, where the vector 1:3 is repeated to fill in a matrix:
matrix(1:3, ncol=3, nrow=4)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 2 3 1
## [3,] 3 1 2
## [4,] 1 2 3
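One point worth checking for yourself: recycling is silent when the longer length is a multiple of the shorter one, but R issues a warning otherwise. A minimal sketch (the name z is illustrative):

```r
# Lengths 6 and 3: 6 is a multiple of 3, so recycling happens silently
1:6 + 1:3
## [1] 2 4 6 5 7 9
# Lengths 5 and 2: R still recycles, but warns that the longer length
# is not a multiple of the shorter one
z <- suppressWarnings(1:5 + 1:2)  # warning suppressed only to keep the output tidy
z
## [1] 2 4 4 6 6
```

Such a warning is often a sign that the two objects were not meant to have different lengths.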
Merging - Merging columns - Page 89 #
You can merge vectors or matrices together to create a new matrix with functions cbind() and rbind().
(B <- cbind(1:4,5:8))
## [,1] [,2]
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
(C <- cbind(B, 9:12))
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
class(C)
## [1] "matrix" "array"
Try to guess what rbind() does, then look it up in the book or on the internet.
Matrix operations - Pages 315-316-317 #
You can perform ‘usual’ mathematical operations on matrices. What is the mathematical meaning of each of the operations performed below?
A <- matrix(c(2,3,5,4), nrow=2, ncol=2, byrow=TRUE)
B <- matrix(c(1,2,8,7), nrow=2, ncol=2, byrow=FALSE)
I2 <- diag(nrow=2) # identity matrix of size 2x2
A
## [,1] [,2]
## [1,] 2 3
## [2,] 5 4
B
## [,1] [,2]
## [1,] 1 8
## [2,] 2 7
A+B
## [,1] [,2]
## [1,] 3 11
## [2,] 7 11
A*B
## [,1] [,2]
## [1,] 2 24
## [2,] 10 28
A/B
## [,1] [,2]
## [1,] 2.0 0.3750000
## [2,] 2.5 0.5714286
A%*%I2
## [,1] [,2]
## [1,] 2 3
## [2,] 5 4
A%*%B
## [,1] [,2]
## [1,] 8 37
## [2,] 13 68
t(B)
## [,1] [,2]
## [1,] 1 2
## [2,] 8 7
Note: the diag() function can also be used for different purposes (see R help file)
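To illustrate the note above, here is a brief sketch of three common uses of diag(); what it does depends on the kind of argument it receives. The matrix M is an illustrative choice.

```r
# A single number n: the n x n identity matrix (same result as diag(nrow=2))
diag(2)
# A vector: a diagonal matrix with that vector on the diagonal
diag(c(1, 5, 9))
# A matrix: extracts its diagonal as a vector
M <- matrix(1:9, nrow = 3)
diag(M)
## [1] 1 5 9
```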
Matrix operations - the solve() function - Pages 316-317 #
The solve(a,b) function can be used to solve the linear system \(a x = b\) (written a %*% x = b in R) for \(x\). Here \(b\) can be a vector or a matrix. If solve() is used with only one argument, e.g. solve(A), it will return the inverse of the matrix (if it exists).
x <- solve(A, c(1,1))
A%*%x
## [,1]
## [1,] 1
## [2,] 1
solve(B) %*% B
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
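The results of solve() are floating-point numbers, so a product such as solve(B) %*% B may differ from the identity by tiny rounding errors. A hedged sketch of how one might check results with all.equal() rather than == (the name A_inv is illustrative):

```r
A <- matrix(c(2,3,5,4), nrow = 2, ncol = 2, byrow = TRUE)
det(A)                      # 2*4 - 3*5 = -7, non-zero, so A is invertible
x <- solve(A, c(1, 1))      # solves A %*% x = (1, 1)
A_inv <- solve(A)           # inverse of A
# all.equal() compares up to a small numerical tolerance
all.equal(A %*% A_inv, diag(2))
## [1] TRUE
# Solving the system gives the same answer as multiplying by the inverse
all.equal(A_inv %*% c(1, 1), matrix(x))
## [1] TRUE
```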
Matrix operations - The function apply() - Page 93 #
The function apply() is often quite handy. It applies a given function to the elements of all rows (MARGIN=1) or all columns (MARGIN=2) of a matrix.
(X <- matrix(c(1:4, 1, 6:8), nrow = 2))
## [,1] [,2] [,3] [,4]
## [1,] 1 3 1 7
## [2,] 2 4 6 8
apply(X, MARGIN=1, FUN=sum)
## [1] 12 20
apply(X, MARGIN=2, FUN=mean)
## [1] 1.5 3.5 3.5 7.5
Other functions you could use: rowSums(), colSums(), rowMeans(), colMeans().
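As a quick check, the sketch below compares these dedicated functions with apply(), and shows that FUN can also be a function you write yourself (here an illustrative one computing the range of each column).

```r
X <- matrix(c(1:4, 1, 6:8), nrow = 2)
# The dedicated functions give the same results as apply() with sum or mean
all(rowSums(X) == apply(X, MARGIN = 1, FUN = sum))
## [1] TRUE
all(colMeans(X) == apply(X, MARGIN = 2, FUN = mean))
## [1] TRUE
# apply() is more general: FUN can be any function, including your own
apply(X, MARGIN = 2, FUN = function(column) max(column) - min(column))
## [1] 1 1 5 1
```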
Important note on data structures #
- Do not confuse ‘data structure’ (vector, matrix, array,…) with ‘data type’ (which we saw in Week 1).
- A ‘data type’ refers to the type of information (numerical, string, logical, etc.) while a ‘data structure’ refers to how we store (or structure!) the information (in a vector, in a matrix, etc.)
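The distinction can be seen in R itself. In the small sketch below (the objects v and m are illustrative), typeof() reports the data type, while is.matrix() tells us about the data structure.

```r
v <- c(1.5, 2, 3)                              # a vector of numbers
m <- matrix(c("a", "b", "c", "d"), nrow = 2)   # a matrix of character strings
typeof(v)      # data type of the elements
## [1] "double"
typeof(m)
## [1] "character"
is.matrix(v)   # data structure
## [1] FALSE
is.matrix(m)
## [1] TRUE
```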
Data structure - Lists - page 53 #
- Elements stored in vectors, matrices or arrays need to be of the same type (and R automatically converts them to the same type if they are not)
myVector <- c(1,2,"A", TRUE)
myVector
## [1] "1" "2" "A" "TRUE"
typeof(myVector)
## [1] "character"
- Lists can group together, in one structure, data of different types without altering them.
myList <- list(TRUE, my.matrix=matrix(1:4, nrow=2), c(1+2i,3), "A character string")
myList
## [[1]]
## [1] TRUE
##
## $my.matrix
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## [[3]]
## [1] 1+2i 3+0i
##
## [[4]]
## [1] "A character string"
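Once a list is created, its elements can be retrieved by position or by name. The following is a minimal sketch using the list defined above.

```r
myList <- list(TRUE, my.matrix=matrix(1:4, nrow=2), c(1+2i,3), "A character string")
myList[[1]]              # double brackets extract the element itself
## [1] TRUE
myList$my.matrix[1, 2]   # named elements can also be accessed with $
## [1] 3
class(myList[1])         # single brackets return a (shorter) list
## [1] "list"
length(myList)           # number of elements in the list
## [1] 4
```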
Data structure - Data frames - page 54 #
A data.frame in R is a table where
- each row represents a single observation (e.g. an individual)
- each column represents a single variable, which must be of the same data type across all rows
Data frames are widely used in R because
- they can hold several data types at once (one per column)
- in many cases, a dataset can naturally be formulated as a data.frame
Data structure - Data frames #
BMI <- data.frame(
Gender=c("M","F","M","F"),
Height=c(1.83,1.76,1.82,1.60),
Weight=c(67,58,66,48),
row.names=c("Jack","Julia","Henry","Emma"))
BMI
## Gender Height Weight
## Jack M 1.83 67
## Julia F 1.76 58
## Henry M 1.82 66
## Emma F 1.60 48
str(BMI)
## 'data.frame': 4 obs. of 3 variables:
## $ Gender: chr "M" "F" "M" "F"
## $ Height: num 1.83 1.76 1.82 1.6
## $ Weight: num 67 58 66 48
You can access a specific variable using the $ operator
BMI$Gender
## [1] "M" "F" "M" "F"
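Beyond $, a data frame can be indexed like a matrix, and new variables can be added to it. The sketch below recreates the BMI data frame from above; the column name Index is our own choice for illustration.

```r
BMI <- data.frame(
  Gender=c("M","F","M","F"),
  Height=c(1.83,1.76,1.82,1.60),
  Weight=c(67,58,66,48),
  row.names=c("Jack","Julia","Henry","Emma"))
BMI["Julia", "Height"]        # one cell, by row name and column name
## [1] 1.76
BMI[BMI$Gender == "F", ]      # all rows for which Gender is "F"
# Add a new column: body mass index = weight / height^2
BMI$Index <- BMI$Weight / BMI$Height^2
str(BMI)
```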
Merging - Merging columns of data frames - Page 89-91 #
One may need to add to a data set some variables from another dataset (that has the same subjects). In these cases, you should use the merge() function with appropriate ‘by’ arguments. Let’s see an example.
X <- data.frame(GENDER=c("F","M","M","F"),
ID=c(123,234,345,456),
NAME=c("Mary","James","James","Olivia"),
Height=c(170,180,185,160))
Y <- data.frame(GENDER=c("M","F","F","M"),
ID=c(345,456,123,234),
NAME=c("James","Olivia","Mary","James"),
Weight=c(80,50,70,60))
X
## GENDER ID NAME Height
## 1 F 123 Mary 170
## 2 M 234 James 180
## 3 M 345 James 185
## 4 F 456 Olivia 160
Y
## GENDER ID NAME Weight
## 1 M 345 James 80
## 2 F 456 Olivia 50
## 3 F 123 Mary 70
## 4 M 234 James 60
cbind(X,Y)
## GENDER ID NAME Height GENDER ID NAME Weight
## 1 F 123 Mary 170 M 345 James 80
## 2 M 234 James 180 F 456 Olivia 50
## 3 M 345 James 185 F 123 Mary 70
## 4 F 456 Olivia 160 M 234 James 60
merge(X,Y, by=c("NAME")) # Not good, because NAME is not unique among rows
## NAME GENDER.x ID.x Height GENDER.y ID.y Weight
## 1 James M 234 180 M 345 80
## 2 James M 234 180 M 234 60
## 3 James M 345 185 M 345 80
## 4 James M 345 185 M 234 60
## 5 Mary F 123 170 F 123 70
## 6 Olivia F 456 160 F 456 50
merge(X,Y, by=c("ID")) # Better
## ID GENDER.x NAME.x Height GENDER.y NAME.y Weight
## 1 123 F Mary 170 F Mary 70
## 2 234 M James 180 M James 60
## 3 345 M James 185 M James 80
## 4 456 F Olivia 160 F Olivia 50
merge(X,Y) # Does not replicate variables present in both data sets
## GENDER ID NAME Height Weight
## 1 F 123 Mary 170 70
## 2 F 456 Olivia 160 50
## 3 M 234 James 180 60
## 4 M 345 James 185 80
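Another option, sketched below, is to merge by the unique key ID while keeping only the new variable from the second data set, which avoids the duplicated .x and .y columns. The name merged is illustrative.

```r
X <- data.frame(GENDER=c("F","M","M","F"),
                ID=c(123,234,345,456),
                NAME=c("Mary","James","James","Olivia"),
                Height=c(170,180,185,160))
Y <- data.frame(GENDER=c("M","F","F","M"),
                ID=c(345,456,123,234),
                NAME=c("James","Olivia","Mary","James"),
                Weight=c(80,50,70,60))
# Keep only the key (ID) and the new variable (Weight) from Y before merging
merged <- merge(X, Y[, c("ID", "Weight")], by = "ID")
names(merged)
## [1] "ID"     "GENDER" "NAME"   "Height" "Weight"
merged$Weight   # rows are sorted by ID: 123, 234, 345, 456
## [1] 70 60 80 50
```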
Merging - Example #
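By default, merge() keeps only the rows whose values of the by argument appear in both datasets. Setting all.x = TRUE keeps every row of the first dataset (x), and fills in NA where the second dataset (y) has no matching row.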
current_courses <- data.frame( Course_Code = c("ACTL1101", "MATH1251", "ACCT1511", "ECON1102"))
actl_info <- data.frame(Course_Code = c("ACTL1101", "ACTL2111", "ACTL2131", "ACTL2102"),
Course_Name = c("Introduction to Actuarial Studies",
"Financial Mathematics for Actuaries",
"Probability and Mathematical Statistics",
"Foundations of Actuarial Models"))
merge(x=current_courses, y=actl_info, by = "Course_Code", all.x = TRUE)
## Course_Code Course_Name
## 1 ACCT1511 <NA>
## 2 ACTL1101 Introduction to Actuarial Studies
## 3 ECON1102 <NA>
## 4 MATH1251 <NA>
merge(x=current_courses, y=actl_info, by = "Course_Code")
## Course_Code Course_Name
## 1 ACTL1101 Introduction to Actuarial Studies
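merge() also has the arguments all.y and all, which work in the same spirit. A small sketch on the same data (the names right and full are illustrative):

```r
current_courses <- data.frame(Course_Code = c("ACTL1101", "MATH1251", "ACCT1511", "ECON1102"))
actl_info <- data.frame(Course_Code = c("ACTL1101", "ACTL2111", "ACTL2131", "ACTL2102"),
                        Course_Name = c("Introduction to Actuarial Studies",
                                        "Financial Mathematics for Actuaries",
                                        "Probability and Mathematical Statistics",
                                        "Foundations of Actuarial Models"))
# all.y = TRUE keeps every row of y; all = TRUE keeps every row of both
right <- merge(current_courses, actl_info, by = "Course_Code", all.y = TRUE)
full  <- merge(current_courses, actl_info, by = "Course_Code", all = TRUE)
nrow(right)
## [1] 4
nrow(full)   # 4 + 4 courses, minus the one they share
## [1] 7
```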
Data structure - Factors #
A factor is used to store categorical data, typically given as character strings
- each distinct value is treated as a category, called a level (even if the input is a real number)
- some functions require data structured as a factor
x <- factor(c("blue","green","blue","red","blue","green","green"))
levels(x)
## [1] "blue" "green" "red"
class(x)
## [1] "factor"
Note: Function ‘levels()’ shows you all the unique levels (categories) of a factor, which can be handy.
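Internally, a factor stores one integer code per element, together with the table of levels. The sketch below illustrates this, along with a common pitfall when the input is numeric (the factor f is an illustrative example).

```r
x <- factor(c("blue","green","blue","red","blue","green","green"))
table(x)        # number of elements in each level
as.integer(x)   # internal codes: 1 = "blue", 2 = "green", 3 = "red"
## [1] 1 2 1 3 1 2 2
# Pitfall with numeric input: as.numeric() returns the codes, not the values
f <- factor(c(10, 20, 10))
as.numeric(f)
## [1] 1 2 1
as.numeric(as.character(f))   # recovers the original numbers
## [1] 10 20 10
```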
Data structure - Summary #
Data structure | Instruction in R | Description |
---|---|---|
vector | c() | Sequence of elements of the same nature |
matrix | matrix() | Two-dimensional table of elements of the same nature |
array | array() | More general than a matrix; table with several dimensions |
list | list() | Sequence of R structures of any (and possibly different) nature. |
data frame | data.frame() | Two-dimensional table. The columns can be of different natures, but must have the same length. |
factor | factor() | Vector of character strings associated with a modality table |
dates | as.Date() | Vector of dates |
time series | ts() | Values of a variable observed at several time points |
Importing data - text files - Page 64 #
- R is a great tool to analyse data… but we first need to get our data into R!
- There are three main R functions to import data from a text file. Here we focus on the read.table and read.csv functions, which are widely used to import text and csv files (including data exported from Excel as csv).
Function name | Description |
---|---|
read.table() | best suited for data sets presented as tables, as is often the case in statistics |
read.ftable() | reads contingency tables |
scan() | much more flexible and powerful. Use this in all other cases |
Importing data - read.table() - Pages 64-65 #
Argument name | Description |
---|---|
file=path/to/file | Location and name of the file to be read |
header=TRUE | Indicates whether the variable names are given on the first line of the file |
sep="" | This is the field separator character. Values on each line of the file are separated by this character (e.g. "" = whitespace, "," = comma, "\t" = tab) |
dec="." | Decimal mark for numbers ("." or ",") |
row.names=1 | 1st column of the file gives the individuals' names |
To read data from a text file:
my.data <- read.table(file=file.choose(), header=TRUE, sep="\t", dec=".", row.names=1)
The path can be specified explicitly:
my.data <- read.table(file="C:/MyFolder/somedata.txt")
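Alternatively, on Windows you can use a double backslash \\ instead of /. A single backslash will not work, because R treats it as an escape character inside strings.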
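To try these arguments without needing a particular file on your computer, the sketch below first writes a small tab-separated file to a temporary location and then reads it back. The data and the name tmp are made up for illustration.

```r
# Create a small tab-separated file to read (hypothetical data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Name\tHeight\tWeight",
             "Jack\t1.83\t67",
             "Julia\t1.76\t58"), tmp)
# First line holds the variable names; first column holds the row names
my.data <- read.table(file = tmp, header = TRUE, sep = "\t", dec = ".", row.names = 1)
my.data
str(my.data)
# file.path() assembles a path from its components
file.path("MyFolder", "somedata.txt")
## [1] "MyFolder/somedata.txt"
```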
Importing data - read.table() - Pages 64-66 #
Note:
- If the file is located in your current working directory, you do not need to specify the full path
data <- read.table(file="mydata.txt")
- You can visualise the beginning or end of your data by using head(data) or tail(data)
Importing data - Exercise - Page 66 #
- Download the Danish fire data from Moodle and save it to your working directory (set up the working directory somewhere on your computer, or start a new project in RStudio). Name it danishfire.txt.
- Use the function readLines() to visualize the beginning of the data. Why should we do so?
- Import the data and store it in a data frame called danish_fire.
- Display the first few records of danish_fire by using the head() function.
- Display the structure of danish_fire.
- Calculate the mean of the building losses, as well as the correlation between the building losses and contents losses. Hint: Use functions mean() and cor().
Importing data - Solution - Page 66 #
Use the function readLines() to visualize the beginning of the data file before you read the data, so you know the structure of the data and are able to determine the arguments of the function read.table()
#Careful: this line will be different for you, enter the path of where 'danishfire.txt' is for you
setwd("C:/Users/z3519303/Dropbox/ACTL1101 2019/R_lab_2019/Week 2")
readLines("danishfire.txt",n=3)
Now let’s import the data (note the first row is the header)
# import data, note that the header argument is TRUE
danish_fire <- read.table(file = "danishfire.txt", header = TRUE)
head(danish_fire)
class(danish_fire)
str(danish_fire)
# some numerical analysis
mean(danish_fire$building)
cor(danish_fire$building, danish_fire$contents)
Importing data - Standard formats (e.g. csv) - Page 67 #
With standard formats, most of the arguments of the function read.table() are fixed. There are some functions that are designed to be equivalent to read.table() with several arguments filled with pre-determined values:
- read.csv(): .csv format (csv stands for comma separated values)
- read.delim(): Tab-separated data
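The equivalence can be checked on a tiny example. The sketch below uses the text argument to read from a character string instead of a file; the data are made up. It also shows read.csv2(), the variant for files that use ";" as separator and "," as decimal mark.

```r
csv_text <- "id,score\n1,2.5\n2,3.5"
a <- read.csv(text = csv_text)
b <- read.table(text = csv_text, header = TRUE, sep = ",")
all.equal(a, b)
## [1] TRUE
# Semicolon-separated values with decimal commas
read.csv2(text = "id;score\n1;2,5\n2;3,5")$score
## [1] 2.5 3.5
```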
Exporting data - Pages 72-73 #
Now try to export the Danish fire data you have imported into R.
# Exporting data to a text file
write.table(danish_fire, file = "myfile.txt", sep = "\t")
# Exporting data to a csv file
write.csv(danish_fire, file = "myfile.csv")
You can also save data to the clipboard and then paste them into your spreadsheet.
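A useful habit is to read an exported file back in and check that nothing was lost. The sketch below does this with a small made-up data frame (losses is a hypothetical stand-in for danish_fire, so that the example is self-contained).

```r
# Hypothetical stand-in for danish_fire
losses <- data.frame(building = c(1.2, 3.4, 0.9), contents = c(0.5, 0.7, 1.1))
out_file <- tempfile(fileext = ".csv")
# row.names = FALSE avoids writing the row names as an extra first column
write.csv(losses, file = out_file, row.names = FALSE)
back <- read.csv(out_file)
all.equal(losses, back)
## [1] TRUE
```

Without row.names = FALSE, the row names are written as an unnamed first column, which shows up as an extra variable when the file is read back with read.csv().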
Additional notes - data structures #
There are many functions related to data structures in R, and it is practically impossible to go through them all in class. We have summarised some useful contents below.
- Replicating elements (see Pages 73-74)
rep(1,4)
## [1] 1 1 1 1
rep(1:4, 2)
## [1] 1 2 3 4 1 2 3 4
rep(1:4, each = 2)
## [1] 1 1 2 2 3 3 4 4
rep(1:4, c(2,1,2,3))
## [1] 1 1 2 3 3 4 4 4
rep(1:4, each = 2, len = 4)
## [1] 1 1 2 2
rep(1:3, each = 2, times = 3)
## [1] 1 1 2 2 3 3 1 1 2 2 3 3 1 1 2 2 3 3
- Naming the elements of a vector using the function names (see Page 51)
vec <- 1:10
names(vec) <- letters[1:10]
vec
## a b c d e f g h i j
## 1 2 3 4 5 6 7 8 9 10
Here, letters is a built-in constant, and letters[1:10] returns the first 10 lowercase letters of the alphabet.
- There are other data structures we haven’t introduced yet. Two common ones are the date structure and time series structure.
- You can use the as.Date() function for dates
dates <- c("92/27/02","92/02/27")
as.Date(dates,"%y/%m/%d")
## [1] NA "1992-02-27"
- You can also create a time series structure
ts(1:10, frequency=4, start=c(1959,2))
##      Qtr1 Qtr2 Qtr3 Qtr4
## 1959         1    2    3
## 1960    4    5    6    7
## 1961    8    9   10
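Once data are stored as dates or as a time series, R can work with them accordingly. A brief sketch follows; the dates and the names d and quarterly are illustrative.

```r
d <- as.Date(c("2024-01-31", "2024-03-01"))   # the default format is year-month-day
diff(d)                     # dates support arithmetic
## Time difference of 30 days
format(d, "%d/%m/%Y")       # convert back to character strings, in another format
## [1] "31/01/2024" "01/03/2024"
quarterly <- ts(1:10, frequency=4, start=c(1959,2))
# window() extracts a sub-period: here the four quarters of 1960
window(quarterly, start = c(1960, 1), end = c(1960, 4))
```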
Additional notes - read.ftable - see Page 67 #
Following is an example of a contingency table
GENDER | tobacco | alcohol: never | alcohol: occasional | alcohol: regular |
---|---|---|---|---|
M | non-smoker | 6 | 19 | 7 |
M | former smoker | 0 | 9 | 0 |
M | smoker | 1 | 6 | 5 |
F | non-smoker | 12 | 26 | 2 |
F | former smoker | 3 | 5 | 1 |
F | smoker | 1 | 6 | 1 |
Intima.table <- read.ftable("Intima_ftable.txt",row.var.names=c("GENDER","tobacco"),col.vars=list("alcohol"=c("never","occasional","regular")))
Intima.table
Additional notes - scan - see Page 68 #
Function scan() should be used when the data are not organised as a rectangular table. For example, suppose your data file contains the following lines:
# Reading variable names:
variable.name <- scan("Intima_Media2.txt",skip=4,nlines=1,what="")
# Reading data:
data <- scan("Intima_Media2.txt",skip=7,dec=",")
mytable <- as.data.frame(matrix(data,ncol=9,byrow=TRUE))
colnames(mytable) <- variable.name
mytable
Note
- The argument skip=n is used to omit reading the first n lines of the file
- You can read data files directly from the Internet with read.table() and scan(), by giving a URL as the file argument
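The same steps can be tried on a small self-contained example. The file written below is hypothetical (a title line, a line of variable names, then values with decimal commas), and all names are illustrative.

```r
# Hypothetical file: a title line, a line of variable names, then the values
tmp <- tempfile(fileext = ".txt")
writeLines(c("Measurements taken in the lab",
             "x y z",
             "1,5 2 3",
             "4 5,5 6"), tmp)
# Skip the title, then read one line of character strings
var_names <- scan(tmp, skip = 1, nlines = 1, what = "", quiet = TRUE)
# Skip the first two lines, then read numbers with "," as decimal mark
values <- scan(tmp, skip = 2, dec = ",", quiet = TRUE)
small_table <- as.data.frame(matrix(values, ncol = 3, byrow = TRUE))
colnames(small_table) <- var_names
small_table
```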
Additional notes - Importing data from Excel (see Pages 69-70) #
- Approach One: Using copy-paste:
  - Using the mouse, select the range of the data (in the spreadsheet) which you wish to import into R. Once the data are selected, copy them to the clipboard (from the Edit menu, or with the keyboard shortcuts Ctrl+C on Windows or Command+C on a Mac).
  - All you need to do now is type the following instruction in the R console to transfer the data from the clipboard. Note that file("clipboard") works on Windows; on a Mac, use pipe("pbpaste") in its place.
x <- read.table(file("clipboard"), sep="\t", header=TRUE, dec=",")
- Approach Two: Using an intermediary text file:
  - Save your file as a csv file or tab-separated file, then use the procedures for importing text files discussed earlier.
Additional notes - Importing data from some other software - Page 71 #
Software | Package | R function | File extension | Output format |
---|---|---|---|---|
SPSS | foreign | read.spss() | *.sav | list |
Minitab | foreign | read.mtp() | *.mtp | list |
SAS | foreign | read.xport() | *.xpt | data.frame |
Matlab | R.matlab | readMat() | *.mat | list |
Additional notes - Importing large data files - Page 71 #
R can handle large data sets, but by default read.table() has to guess the type of each column, which slows down the import. Reading is quicker if you specify the type of each column explicitly, using the argument colClasses of the function read.table().
tm <- Sys.time() # Gets the current time.
dbsnp <- read.table("dbsnp123.dat")
Sys.time() - tm
tm <- Sys.time()
dbsnp2 <- read.table("dbsnp123.dat",
colClasses=rep("character",3))
Sys.time() - tm
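Besides speed, colClasses also gives you control over how each column is interpreted. The sketch below uses a tiny made-up file in which an identifier has leading zeros; the file contents and names are illustrative.

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("id value", "007 1.5", "042 2.5"), tmp)
guessed  <- read.table(tmp, header = TRUE)
declared <- read.table(tmp, header = TRUE, colClasses = c("character", "numeric"))
guessed$id     # R guessed integer, so the leading zeros are lost
## [1]  7 42
declared$id    # read as character, exactly as written in the file
## [1] "007" "042"
```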
Additional notes - Matrix operations - Diagonals, upper and lower matrices - Page 319 #
Execute the following instructions. What is the mathematical meaning of each operation?
M <- matrix(1:9,nrow=3)
diag(M)
## [1] 1 5 9
M[lower.tri(M)]
## [1] 2 3 6
M[lower.tri(M)] <- 0
M[upper.tri(M)] <- 0
M
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 5 0
## [3,] 0 0 9
Additional notes - Merging rows - Pages 92-93 #
- Method 1: The generic function is rbind().
rbind(1:4,5:8)
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
- Method 2: Alternatively, you can install the gtools package (via the Packages window in RStudio) and use the smartbind() function
library(gtools) # load the package that provides smartbind()
df1 <- data.frame(A=1:3, B=LETTERS[1:3])
df2 <- data.frame(A=6:8, E=letters[1:3])
smartbind(df1, df2)
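If gtools is not available, a similar result can be obtained in Base R. The sketch below is only one possible approach under that assumption, not how smartbind() itself is implemented: it adds each missing column filled with NA, then calls rbind(). The helper fill_missing() and the other names are our own.

```r
df1 <- data.frame(A=1:3, B=LETTERS[1:3])
df2 <- data.frame(A=6:8, E=letters[1:3])
all_cols <- union(names(df1), names(df2))
# Hypothetical helper: add any missing column as NA, then order the columns
fill_missing <- function(df, cols) {
  for (nm in setdiff(cols, names(df))) df[[nm]] <- NA
  df[, cols]
}
combined <- rbind(fill_missing(df1, all_cols), fill_missing(df2, all_cols))
combined
dim(combined)
## [1] 6 3
```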
Additional notes - Set operations #
- R allows for set operations - Page 99
A <- c(4,5,2,7)
B <- c(2,1,7,3)
C <- c(2,3,7)
is.element(C,A)
## [1] TRUE FALSE TRUE
all(A%in%B)
## [1] FALSE
intersect(A,B)
## [1] 2 7
union(A,B)
## [1] 4 5 2 7 1 3
setdiff(A,B)
## [1] 4 5
- Exercises
Given that
A <- c(4,5,2,7)
B <- c(2,1,7,3)
C <- c(2,3,7)
Calculate
- the collection of elements of A and B that only belong to one set
- the number of elements that belong to A, B and C
- Solutions
# the collection of elements of A and B that only belong to one set
union(setdiff(A,B), setdiff(B,A))
## [1] 4 5 1 3
# the number of elements that belong to A, B and C
length(intersect(intersect(A,B), C))
## [1] 2
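The same answers can be reached in other ways. The sketch below shows two alternatives, and also illustrates that set functions discard duplicated elements.

```r
A <- c(4,5,2,7)
B <- c(2,1,7,3)
C <- c(2,3,7)
# Elements in exactly one of A and B: the union minus the intersection
setdiff(union(A, B), intersect(A, B))
## [1] 4 5 1 3
# Reduce() applies intersect() successively across any number of sets
Reduce(intersect, list(A, B, C))
## [1] 2 7
# Set functions treat their arguments as sets, so duplicates are dropped
union(c(1, 1, 2), c(2, 3))
## [1] 1 2 3
```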
Relevant exercises #
- 3.14, 3.15, 4.1-4.15, 5.2, 5.4, 5.5, 10.1, 10.3, 10.5-10.10