Data manipulation, operations and work strategy in Base R #
Learning outcomes #
By the end of this topic, you should be able to
-
extract elements from different R data structures in an efficient way
-
manipulate different types of R data structures with appropriate R functions
-
implement work strategies of R and R Studio
Warning: the materials here are so-called “Base R”. There are easier (and often better) ways of doing things with the “tidyverse” ecosystem (see next section).
Note that page numbers refer to the book The R Software.
Vectors - Extracting elements from vectors - Method I - Page 100 #
By using the index/indices
vec <- c(2,4,6,8,3)
vec[2]
## [1] 4
vec[-2]
## [1] 2 6 8 3
vec[2:5]
## [1] 4 6 8 3
vec[-c(1,5)]
## [1] 4 6 8
Vectors - Extracting elements from vectors - Method II - Pages 100-101 #
By using logical masks (that is, by specifying whether each element will be extracted using a logical index):
vec
## [1] 2 4 6 8 3
vec[c(F,F,T,T,F)]
## [1] 6 8
vec[vec>4]
## [1] 6 8
Advantages: fast execution and easy communication.
Vectors - Extracting elements from vectors - Three useful functions - Page 101 #
x <- c(1,9,0,0,-5,9,-5)
which.max(x)
## [1] 2
which.min(x)
## [1] 5
which(x==0)
## [1] 3 4
Matrix - Extraction from matrices- Method I - Pages 102-103 #
By using indices X[indr, indc]
indr
is the vector of indices of rows andindc
is the vector of indices of columns to extract- omitting
indr
(respectivelyindc
) means that all rows are selected (respectively all columns) indr
andindc
can be preceded by a minus sign (-) to indicate elements not to extract
Mat <- matrix(1:12,nrow=4,byrow=TRUE)
Mat
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
Mat[2,3]
## [1] 6
Mat[,1]
## [1] 1 4 7 10
Mat[c(1,4),]
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 10 11 12
Mat[3,-c(1,3)]
## [1] 8
Matrix - Extraction from matrices- Method II - Pages 102-103 #
By using logical masks X[mask]
- a mask is a matrix of logical values (TRUE or FALSE) of the same size as X, indicating which elements to extract
Mat <- matrix(1:12,nrow=4,byrow=TRUE)
MatLogical <- matrix(c(TRUE,FALSE),nrow=4,ncol=3)
Mat
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
MatLogical
## [,1] [,2] [,3]
## [1,] TRUE TRUE TRUE
## [2,] FALSE FALSE FALSE
## [3,] TRUE TRUE TRUE
## [4,] FALSE FALSE FALSE
Mat[MatLogical]
## [1] 1 7 2 8 3 9
Matrix - The which() function for matrices - Page 104 #
m <- matrix(c(1,2,3,1,2,3,2,1,3),3,3)
m
## [,1] [,2] [,3]
## [1,] 1 1 2
## [2,] 2 2 1
## [3,] 3 3 3
which(m == 1)
## [1] 1 4 8
# m is seen as the concatenation of its columns
which(m == 1,arr.ind=TRUE)
## row col
## [1,] 1 1
## [2,] 1 2
## [3,] 2 3
# this gives the indices (row and column) of all elements = 1
Data frame example #
X <- data.frame(GENDER=c("F","M","M","F"),
Height=c(165,182,178,160),
Weight=c(50,65,67,55),
Income=c(80,90,60,50))
row.names(X)<-c('Aiyana','James','John','Abbey')
X
## GENDER Height Weight Income
## Aiyana F 165 50 80
## James M 182 65 90
## John M 178 67 60
## Abbey F 160 55 50
X[3,2]
## [1] 178
X['John','Height']
## [1] 178
X$Weight
## [1] 50 65 67 55
X["John",]
## GENDER Height Weight Income
## John M 178 67 60
X[,"Height"]
## [1] 165 182 178 160
The attach() function #
The function attach(some.data.frame)
creates a workspace where each variable of the data frame can be accessed directly (without using the dollar sign $
). Therefore it can reduce the length of codes. However, this space does not ‘communicate’ with the working directory. Therefore, if you make a change in the data frame, it will not be effective in this additional workspace. We then recommend to be very careful with the attach()
function (especially for large projects).
attach(X)
GENDER
## [1] "F" "M" "M" "F"
# We update GENDER in our data frame X
X$GENDER <- c("F", "F", "F", "F")
X$GENDER
## [1] "F" "F" "F" "F"
# However, the version of X we 'attached' has NOT been updated
GENDER
## [1] "F" "M" "M" "F"
dplyr - Introduction #
dplyr is a commonly used package. It makes manipulating dataframes very easy and intuitive. To install (and load) any package, just type
install.packages('dplyr') # Installs the package, you need this only once
library('dplyr') # Loads the package, you need this every time you use it
There are 6 main functions in dplyr:
-
filter - selects rows
-
select - selects columns
-
arrange - sorts rows by some order
-
mutate - creates new columns
-
group_by - groups certain rows together
-
summarise - provides summaries of grouped rows
dplyr - Filter and Select #
Here some examples using the in-built data frame ‘mtcars’
# Filters the dataset for observations with mpg greater than 20
filter(mtcars, mpg > 30)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Returns the dataframe mtcars but only with the 3 specified columns
select(mtcars, mpg, cyl, wt)
## mpg cyl wt
## Mazda RX4 21.0 6 2.620
## Mazda RX4 Wag 21.0 6 2.875
## Datsun 710 22.8 4 2.320
## Hornet 4 Drive 21.4 6 3.215
## Hornet Sportabout 18.7 8 3.440
## Valiant 18.1 6 3.460
## Duster 360 14.3 8 3.570
## Merc 240D 24.4 4 3.190
## Merc 230 22.8 4 3.150
## Merc 280 19.2 6 3.440
## Merc 280C 17.8 6 3.440
## Merc 450SE 16.4 8 4.070
## Merc 450SL 17.3 8 3.730
## Merc 450SLC 15.2 8 3.780
## Cadillac Fleetwood 10.4 8 5.250
## Lincoln Continental 10.4 8 5.424
## Chrysler Imperial 14.7 8 5.345
## Fiat 128 32.4 4 2.200
## Honda Civic 30.4 4 1.615
## Toyota Corolla 33.9 4 1.835
## Toyota Corona 21.5 4 2.465
## Dodge Challenger 15.5 8 3.520
## AMC Javelin 15.2 8 3.435
## Camaro Z28 13.3 8 3.840
## Pontiac Firebird 19.2 8 3.845
## Fiat X1-9 27.3 4 1.935
## Porsche 914-2 26.0 4 2.140
## Lotus Europa 30.4 4 1.513
## Ford Pantera L 15.8 8 3.170
## Ferrari Dino 19.7 6 2.770
## Maserati Bora 15.0 8 3.570
## Volvo 142E 21.4 4 2.780
# Returns the columns mpg and cyl for rows with wt greater than 2
select(filter(mtcars, wt > 2), mpg, cyl)
## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
## Duster 360 14.3 8
## Merc 240D 24.4 4
## Merc 230 22.8 4
## Merc 280 19.2 6
## Merc 280C 17.8 6
## Merc 450SE 16.4 8
## Merc 450SL 17.3 8
## Merc 450SLC 15.2 8
## Cadillac Fleetwood 10.4 8
## Lincoln Continental 10.4 8
## Chrysler Imperial 14.7 8
## Fiat 128 32.4 4
## Toyota Corona 21.5 4
## Dodge Challenger 15.5 8
## AMC Javelin 15.2 8
## Camaro Z28 13.3 8
## Pontiac Firebird 19.2 8
## Porsche 914-2 26.0 4
## Ford Pantera L 15.8 8
## Ferrari Dino 19.7 6
## Maserati Bora 15.0 8
## Volvo 142E 21.4 4
dplyr - Pipeline operator, Arrange and Mutate #
As you apply more operations to your dataframe, the nested functions you use can get hard to read
Luckily, the dplyr package includes the useful ‘pipeline’ operator %>% (orally referred to as “pipe”, such as in “A pipe B”). Think of it as “and then”
# Take the dataframe, and then filter for mpg greater than 20
mtcars %>% filter(mpg>30)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Do the same, and then select certain columns
mtcars %>%
filter(mpg > 30) %>%
select(mpg, wt)
## mpg wt
## Fiat 128 32.4 2.200
## Honda Civic 30.4 1.615
## Toyota Corolla 33.9 1.835
## Lotus Europa 30.4 1.513
########################### arrange ##################################
# Sort the rows in ascending order of wt
mtcars %>%
filter(mpg > 30) %>%
arrange(wt)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Sort the rows in descending order of wt
mtcars %>%
filter(mpg > 30) %>%
arrange(desc(wt))
## mpg cyl disp hp drat wt qsec vs am gear carb
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
########################### mutate ###################################
mtcars %>%
mutate(double_wt = wt*2, # Creates a new variable called double_wt
kms_per_gallon = mpg*1.61, # Creates a new variable called kms_per_gallon
name = row.names(mtcars)) %>% # Creates a new variable with names of the cars
select(name, double_wt, kms_per_gallon)
## name double_wt kms_per_gallon
## 1 Mazda RX4 5.240 33.810
## 2 Mazda RX4 Wag 5.750 33.810
## 3 Datsun 710 4.640 36.708
## 4 Hornet 4 Drive 6.430 34.454
## 5 Hornet Sportabout 6.880 30.107
## 6 Valiant 6.920 29.141
## 7 Duster 360 7.140 23.023
## 8 Merc 240D 6.380 39.284
## 9 Merc 230 6.300 36.708
## 10 Merc 280 6.880 30.912
## 11 Merc 280C 6.880 28.658
## 12 Merc 450SE 8.140 26.404
## 13 Merc 450SL 7.460 27.853
## 14 Merc 450SLC 7.560 24.472
## 15 Cadillac Fleetwood 10.500 16.744
## 16 Lincoln Continental 10.848 16.744
## 17 Chrysler Imperial 10.690 23.667
## 18 Fiat 128 4.400 52.164
## 19 Honda Civic 3.230 48.944
## 20 Toyota Corolla 3.670 54.579
## 21 Toyota Corona 4.930 34.615
## 22 Dodge Challenger 7.040 24.955
## 23 AMC Javelin 6.870 24.472
## 24 Camaro Z28 7.680 21.413
## 25 Pontiac Firebird 7.690 30.912
## 26 Fiat X1-9 3.870 43.953
## 27 Porsche 914-2 4.280 41.860
## 28 Lotus Europa 3.026 48.944
## 29 Ford Pantera L 6.340 25.438
## 30 Ferrari Dino 5.540 31.717
## 31 Maserati Bora 7.140 24.150
## 32 Volvo 142E 5.560 34.454
dplyr - Why use the pipeline operator? #
The pipeline operator makes it much easier to read what is going on. Note that the following two pieces of code achieve the same thing.
# Using base R
arrange(select(filter(mtcars, mpg > 30), mpg, wt, cyl), desc(wt))
## mpg wt cyl
## Fiat 128 32.4 2.200 4
## Toyota Corolla 33.9 1.835 4
## Honda Civic 30.4 1.615 4
## Lotus Europa 30.4 1.513 4
# Using dplyr
mtcars %>%
filter(mpg > 30) %>%
select(mpg, wt, cyl) %>%
arrange(desc(wt))
## mpg wt cyl
## Fiat 128 32.4 2.200 4
## Toyota Corolla 33.9 1.835 4
## Honda Civic 30.4 1.615 4
## Lotus Europa 30.4 1.513 4
It is hard to quickly get what the base R version is doing. In contrast, you can easily interpret the dplyr versions as:
-
take the mtcars dataset, and then
-
filter for mpg > 30, and then
-
select the columns mpg, wt, cyl, and then
-
arrange them by descending weight
dplyr - summarise and group_by #
summarise() will give you summary statistics
mtcars %>%
summarise(mean_mpg = mean(mpg),
num_cars = n(),
sd_wt = sd(wt))
## mean_mpg num_cars sd_wt
## 1 20.09062 32 0.9784574
group_by is a very powerful function that will group rows into different categories. You can then use summarise() to find summary statistics of each group
mtcars %>%
group_by(cyl) %>% # Group cars by how many cylinders they have
summarise(avg_mpg = mean(mpg)) # Find the average mpg for each group (cyl)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## cyl avg_mpg
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
Common summary statistics include:
- mean() the average
- sd() the standard deviation
- n() the count
dplyr - Why use dplyr? #
dplyr is generally easier to read and write than base R. The following codes do the same thing
Base R:
new_df <- mtcars[mtcars$mpg > 30 & mtcars$wt < 2,]
new_df["kmpl"] <- new_df$mpg * 0.264172 * 1.60934
new_df <- new_df[, c("kmpl", "disp", "wt")]
dplyr:
new_df <- mtcars %>%
mutate(name = rownames(mtcars),
kmpl = mpg * 0.264172 * 1.60934) %>%
filter(mpg > 30,
wt < 2) %>%
select(name,
kmpl,
disp,
wt)
In particular, group_by and summarise are very easy to write with dplyr. They are significantly harder to write in base R:
Base R:
avg_weight <- aggregate(mtcars$wt,
by = list(cylinders = mtcars$cyl,
gears = mtcars$gear),
FUN = mean)
names(avg_weight)[3] <- "Average_weight"
sd_weight <- aggregate(mtcars$wt,
by = list(cylinders = mtcars$cyl,
gears = mtcars$gear),
FUN = sd)
names(sd_weight)[3] <- "std_dev_weight"
new_df <- merge(avg_weight, sd_weight, by = c("cylinders", "gears"))
dplyr:
new_df <- mtcars %>%
group_by(cyl, gear) %>%
summarise(avg_weight = mean(wt),
sd_weight = sd(wt))
## `summarise()` regrouping output by 'cyl' (override with `.groups` argument)
dplyr - More information #
-
Cheat sheet for dplyr - https://ugoproto.github.io/ugo_r_doc/dplyr.pdf
-
Online tutorial - http://genomicsclass.github.io/book/pages/dplyr_tutorial.html
-
Pick and choose from this Lynda course on: “tidyverse”
If you complete the Lynda course this will give you a certification that you can them add to your LinkedIn account!
Work strategy - page 285 #
Recall that to access files directly, you can set up your working directory using:
- setwd(“C:/Users/…").
- in Rstudio: Go to Session
\(\rightarrow\)
Set Working Directory
You can also read and execute the contents of an external R script file using the source() function:
- source(“C:/Users/…/myscript.R”)
Alternatively, you can use the file.choose() function which allows to ‘manually’ select a file
- source(file.choose())
Note: a R script file has the extension ‘.R’
Work strategy - page 285 #
Workspace and .RData files
- All the objects (vectors, lists, dataframes, functions, etc.) one creates in R are temporarily saved in a file on the hard disk, which is called the workspace. When you close your R session those objects are lost.
- Therefore, it is recommended to create separate permanent workspace files for each of your projects.
- These files have the ‘.RData’ extension.
- Warning: If you load a .RData workspace in your current R session, any object defined in your temporary workspace will be overwritten if there is an object of the same name in the loaded workspace.
Work strategy - page 285 #
To save your workspace .RData file (also called workspace image):
- save.image(“C:/Users/…/myproject.RData”)
- or in Rstudio: Go to Session
\(\rightarrow\)
Save Work Space As… - or, when you quit a RStudio session, you are asked whether you want to save your workspace
To load a (previsously saved) workspace:
- load(“C:/Users/…/myproject.RData”)
- in Rstudio: Go to Session
\(\rightarrow\)
Load Work Space…
Note: The instructions above are to save your workspace (all objects produced by your script). It is not the same as saving your actual R script!
Work stategy - Exercise #
Which of the following R codes saves the workspace image as week_5 in the directory C:\Users\student\Documents\ACTL1101
?
save.image("C:\Users\student\Documents\ACTL1101\week_5.rdata")
save.image("C:/Users/student/Documents/ACTL1101/week_5")
setwd("C:/Users/student/Documents/ACTL1101")
save.image(week_5.rdata)
setwd("C:\\Users\\student\\Documents\\ACTL1101")
save.image("week_5.rdata")
Work stategy - Solution #
Which of the following R codes saves the workspace image as week_5 in the directory C:\Users\student\Documents\ACTL1101
?
save.image("C:\Users\student\Documents\ACTL1101\week_5.rdata")
no, need to use \\
or /
save.image("C:/Users/student/Documents/ACTL1101/week_5")
no, missing .rdata after week_5
setwd("C:/Users/student/Documents/ACTL1101")
save.image(week_5.rdata)
no, forgot to put week_5.rdata between quotation marks ""
setwd("C:\\Users\\student\\Documents\\ACTL1101")
save.image("week_5.rdata")
yes, full marks!
Additional Notes #
There are many other vector and matrix operations that you may find useful. You should revise the contents below and refer to the textbook for more details.
- inserting elements into vectors (see Pages 101-102), e.g.
vecA <- c(1, 3, 6, 2, 7, 4, 8, 1, 0)
vecB <- c(vecA, 4, 1)
(vecC <- c(vecA[1:4], 8, 5, vecA[5:9]))
## [1] 1 3 6 2 8 5 7 4 8 1 0
a <- c()
a <- c(a,2)
a
## [1] 2
- modifying the elements in a vector (see Page 101), e.g.
z<-c(0,0,0,2,0)
z[c(1,5)] <- 1
z
## [1] 1 0 0 2 1
z[which.max(z)] <- 0
z
## [1] 1 0 0 0 1
z[z==0] <- 8
z
## [1] 1 8 8 8 1
- modifying elements of matrices (see Page 105), e.g.
m <- matrix(c(1,2,3,1,2,3,2,1,3),3,3)
m[m!=2] <- 0
m
## [,1] [,2] [,3]
## [1,] 0 0 2
## [2,] 2 2 0
## [3,] 0 0 0
Mat <- matrix(1:12,nrow=4,byrow=TRUE)
Mat <- Mat[-4,]
Mat
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
m[Mat>7] <- Mat[Mat>7]
m
## [,1] [,2] [,3]
## [1,] 0 0 2
## [2,] 2 2 0
## [3,] 0 8 9
- extracting elements from a list (see Pages 106-108)
L <- list(cars=c("FORD","PEUGEOT"),climate=c("Tropical","Temperate"))
L[["cars"]]
## [1] "FORD" "PEUGEOT"
L$cars
## [1] "FORD" "PEUGEOT"
L$climate
## [1] "Tropical" "Temperate"
Note that
- `\(L[2]\)` returns the second component of L as a list
- `\(L[[2]]\)` returns the second component of L as an element
- insering elements into a list (see Page 108)
Mat <- matrix(1:12,nrow=4,byrow=TRUE)
L <- list(12,c(34,67,8),Mat,1:5,list(10,11))
L[2]
## [[1]]
## [1] 34 67 8
class(L[2])
## [1] "list"
L[c(3,4)]
## [[1]]
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
##
## [[2]]
## [1] 1 2 3 4 5
L[[2]]
## [1] 34 67 8
class(L[[2]])
## [1] "numeric"
L[[5]][[2]]
## [1] 11
Additional Notes - Manipulating character strings - Page 109 #
(string <- c("one","two","three"))
## [1] "one" "two" "three"
as.character(1:3)
## [1] "1" "2" "3"
string1 <- c("a","B","bba","!")
nchar(string1)
## [1] 1 1 3 1
string1[nchar(string1)>2]
## [1] "bba"
string2 <- c("e","D")
paste(string1,string2)
## [1] "a e" "B D" "bba e" "! D"
paste(string1,string2,sep="-")
## [1] "a-e" "B-D" "bba-e" "!-D"
strsplit(c("05 Jan","06 Feb"), split=" ")
## [[1]]
## [1] "05" "Jan"
##
## [[2]]
## [1] "06" "Feb"
Relevant Exercises #
- 5.1, 5.3 5.6 - 5.19, 9.1-9.8, 9.10-9.12