Create a new .Rmd
file
Process the .Rmd
file
Anatomy
‘YAML’ header with information about the document
---
title: "Notes"
author: "Martin Morgan"
date: "6/30/2019"
output: html_document
---
Plain text ‘markdown’
#
: top level heading; ##
second level heading, …More help on markdown under “Help” -> "Markdown Quick Reference
Exercise To get going, leave the YAML header, but remove everything else.
library()
, install.packages()
, and BiocManager::install()
We’ll use the following libraries; attach them to your current R session.
library(readr)
library(dplyr)
If R responds
Error in library(readr) : there is no package called 'readr'
or similar, it means that the library needs to be installed. The
‘Bioconductor’ way of library installation (able to install CRAN
and Bioconductor packages) is to use the BiocManager
package. Make sure you have the BiocManager
package installed
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
then install the packages you need
BiocManager::install(c("readr", "dplyr"))
read_csv()
Find the data file “ALL-sample-sheet.csv”
pdata_file <- file.choose()
then input the data using read_csv()
pdata <- read_csv(pdata_file)
## Parsed with column specification:
## cols(
## .default = col_character(),
## age = col_double(),
## `t(4;11)` = col_logical(),
## `t(9;22)` = col_logical(),
## cyto.normal = col_logical(),
## ccr = col_logical(),
## relapse = col_logical(),
## transplant = col_logical()
## )
## See spec(...) for full column specifications.
If R responds
Error in read_csv() : could not find function "read_csv"
this means that the readr
package has not been attached to the R
session; remedy with library(readr)
.
Take a peak at the data to see what we have
pdata
## # A tibble: 128 x 22
## sample cod diagnosis sex age BT remission CR date.cr
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 01005 1005 5/21/1997 M 53 B2 CR CR 8/6/19…
## 2 01010 1010 3/29/2000 M 19 B2 CR CR 6/27/2…
## 3 03002 3002 6/24/1998 F 52 B4 CR CR 8/17/1…
## 4 04006 4006 7/17/1997 M 38 B1 CR CR 9/8/19…
## 5 04007 4007 7/22/1997 M 57 B2 CR CR 9/17/1…
## 6 04008 4008 7/30/1997 M 17 B1 CR CR 9/27/1…
## 7 04010 4010 10/30/19… F 18 B1 CR CR 1/7/19…
## 8 04016 4016 2/10/2000 M 16 B1 CR CR 4/17/2…
## 9 06002 6002 3/19/1997 M 15 B2 CR CR 6/9/19…
## 10 08001 8001 1/15/1997 M 40 B2 CR CR 3/26/1…
## # … with 118 more rows, and 13 more variables: `t(4;11)` <lgl>,
## # `t(9;22)` <lgl>, cyto.normal <lgl>, citog <chr>, mol.biol <chr>,
## # `fusion protein` <chr>, mdr <chr>, kinet <chr>, ccr <lgl>,
## # relapse <lgl>, transplant <lgl>, f.u <chr>, `date last seen` <chr>
This represents 22 variables collected on 128 samples from a classic microarray experiment. Some of the column names are cryptic; we’ll only focus on a few.
Add a top-level section title (start a line with #
) like
# Day 1: R / Bioconductor practical
Add notes about the key functions you’ve learned about so far, using a bulleted list like
Key functions:
- `file.choose()`: interactively choose a file location.
- `library(readr)`: attach the readr package to the current _R_ session
- `read_csv()`
Press the ‘Knit’ button, debugging any errors that occur.
Create a code chunk containing instructions for data input, using a
‘fence’ followed by {r}
```{r} library(readr) pdata_file = "path/to/file" pdata <- readr(pdata_file) pdata ```
Press the ‘Knit’ button. Note that the code is actually evaluated, and
when correct will import the data from the pdata_file
.
arrange()
Use arrange()
to order the rows from youngest to oldest individual
pdata %>% arrange(age)
## # A tibble: 128 x 22
## sample cod diagnosis sex age BT remission CR date.cr
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 08018 8018 8/27/1999 M 5 B3 CR CR 10/18/…
## 2 19017 19017 3/17/2000 M 14 T2 CR CR 6/15/2…
## 3 06002 6002 3/19/1997 M 15 B2 CR CR 6/9/19…
## 4 25003 25003 5/22/1998 M 15 B2 CR CR 8/4/19…
## 5 04016 4016 2/10/2000 M 16 B1 CR CR 4/17/2…
## 6 24010 24010 6/3/1997 F 16 B2 CR CR 8/11/1…
## 7 24019 24019 5/4/1999 M 16 B4 CR CR 7/28/1…
## 8 28001 28001 10/19/19… M 16 B3 REF REF <NA>
## 9 28044 28044 7/20/1999 M 16 B3 CR CR 9/20/1…
## 10 01007 1007 9/30/1998 F 16 T3 CR CR 11/30/…
## # … with 118 more rows, and 13 more variables: `t(4;11)` <lgl>,
## # `t(9;22)` <lgl>, cyto.normal <lgl>, citog <chr>, mol.biol <chr>,
## # `fusion protein` <chr>, mdr <chr>, kinet <chr>, ccr <lgl>,
## # relapse <lgl>, transplant <lgl>, f.u <chr>, `date last seen` <chr>
If R responds with
> pdata %>% arrange(age)
Error in pdata %>% summarize(n = n()) : could not find function "%>%"
then we need to attach the dplyr
package, library(dplyr)
.
Arrange from oldest to youngest with
pdata %>% arrange(desc(age))
## # A tibble: 128 x 22
## sample cod diagnosis sex age BT remission CR date.cr
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr>
## 1 16004 16004 4/19/1997 F 58 B1 CR CR 7/15/1…
## 2 20002 20002 5/9/1997 F 58 B2 CR CR 8/19/1…
## 3 04007 4007 7/22/1997 M 57 B2 CR CR 9/17/1…
## 4 24017 24017 9/15/1998 M 57 B2 CR CR 12/7/1…
## 5 08012 8012 10/22/19… M 55 B3 CR CR 1/9/19…
## 6 28021 28021 3/18/1998 F 54 B3 CR DEAT… 5/22/1…
## 7 30001 30001 1/16/1997 F 54 B3 <NA> DEAT… <NA>
## 8 43007 43007 10/14/19… M 54 B4 CR CR 12/30/…
## 9 62002 62002 1/15/1998 M 54 B4 <NA> DEAT… <NA>
## 10 01005 1005 5/21/1997 M 53 B2 CR CR 8/6/19…
## # … with 118 more rows, and 13 more variables: `t(4;11)` <lgl>,
## # `t(9;22)` <lgl>, cyto.normal <lgl>, citog <chr>, mol.biol <chr>,
## # `fusion protein` <chr>, mdr <chr>, kinet <chr>, ccr <lgl>,
## # relapse <lgl>, transplant <lgl>, f.u <chr>, `date last seen` <chr>
Exercise How would you use arrange()
to order the samples from
youngest to oldest, with individuals of equal age
ordered by sex
?
summarize()
and group_by()
Let’s ask the average age of males and females. First, use
summarize()
and the n()
function to create a tibble with the
number of samples of each sex.
pdata %>% summarize(n = n())
## # A tibble: 1 x 1
## n
## <int>
## 1 128
We now know how many samples there are. Use group_by()
to tell
dplyr
to separately consider each sex
pdata %>% group_by(sex)
and then perform the summary operation…
pdata %>% group_by(sex) %>% summarize(n = n())
## # A tibble: 3 x 2
## sex n
## <chr> <int>
## 1 F 42
## 2 M 83
## 3 <NA> 3
Exercise What is the NA
sex? Use filter()
to identify the 3
records with is.na(sex)
.
Exercise Use arrange()
and desc()
to arrange the summary table
from most to least frequent.
Exercise Add another column ave_age
to summarize()
, using
mean()
to calculate the average age by sex. You’ll need to use the
na.rm = TRUE
argument to calculate an average age in the face of
missing values.
t.test()
, an untidy functionWe’d like to compare the average age of males and females in the study
using t.test()
. t.test()
takes a formula age ~ sex
(age
as a
function of sex
) describing the comparison that we’d like to
make. It also takes an argument data=
that contains the data we’d
like to perform the t-test on. Unlike functions we’ve encountered so
far where the data to be processed is the first argument, t.test()
expects the data as the second argument. To adapt t.test()
for use,
we need to explicitly indicate that the data should be the second
argument. One way of doing this is to use the special symbol .
to
represent the location of the incoming data, invoking t.test(age ~ sex, data = .)
:
pdata %>% t.test(age ~ sex, data = .)
Exercise Perform a t-test to ask whether there is evidence of
differences in ages between the sexes. How can we change the default
value of var.equal
to TRUE
? Is this appropriate?
Exercise (advanced) The return value of t.test()
doesn’t fit
well with tidy
data analysis, because it is a complicated object
that is not represented as a tibble
and hence cannot be computed
upon using the common tidy verbs. Write a simple function t_test()
that is more tidy-friendly, accepting data =
as it’s first argument,
using t.test()
internally to compute results, and returning a
tibble
containing results formatted for subsequent
computation. Explore the broom::tidy()
function as a way to
transform many base R objects into tidy-friendly data structures.
filter()
and %in%
Take a look at the mol.biol
column, using group_by()
and the
convenience function count()
pdata %>%
group_by(mol.biol) %>%
count()
## # A tibble: 6 x 2
## # Groups: mol.biol [6]
## mol.biol n
## <chr> <int>
## 1 ALL1/AF4 10
## 2 BCR/ABL 37
## 3 E2A/PBX1 5
## 4 NEG 74
## 5 NUP-98 1
## 6 p15/p16 1
These represent chromosomal events such as the presence of the BCR/ABL fusion gene, or NEG (standard) chromosomes.
Exercise Use filter()
to create a new data set bcrabl
that
contains all 22 columns but only the BCR/ABL and NEG rows. It will be
helpful to use the %in%
operator, which returns TRUE when an object
in the set on the left-hand side is present in the object on the
right-hand side.
c("a", "b", "c", "b", "a") %in% c("a", "c")
## [1] TRUE FALSE TRUE FALSE TRUE
Exercise Use t.test()
to ask whether there are age differences
between these molecular biologies.
Continue your markdown document by summarizing in a bulleted list the key functions covered in this section. Include examples of code inside fences, and make sure the code is evaluated when the “Knit” button is pressed.
ggplot()
Load the ggplot2
library.
library(ggplot2)
Plots are created by providing a source for the data to be displayed,
as well as ‘aesthetics’ aes()
describing the data to be used.
ggplot(pdata, aes(x = sex, y = age))
geom_boxplot()
This is enough to determine the overall dimensions of the display, but
we still need to add ‘geometric’ geom_*()
objects that display the
relationships of interest. Add a boxplot geom for an initial
visualization.
ggplot(pdata, aes(x = sex, y = age)) +
geom_boxplot()
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
geom_jitter()
Add a second layer that superimposes the actual data using
geom_jitter()
invoked so that the sex
value is displaced a random
amount about its horizontal location (width = .1
) while the age is
plotted at its actual vertical location (height = 0
).
ggplot(pdata, aes(x = sex, y = age)) +
geom_boxplot() +
geom_jitter(width = .1, height = 0)
## Warning: Removed 5 rows containing non-finite values (stat_boxplot).
## Warning: Removed 5 rows containing missing values (geom_point).
Exercise How might pdata
be filtered so that the samples with
unknown age are excluded, and hence there are only two categories on
the x-axis?
Exercise Are there other displays that convey both the summary of results and the actual data?
Create a short character vector of DNA sequences.
sequences <- c("AAATCGA", "ATACAACAT", "TTGCCA")
In base R we can ask about properties of these sequences and perform some operations
sequences
## [1] "AAATCGA" "ATACAACAT" "TTGCCA"
length(sequences)
## [1] 3
nchar(sequences)
## [1] 7 9 6
sequences[c(1, 3)]
## [1] "AAATCGA" "TTGCCA"
sample(sequences)
## [1] "TTGCCA" "ATACAACAT" "AAATCGA"
Base R has no notion of operations relevant to DNA sequences, e.g.,
reverseComplement(sequences)
fails. Likewise, we can name a variable anything, the semantic meaning of the variable name is not enforced by R
my_sequence <-
"All the world's a stage, And all the men and women merely players"
library(Biostrings)
Load the Biostrings library
library(Biostrings)
If R responds with
> library(Biostrings)
Error in library(Biostrings) : there is no package called 'Biostrings'
then it is necessary to install the package first (make sure you’ve spelt the package name correctly, including capitalization!)
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Biostrings")
library(Biostrings)
DNAStringSet()
Create a DNAStringSet
from our character vector
dna <- DNAStringSet(sequences)
dna
## A DNAStringSet instance of length 3
## width seq
## [1] 7 AAATCGA
## [2] 9 ATACAACAT
## [3] 6 TTGCCA
Exercise Does the object dna
support the operations illustrated
above for a character vector, especially length()
, nchar()
, [
,
and sample()
?
Exercise Prove to yourself that at least some other useful,
DNA-specific, functions exist, e.g., reverse()
and
reverseComplement()
.
Exercise What happens when you try to create a DNAStringSet()
from an object such as my_sequence
, defined above, that does not
contain a DNA sequence? Warning: the message is quite cryptic, can you
provide a ‘human’ translation?
Exercise Why does DNAStringSet("ACGTMRW")
not create an error,
since MRW
are not standard nucleotides? For hints, see the section
’The DNA alphabet:" on the help page ?DNAString
.
methods()
, help, and browseVignettes()
The function DNAStringSet()
returns an object that has a
particular class
class(dna)
## [1] "DNAStringSet"
## attr(,"package")
## [1] "Biostrings"
Associated with the class are a series of methods that operate on the class.
Exercise Discover many (unfortunately, not all) methods acting on
an object of class DNAStringSet
using methods(class = "DNAStringSet")
. Verify that reverseComplement
is among those
methods.
Help pages describing a particular method can be found using ?
, with
the search query quoted and with tab-completion providing hints on
what the appropriate help topic is.
Exercise Find the help page for the reverseComplement
method
operating on a DNAStringSet
object, using
?"reverseComplement,DNAStringSet-method"
.
Help pages provide a description of the technical details required for creating classes and using methods. Vignettes provide a more narrative description of overall package use.
Exercise Use browseVignettes(package = "Biostrings")
to see
vignettes available for this package; explore a few vignettes to get a
sense of possible content.
readDNAStringSet()
It is unlikely that we would enter 1000’s of DNA sequences ‘by
hand’. Instead, we might read the data from a standard file
format. For DNA sequences the standard file format is often a ‘FASTA’
file, sometimes abbreviated with an extension .fa
and often
compressed with an additional extension .fa.gz
. An example of a
FASTA file containing DNA sequences of the 2000bp upstream nucleotides
of all genes annotated in the Drosophila melanogaster dm3
genome
build, is distributed with the Biostrings package. Here’s the path
to the FASTA file.
fa_file <-
system.file(package="Biostrings", "extdata", "dm3_upstream2000.fa.gz")
Exercise Take a peak at the structure of a FASTA file by looking at the first five lines.
readLines(fa_file, 5)
## [1] ">NM_078863_up_2000_chr2L_16764737_f chr2L:16764737-16766736"
## [2] "gttggtggcccaccagtgccaaaatacacaagaagaagaaacagcatctt"
## [3] "gacactaaaatgcaaaaattgctttgcgtcaatgactcaaaacgaaaatg"
## [4] "tttttgtgcttttcgaacaaaaaattgggagcagaatattggattcgctt"
## [5] "ttttcgcatcgaatatcttaagggagcaagggaagaaccatctagaataa"
The first line is an identifier, containing information about the gene
NM_078863
as well as the genomic coordinates of the sequence
chr2L:16764737-16766736
. The next lines are the DNA sequence. After
a certain number of lines, a new record starts.
tail(readLines(fa_file, 44), 5)
## [1] "cacgcacaccgatcgtcgaatcgaaaagctttcggggtcttacgggatcc"
## [2] "atgggtatcaagttgccccgtataaaaggcaagtttaccggttgcacggt"
## [3] ">NM_001201794_up_2000_chr2L_8382455_f chr2L:8382455-8384454"
## [4] "ttatttatgtaggcgcccgttcccgcagccaaagcactcagaattccggg"
## [5] "cgtgtagcgcaacgaccatctacaaggcaatattttgatcgcttgttagg"
We could fairly ‘easily’ write our own parser for this format, but this would be error-prone and unnecessary.
Exercise Input the file
dna <- readDNAStringSet(fa_file)
dna
## A DNAStringSet instance of length 26454
## width seq names
## [1] 2000 GTTGGTGGCCCACCAGTGC...GTTTACCGGTTGCACGGT NM_078863_up_2000...
## [2] 2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201794_up_2...
## [3] 2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201795_up_2...
## [4] 2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201796_up_2...
## [5] 2000 TTATTTATGTAGGCGCCCG...CGGAAAGTCATCCTCGAT NM_001201797_up_2...
## ... ... ...
## [26450] 2000 ATTTACAAGACTAATAAAG...ATTAAATTTCAATAAAAC NM_001111010_up_2...
## [26451] 2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001015258_up_2...
## [26452] 2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001110997_up_2...
## [26453] 2000 GATATACGAAGGACGACCT...TTTGAGTTGTTATATATT NM_001276245_up_2...
## [26454] 2000 CGTATGTATTAGTTAACTC...AAGTGTAAGAACAAATTG NM_001015497_up_2...
Exercise Query the object for basic properties, e.g., it’s
length()
and that the number of characters in each sequence is 2000
unique(nchar(dna))
.
letterFrequency()
: calculate GC contentExercise Use letterFrequency()
to determine GC content of each
of the DNA sequences in dna
. The letters
argument should be
"GC"
; as.prob = TRUE
returns values between 0 and 1. The data is
returned as a matrix with 1 column.
Exercise Plot the distribution of GC frequencies in the dna
object using base graphics hist()
and plot(density())
, and using
ggplot()
.
Although Bioconductor emphasizes formal objects like DNAStringSet
rather than tibble
-like data frames, some of the ways one interacts
with tidy data can be applied to Bioconductor objects. For instance,
the GC content example might be written in ‘traditional’ form as
gc <- letterFrequency(dna, "GC", as.prob = TRUE)
but could be written using pipes and to reesult in a tibble for easier down-stream manipulation
gc <-
dna %>%
letterFrequency("GC", as.prob = TRUE) %>%
tibble::as_tibble()
gc
## # A tibble: 26,454 x 1
## `G|C`
## <dbl>
## 1 0.378
## 2 0.43
## 3 0.43
## 4 0.43
## 5 0.43
## 6 0.43
## 7 0.43
## 8 0.43
## 9 0.43
## 10 0.43
## # … with 26,444 more rows
Exercise Revise other code, above, to follow this style of analysis. Are there merits / limitations of base verses ‘tidy’ approaches?
Exercise Are there reasons why we did not just put our sequences
vector of DNA sequences into a tibble
, and go from there?
Add further notes to the markdown document summarizing the functions and objects you’ve learned in this practical.
Finish your markdown document with a fenced section that evaluates the
sessionInfo()
command.
```{r} sessionInfo() ```
Here are the packages used for this practical:
sessionInfo()
## R version 3.6.0 Patched (2019-04-26 r76431)
## Platform: x86_64-apple-darwin17.7.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Users/ma38727/bin/R-3-6-branch/lib/libRblas.dylib
## LAPACK: /Users/ma38727/bin/R-3-6-branch/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Biostrings_2.53.0 XVector_0.25.0 IRanges_2.19.10
## [4] S4Vectors_0.23.17 BiocGenerics_0.31.4 ggplot2_3.2.0
## [7] dplyr_0.8.2 readr_1.3.1 BiocStyle_2.13.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 pillar_1.4.2 compiler_3.6.0
## [4] BiocManager_1.30.5.1 zlibbioc_1.31.0 tools_3.6.0
## [7] zeallot_0.1.0 digest_0.6.19 evaluate_0.14
## [10] tibble_2.1.3 gtable_0.3.0 pkgconfig_2.0.2
## [13] rlang_0.4.0 cli_1.1.0 rstudioapi_0.10
## [16] yaml_2.2.0 xfun_0.8 withr_2.1.2
## [19] stringr_1.4.0 knitr_1.23 vctrs_0.1.0
## [22] hms_0.4.2 grid_3.6.0 tidyselect_0.2.5
## [25] glue_1.3.1 R6_2.4.0 fansi_0.4.0
## [28] rmarkdown_1.13 bookdown_0.11 purrr_0.3.2
## [31] magrittr_1.5 codetools_0.2-16 backports_1.1.4
## [34] scales_1.0.0 htmltools_0.3.6 assertthat_0.2.1
## [37] colorspace_1.4-1 labeling_0.3 utf8_1.1.4
## [40] stringi_1.4.3 lazyeval_0.2.2 munsell_0.5.0
## [43] crayon_1.3.4