Chapter 3 High throughput sequencing

Learning Objectives

The aim of this chapter is to present high throughput sequencing by going through one of its key applications: RNA-seq. It is a highly sensitive and accurate tool that has revolutionised the study of transcriptomes. The objective is to give an overview of the technology.

RNA Sequencing Data: Hitchhiker’s Guide to Expression Analysis(⊕Van den Berge et al. 2019Van den Berge, Koen, Katharina Hembach, Charlotte Soneson, Simone Tiberi, Lieven Clement, Michael Love, Rob Patro, and Mark Robinson. 2019. “RNA Sequencing Data: Hitchhiker’s Guide to Expression Analysis.” Annual Review of Biomedical Data Science 2 (July): 139–73. https://doi.org/10.1146/annurev-biodatasci-072018-021255.) is an interesting review describing RNA-seq technology and data analysis.

We will pass through the different steps starting from the library preparation and the sequencing process. We will explore the resulting fastq files and the downstream processing pipeline including quality control, alignment and counting steps.

3.1 RNA-seq library preparation

The starting point of a RNA-seq experiment is the cDNA library preparation. RNA is first isolated from cells and purified to remove ribosomal RNA which constitute the majority of total RNA. This can be done by extracting poly(A) RNA or by depleting rRNA. Purified RNA is then fragmented, reverse transcribed and adapters are ligated to both extremities to allow for amplification and sequencing.

3.1.1 Stranded versus unstranded libraries

The library preparation presented above corresponds to an unstranded protocol.

As both cDNA strands will be amplified for sequencing, about half of the reads will be sequenced from the first cDNA strand, and half of the reads will be sequenced from the second strand cDNA. But we won’t know which strand was sequenced! So we lose the information about the orientation of the transcript.

This can be avoided by using a stranded protocol. In this case, dUTPs are added during the 2nd strand cDNA synthesis and the newly synthetized cDNA strand is degradated after the ligation step. This allows to infer the transcript orientation because only the first cDNA strand will be sequenced. Indeed, if the sequence of the read corresponds to the the sense strand of DNA, it means that the gene was originally transcribed in a sense orientation. On the contrary, if the read is complementary to the sense strand of DNA, it means that it was originally transcribed in antisense.

Strand-specificity leads to lower number of ambiguous reads.

3.2 Sequencing

3.2.1 SE vs PE sequencing

Once the library is ready, molecules from the library are sequenced in a high throughput manner to obtain short sequences from one end (single-end sequencing) or both ends (pair-end sequencing). Single-end sequencing is cheaper, but Paired-end sequencing improves the ability to localise the fragment in the genome and resolve mapping close to repeat regions, resulting in less multimapping reads.

3.2.2 cluster amplification

The sequencing is done by a cluster amplification process.

The library preparation is hybridized to a flow cell coated with immobilized oligonucleotides that serve as support to hold the DNA strands in place during sequencing. The templates are copied from the hybridized primers by 3’ extension using a high-fidelity DNA polymerase. The original templates are denatured, leaving the copies immobilized on the flow cell surface.

Immobilized DNA template copies are then amplified by bridge amplification. The templates loop over to hybridize to adjacent oligonucleotides. DNA polymerase copies the templates from the hybridized oligonucleotides, forming dsDNA bridges, which are denatured to form two ssDNA strands. These two strands loop over and hybridize to adjacent oligonucleotides and are extended again to form two new dsDNA loops. The process is repeated on each template by cycles of denaturation and amplification to create millions of individual, dense clonal clusters containing ~2,000 molecules.

Each cluster of dsDNA bridges is denatured, and the reverse strand is removed by specific base cleavage, leaving the forward DNA strand. The 3’-ends of the DNA strands and flow cell-bound oligonucleotides are blocked to prevent interference with the sequencing reaction. The sequencing primer is hybridised to the Illumina adapter. Clusters are ready for sequencing.

The sequencing process is done by synthesis. A polymerase adds a fluorescently tagged dNTP to the DNA strand (only one base is able to be added per round due to the fluorophore acting as a blocking group, but the blocking group is reversible). Each of the four bases has a unique fluorophore, and after each round, the machine records which base was added. The fluorophore is then washed away and the process is repeated.

For a dynamic view of the cluster amplification process, watch the Illumina video.

3.3 fastq files

The sequence data generated from the sequencer are received in text files called fastq files. For a single-end run, one fastq file is created for each sample. For a paired-end run, two separated fastq files are generated, each containing sequences from one end.

What does a FASTQ file look like?

Each entry in a fastq file consists of 4 lines:

A sequence identifier (by convention preceded by ‘@’) with information about the sequencing run
The sequence
A delimiter, always starting by the sign ‘+’
The base call quality scores. These are Phred scores, encoded using ASCII characters to represent the numerical quality scores. Each character is assigned a quality score between 0 and 40. Quality scores represent the probability that the corresponding nucleotide call is correct.

► Question

What do you think of the quality of the last sequence presented in the fastq file above?

3.4 Processing pipeline

The raw reads from fastq files will need to pass through a different tools to yield ultimately a count table, representing a snapshot of individual gene expression levels. The execution of this set of tools in a specified order is commonly referred to as a pipeline.

You can download the whole script that we will use to process the RNA-seq data by clicking here

3.4.1 Quality control

Of course, no one will analyse the sequences from the fastq file one by one to check their quality… Rather, a software called FastQC can be used. Then, another program such as Trimmomatics can help to clean the reads.

FastQC

FastQC allows to analyse different features of the reads and generates a report that includes several diagnostic plots to highlight any problem the data may have.

FastQC is run from the command line:

$ fastqc SampleName.fastq

► Question

Try to interpret the different plots of this fastQC report from a real RNA-seq fastq file.

Another report corresponding to bad quality data can be used as a point of comparison.

► Solution

1. Per base sequence quality

This graph shows an overview of the range of quality values (Phred scores) across all bases at each position. In this example, the sample contains reads that are 150 bp long. For each position, a boxplot is drawn with the median value, represented by the red line. The mean quality is represented by the blue line. The background of the graph divides the y-axis into very good quality scores (green), scores of reasonable quality (orange), and reads of poor quality (red). Here the quality of reads is very good.

This graph would for example represent low quality reads:

2. Per tile sequence quality:

Flowcells are divided into tiles. The graph allows to look at the quality scores from each tile across the reads to see if there is a loss in quality associated with one part of the flowcell. The plot shows the deviation from the average quality for each tile. Cold colours being positions where the quality was at or above the average for that base in the run, and hotter colours indicate that a tile had worse qualities than other tiles for that base. Low scores are often found around the edges, but hot spots can also occur in the middle (due to an air bubble for example), however these should represent only a small percentage of the total sequences. In this example no tile shows poor quality.

This graph would for example show some lower quality tiles:

3. Per sequence quality scores

It represents a density plot of average quality per read. The distribution of average read quality should show a tight peak in the upper range of the plot. It could report a subset of sequences with low quality values, however these should represent only a small percentage of the total sequences.

4. Per base sequence content

It plots the proportion of each base position over all the reads. Typically, we expect to see each base roughly 25% of the time at each position so the lines in this plot should run parallel with each other. Experience has shown that RNA-seq libraries often have a selection bias in around the first 12bp of each run. This is due to a biased selection of random primers, but doesn’t represent any individually biased sequences. This is a typical plot obtained for RNA-seq reads.

When Analysing DNA, the figure could look like this:

5. Per sequence GC content

It gives the GC distribution across the reads.

This plot makes more sense for whole genome shotgun sequencing, as the overall GC content should reflect the GC content of the underlying genome. (Note that since the software doesn’t know the the GC content of the genome, the theorical distribution is calculated from the observed data).

An unexpected distribution could indicate a contaminated library.

For RNA sequencing, this plot makes less sense, as over-represented sequences might generate an unexpected distribution.

6. Per base N content

If a sequencer is unable to make a base call with sufficient confidence then it will replace the conventional base call with an ‘N’. This graph plots the percent of times that ‘N’ occurs at a position in all reads. If there was an increase at a particular position, it could indicate that something went wrong during sequencing.

7. Sequence Length Distribution

The distribution of sequence lengths of all reads in the file. Here we see that most reads are 150 pb long but some reads are shorter.

8. Sequence Duplication Levels:

It gives the distribution of duplicated reads. The blue line corresponds to the percentage of duplicated reads among total number of reads. The red line corresponds to the percentage of duplicated reads among total number of distinct reads.

In a highly diverse library from whole genome shotgun sequencing, almost all the reads are expected to be unique, appearing only 1 time in the sequence data. A high level of duplication could indicate biased PCR enrichment, or presence of contaminant.

In contrast, in RNA-seq, there will likely be some very highly abundant transcripts for which it is expected to observe duplicated reads. Here the data was flagged as ‘failed’ by FastQC even though duplication was expected in this case.

9. Overrepresented sequences

FastQC lists here all of the sequence which make up more than 0.1% of the total. Finding that a single sequence is highly overrepresented in the set means that it is either highly biologically significant, or that the library is biased or contaminated. Here it is of course not relevant as it just corresponds to a sequence full of ‘N’.

10. Adapter Content

This graph indicates if and where adapter sequences occur in the reads.

If residual adapters were present we could see something like this:

Trimmomatics

FastQC gives a global vision the quality of the sample. If it is not optimal, it is advisable to use Trimmomatics software to filter poor quality reads and trim residual adapters.

Trimmomatics has a variety of options to clean the reads. Read the manual for more information. However, a command for Trimmomatics could look something like the command below. In this case it would perform adapter removal, and perform a scan of the read with a 10-base wide sliding window, cutting when the average quality per base drops below 20.

$ java -jar trimmomatic.jar \
  PE \                                  # paired-end
  -threads 4 \                          # number of threads
  SampleName_1.fastq \                  # fastq file for fw reads
  SampleName_2.fastq  \                 # fastq file for rev reads
  SampleName_1_clean.fastq \            # clean fastq for fw reads
  SampleName_1_unpaired.fastq \         # fastq for unpaired remaining fw reads
  SampleName_2_clean.fastq \            # clean fastq for rev reads
  SampleName_2_unpaired.fastq \         # fastq for unpaired remaining rev reads
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \  # cut adapter from the read.
  SLIDINGWINDOW:10:20                   # sliding window trimming

► Question

How could you check that the trimming performed well?

► Solution

Once samples passed through Trimmomatics, they should perform better on the quality tests run by FastQC. You could re-run FastQC on your clean fastq files to see whether adapters were correctly removed and if the per base sequence quality is higher after trimming…

$ fastqc SampleName_1_clean.fastq

3.5 Alignment

Next step is to map the reads to determine where in the genome the reads originated from. Many different aligners are available such as HISAT2, STAR, BWA, Salmon, Subread… There is no gold standard, but some tools are better suited for particular NGS analyses (⊕Giacomo et al. 2017Giacomo, B, K E Hayer, E J Kim, B Di Camillo, G A FitzGerald, and G R Grant. 2017. “Simulation-Based Comprehensive Benchmarking of RNA-Seq Aligners.” Nature Methods 2 (14): 135. https://doi.org/10.1038/nmeth.4106.)… In the case of RNA-seq, it is important to use an aligner able to handle spliced reads.

Here we show an example of what the command could look like when using HISAT2 aligner, launched with default parameters. In this basic configuration, mandatory parameters are of course the fastq file(s) and the sequence of the genome on which the reads have to be aligned. In practical, the genome’s sequence is not given as it is but in an indexed form 44 Genome indexing allows the aligner to quickly find potential alignment sites for query sequences in a genome, which saves considerable time during alignment. Frequently used genome (human, mouse, rat…) already have indexed versions that can be downloaded directly. When working with another reference genome, a specific indexed genome has to be built.. Of course many of other parameters can be fine tuned (see the manual for more details). The alignment output file data is a sam file (see below).

$ hisat2 \
  -p 6 \                        # number of threads
  -x genome_index \             # index for the reference genome
  -1 SampleName_1_clean.fastq \ # clean fastq_file for fw reads
  -2 SampleName_2_clean.fastq \ # clean fastq_file for rev reads
  -S SampleName.sam             # sam file name

It is recommended to pay attention to the HISAT2 report to check the alignment rate. For a PE-end alignment it could look like this:

SAM format

The SAM format is a generic format for storing large nucleotide sequence alignments. In a SAM file, each entry gives the alignment information for a single read. Each entry has 11 mandatories fields for essential mapping information and a variable number of other fields for aligner specific information. An example of few entries from a SAM file is displayed below.

head wsbim2122_data/RNAseq_pipeline/example.sam

## DRR016707.26395247   147 1   19626495    60  101M    =   19626490    -106    GTAACAATGTTATCAGTAATGCTTTAAACTCCAGCACCTGGTTATGCATTTGAAACCAAGTCTGTTTCTTGTTTTGTATTTTCTCTCTGGAAGTTGTAAGG   DDDDDDDDDDDDDEEEEEEEFFFFFFHCHHHJJJHHJJJJJJJJJJJJIHFJJIHJJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJHHHHHFFFFFCCC   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:0  YT:Z:CP NH:i:1
## DRR016707.26395247   99  1   19626490    60  101M    =   19626495    106 CTTCTGTAACAATGTTATCAGTAATGCTTTAAACTCCAGCACCTGGTTATGCATTTGAAACCAAGTCTGTTTCTTGTTTTGTATTTTCTCTCTGGAAGTTG   CCCFFFFFHHHHHJIJJJJIJJJJJJJJJJIJJJJJJJJJJJJJJJHHIHIJJJIJJHIJJJIIIIIIJGIJJJJDHJJJHHHHHHHFFFFFFEDACEEDD   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:0  YT:Z:CP NH:i:1
## DRR016707.34969377   163 1   19626487    60  101M    =   19626493    107 ATTCTTCTGTAACAATGTTATCAGTAATGCTTTAAACTCCAGCACCTGGTTATGCATTTGAAACCAAGTCTGTTTCTTGTTTTGTATTTTCTCTCTGGAAG   CC@FFFFFHFHFFHGGJCIFJGHIIIIGIIIIJIIJJJIGIIIIIIGII?:BFHGHIIIIGEHHHGIIDGGH@GGIGHJGHHHHCEHDEFDDFFEEECCEC   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:0  YT:Z:CP NH:i:1
## DRR016707.34969377   83  1   19626493    60  101M    =   19626487    -107    CTGTAACAATGTTATCAGTAATGCTTTAAACTCCAGCACCTGGTTATGCATTTGAAACCAAGTCTGTTTCTTGTTTTGTATTTTCTCTCTGGAAGTTGTAA   CDDDDDDDDDDDDECEEEEEFFFFD?=HHHHGIIGEGIGIIIHIGDIHHGGGIIHGHIIIIGIIIGIGEIIGGGIIIIGIIIIGHDIIFHHHHFFBDD?C@   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:0  YT:Z:CP NH:i:1
## DRR016707.42561021   161 1   19626495    60  100M1S  =   19626494    0   GTAACAATGTTATCAGTAATGCTTTAAACTAAAGCACCTGGTTATGCATTTGAAACCAAGTCTGTTTCTTGTTTTGTATTTTCTCTCTGGAAGTTGTAAGA   ?8?B1:BDD,,2<:C<2A<<,+A?CBFFAE<+<+2A?EEBD99:D<*?B:*:D9B:DD/BDED@)=8=B;@=)).=(=@;7A)7.==>;3);@AAA>A@3;   AS:i:-8 XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:30C0C68    YT:Z:UP NH:i:1
## DRR016707.42561021   81  1   19626494    60  101M    =   19626495    0   TGTAACAATGTTATCAGTAATGCTTTAAACTCCAGCACCTGGTTATGCATTTGACACCAAGTCTGTTTCTTGTTTTGTATTTTCGCTCTGGAAGTTGTAAG   (9(??==3??>==;5(55((6.((9>;..;67>)>).5(A;;>=/))8.=>9)((?AB?92=00*;=::1)=:3=74=A7<)0<+<AAAB?=<+4AA<;==   AS:i:-5 ZS:i:-19    XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:54A29T16   YT:Z:UP NH:i:1
## DRR016707.43465443   163 1   19626483    60  101M    =   19626486    104 TGCCATTCTTCTGTAACAATGTTATCAGTAATGCTTTAAACTCCAGCACCTGGTTATGCATTTGAAACCAAGTCTGTTTCTTGTTTTGTATTTTCTCTCTG   B@@FAFFFHHHHHHIIIIJJJHJJIJJIIJJJJJJJFJJJJJJJJJJJJJHIJFHJJIJJJIJJIIJIJJHIHHIIIJJJJJHEHHHFEFFFFFEDEEEEE   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:-1 YT:Z:CP NH:i:1
## DRR016707.43465443   83  1   19626486    60  100M1S  =   19626483    -104    CATTCTTCTGTAACAATGTTATCAGTAATGCTTTAAACTCCAGCACCTGGTTATGCATTTGAAACCAAGTCTGTTTCTTGTTTTGTATTTTCTCTCTGGAG   DDDDDDCDDDDCCDEEEEDFFFFFDBHHHHHGJJGHDJIJJIGIHJGGJJJJJJJJJJJJJHG?JJJJJJJJJJJIIJJJJIJJJJJIGHGHHFFFFFCCC   AS:i:-1 ZS:i:-12    XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100    YS:i:0  YT:Z:CP NH:i:1
## DRR016707.64407509   147 1   19626484    60  101M    =   19626484    -101    GCCATTCTTCTGTAACAATGTTATCAGTAATGCTTTAAACTCCAGCACCTGGTTATGCATTTGAAACCAAGTCTGTTTCTTGTTTTGTATTTTCTCTCTGG   DDDDDCDDDDDDEEDEEEEFFFFFFHHHHHHJJIJIGHFIHJJJIHFIHHFJJJJJIJJJJJJJIHJJJJJIJJJJJJJJJJJJJJJJHHHHHFFFFFBB@   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:0  YT:Z:CP NH:i:1
## DRR016707.64407509   99  1   19626484    60  101M    =   19626484    -101    GCCATTCTTCTGTAACAATGTTATCAGTAATGCTTTAAACTCCAGCACCTGGTTATGCATTTGAAACCAAGTCTGTTTCTTGTTTTGTATTTTCTCTCTGG   CCCFFFFFHHHHHJJJJJJJIJJJJJJJJJJJJJJJJJJJJJIJJJJJJIJJDGIJJJJJJJJCGIJJJIIHHJJJHHIJJJHHHHHDFFFFFFEEEEEED   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101    YS:i:0  YT:Z:CP NH:i:1

Field 1: Sequence ID

Field 2: SAM Flag. It gives information about the read (is it paired, aligned, is the mate read properly aligned…)

Field 3: Chromosome where alignment occurs

Field 4: Position of alignment

Field 5: Mapping quality score (read uniquely mapped (60), multiply mapped (1) or unmapped (0))

Field 6: CIGAR string. It is a representation of alignment indicating positions of match/mismatch/deletion/insertion compared to the reference.

Field 7: Reference name of the mate (Set to ‘=’ if the mate’s reference sequence is the same as this alignment’s)

Field 8: Position of mate

Field 9: Inferred fragment length

Field 10: Read sequence (reverse-complemented if aligned to the reverse strand)

Field 11: Base quality (Phred score)

Field 12: Optional. See hisat2 manual for more information.

BAM format

The BAM file is a compressed binary version of SAM file. SAM files can be easily converted into BAM files using samtools, and it is sometimes mandatory to sort the SAM/BAM by chromosomal coordinates for downstream applications.

samtools view -Sb SampleName.sam | samtools sort > SampleName.bam

Alignment visualisation

Alignements described in BAM files can easily be viewed using IGV (Integrative Genomics Viewer), a very useful interactive tool for the visual exploration of genomic data.

Visualising the alignments is not part of the processing pipeline but sometimes it can be useful..

IGV requires that BAM files have an associated index file55 Indexing a BAM file allows IGV to quickly extract alignments overlapping particular genomic regions. Samtools can be used to index the BAM file. The following command would create the indexed SampleName.bam.bai file.

samtools index SampleName.bam

Bam file can be loaded in IGV66 Note that the index file must reside in the same directory as the bam file that it indexes, and it must have the same basename. Below is an example of visualisation of a bam file using IGV, focusing on CDKN1A gene.

Loading a BAM file creates 3 associated tracks:

Coverage Track to view depth of coverage
Splice Junction Track which provides an alternative view of reads spanning splice junctions
Alignment Track to view individual aligned reads

3.6 Counting

Now comes the quantification step. It can be performed with several tools such as htseq-count or FeatureCounts. We will describe the latter as it runs much faster. It was developed for counting reads to genomic features such as genes, exons, promoters and genomic bins. FeatureCounts takes as input SAM/BAM files and an annotation file (such as a gtf file) containing chromosomal coordinates of features.

The Gene Transfer Format (GTF) is a widely used format for storing gene annotations. Human and mouse GTF files can be downloaded easily from Ensembl or from the UCSC table browser. The first few lines of gtf annotation file for human genome assembly GRCh38 look like this:

FeatureCounts will count the number of uniquely mapped reads assigned to features (e.g. an exon) or meta-features (e.g. a gene) of the gtf file. When summarizing reads at gene level, read counts obtained for exons belonging to the same gene will be added up to yield the read count for the corresponding gene. Note that an exon-spanning read will be counted only once for the corresponding gene even if it overlaps with more than one exon.

FeatureCounts could be launched with the following command. As usual, many of other parameters can be fine-tuned (see the manual for more details).

featureCounts \
-p \                        # paired-end
--countReadPairs \          # fragments counted instead of reads
-a annotations.gtf \        # name of the annotation file
-t exon \                   # feature type in GTF annotation
-g gene_id \                # Meta-features used in GTF annotation
-s 0 \                      # strand-specificity
-o SampleName_counts.tsv \  # Name of the output file
SampleName.bam              # bam file

Here is an example of featurecounts output file. The last column gives the raw counts of each gene in that sample.

counts_result <- read_tsv(file = "wsbim2122_data/count_data/sample1_counts.tsv.gz")
datatable(head(counts_result))

► Question

Inspect the count table.

How many genes are tested ?
How many reads were assigned to genes?
Inspect the counts. What is the maximum value? How many genes have zero counts ?
Plot the distribution of counts. How does it look like?

► Solution

Inspect the count table.

How many genes are tested ?

nrow(counts_result)

## [1] 58735

How many reads were assigned to genes?

counts_result <- counts_result %>%
  rename(counts = "../processed_data/bam/sample1.bam")
sum(counts_result$counts)

## [1] 10527830

Inspect the counts. What is the maximum value? How many genes have zero counts ?

summary(counts_result$counts)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.0      0.0    179.2      6.0 143371.0

counts_result %>%
  filter(counts == 0) %>%
  nrow()

## [1] 36035

Plot the distribution of counts. How does it look like?

counts_result %>%
  ggplot(aes(x = log2(counts + 1))) +
  geom_histogram(bins = 20)

Note that a “pseudo-count” was added before the log2 transformation of the counts, to avoid elimination of counts equal to zero.

3.7 Preparing the count matrix

The aim of an RNA-seq experiment is often to compare different types of cells, or to compare treated cells to control cells, to see if some genes are up- or down- regulated.

Once fastq files from each sample have been processed into raw counts, results can be merged in a single count matrix, were each column represents a sample, and each line represents a gene. This count matrix will be the starting point of a differential expression analysis.

##                 sample1 sample2 sample3 sample4 sample5 sample6
## ENSG00000223972       0       0       0       0       0       1
## ENSG00000227232      14      28      17      40      16      13
## ENSG00000278267       8       4       3       1       1       6
## ENSG00000243485       0       0       0       0       0       0
## ENSG00000284332       0       0       0       0       0       0
## ENSG00000237613       0       0       0       0       0       0

► Question

Imagine you had initially 6 samples (3 control and 3 treated samples) that you have processed independently giving the 6 counts files generated by FeatureCounts located in the folder wsbim_data/count_data/. Use these files to create a count matrix.

► Solution

samples <- list.files("wsbim2122_data/count_data",
                      pattern = "*tsv.gz",
                      full.names = TRUE)
samples

## [1] "wsbim2122_data/count_data/sample1_counts.tsv.gz"
## [2] "wsbim2122_data/count_data/sample2_counts.tsv.gz"
## [3] "wsbim2122_data/count_data/sample3_counts.tsv.gz"
## [4] "wsbim2122_data/count_data/sample4_counts.tsv.gz"
## [5] "wsbim2122_data/count_data/sample5_counts.tsv.gz"
## [6] "wsbim2122_data/count_data/sample6_counts.tsv.gz"

counts <- read_tsv(file = samples[1]) %>%
    select(Geneid, ends_with('.bam'))

for (sample in samples[-1]){
  tmp <- read_tsv(sample) %>%
    select(ends_with('.bam'))
  counts <- cbind(counts, tmp)
}

names(counts) <- sub(pattern = ".bam$", '', names(counts))
names(counts) <- sub(pattern = "../processed_data/bam/", '', names(counts))
head(counts)

##            Geneid sample1 sample2 sample3 sample4 sample5 sample6
## 1 ENSG00000223972       0       0       0       0       0       1
## 2 ENSG00000227232      14      28      17      40      16      13
## 3 ENSG00000278267       8       4       3       1       1       6
## 4 ENSG00000243485       0       0       0       0       0       0
## 5 ENSG00000284332       0       0       0       0       0       0
## 6 ENSG00000237613       0       0       0       0       0       0

As differential expression analysis addresses relatively complex statistical concepts, it will be covered in the next chapter.

Page built: 2025-11-07 using R version 4.5.0 (2025-04-11)

Chapter 3 High throughput sequencing

3.1 RNA-seq library preparationCopy link

3.1.1 Stranded versus unstranded librariesCopy link

3.2 SequencingCopy link

3.2.1 SE vs PE sequencingCopy link

3.2.2 cluster amplificationCopy link

3.3 fastq filesCopy link

3.4 Processing pipelineCopy link

3.4.1 Quality controlCopy link

3.5 AlignmentCopy link

3.6 CountingCopy link

3.7 Preparing the count matrixCopy link

3.1 RNA-seq library preparation

3.1.1 Stranded versus unstranded libraries

3.2 Sequencing

3.2.1 SE vs PE sequencing

3.2.2 cluster amplification

3.3 fastq files

3.4 Processing pipeline

3.4.1 Quality control

3.5 Alignment

3.6 Counting

3.7 Preparing the count matrix