Identify and remove contaminants
Here, we apply the decontam R package to identify and remove contaminants from the metabarcoding feature table (based on the decontam vignette).
Alongside with the biological samples, the metabarcoding workflow needs to include also control samples to assess the potential contaminations due to the sensitivity of the DNA-based methods. The control sample can include field controls (e.g., a sample that should contain no target DNA and is exposed to the same field conditions and handling as real samples), blank DNA extractions (i.e., no sample substrate added to DNA extraction tubes) and blank PCR reactions (i.e., no reaction with template DNA).
In many cases, even if the control samples appear to be clean, they may still contain some sequences in the final feature table. For smaller projects, including only few samples, the decontamination process may be easy via visual inspections. However, for larger projects, including many samples, the decontamination process may be more challenging, and here tools such as decontam can be very helpful.
Input data
The input data for the decontamination process include a metabarcoding feature table (ASV or OTU table, with control samples included) and sample metadata. Then, later, the identified contaminated features can be removed also from the taxonomy table and fasta file.
The metabarcoding feature table (per sequencing run/per batch processed simultaneously) with sequence abundances (example):
sample1 |
sample2 |
sample3 |
s_ExtrBlank |
s_FiledBlank |
s_pcrBlank |
|
OTU_01 |
4290 |
40550 |
3902 |
0 |
0 |
0 |
OTU_02 |
550 |
34501 |
2 |
0 |
0 |
0 |
OTU_03 |
0 |
0 |
0 |
136 |
7 |
201 |
OTU_04 |
3061 |
0 |
3465 |
1 |
0 |
0 |
… |
… |
… |
… |
… |
… |
… |
control samples
Samples s_ExtrBlank, s_FiledBlank and s_pcrBlank are control samples for DNA extraction, field blank and PCR blank, respectively. They are used to assess the potential contaminations.
In the below script, this table is represented as tab-delimited OTU_table.txt file.
The sample metadata (example):
DNA_concentration |
sample_type |
seq_counts |
|
sample1 |
19.3 |
true_sample |
133902 |
sample2 |
20.1 |
true_sample |
155423 |
sample3 |
20 |
true_sample |
250432 |
s_ExtrBlank |
0.12 |
control_sample |
247 |
s_FiledBlank |
1.3 |
control_sample |
34 |
s_pcrBlank |
0.19 |
control_sample |
325 |
… |
… |
… |
… |
The DNA_concentration is optional and can be included if available. This represents the DNA concentration of the sample
prior pooling the samples into a single library for sequencing.
The sample_type indicates the type of the sample: true_sample or control_sample.
The seq_counts represents the number of sequences assigned to the sample.
Note
sample_type and seq_counts columns can be generated from the metabarcoding feature table (see below).
So, essentially, no specific metadata file is required prior decontam if DNA concentration is not available.
Decontamination process
The decontamination process is performed using the decontam R package. Here, we do not use DNA concentration data, but only sequence counts and sample type (true_sample or control_sample). The procedure follows “prevalence” method: the presence/absence of each feature in true samples is compared to the presence/absence in control samples to identify contaminants. More information about that here: decontam vignette and in the decontam paper.
Note
Below workflow is for a single sequencing run/per batch processed simultaneously. If there are multiple sequencing runs with corresponding control samples, then those sequencing runs should be decontaminated independently.
1#!/usr/bin/Rscript
2
3# install decontam package if not already installed
4if (!requireNamespace("decontam", quietly = TRUE)) {
5BiocManager::install("decontam")
6} else {
7cat("decontam is already installed, version",
8 as.character(packageVersion("decontam")), "\n")
9}
10
11# load packages
12library(decontam)
13library(dplyr)
14library(ggplot2)
15set.seed(1)
16
17# read OTU table
18# tab-delimted txt file, samples as columns, OTUs in rows
19OTU_table = read.table("OTU_table.txt",
20 header = TRUE,
21 sep = "\t",
22 row.names = 1)
23
24#------------------------#
25### Make metadata file ###
26#------------------------#
27# add seq_counts columns per sample
28seq_counts <- colSums(OTU_table)
29
30metadata <- data.frame(
31sample_id = names(seq_counts),
32seq_counts = seq_counts
33)
34
35# add sample_type column
36metadata$sample_type <- ifelse(
37grepl("^s_(ExtrBlank|FiledBlank|pcrBlank)", metadata$sample_id),
38"control_sample",
39"true_sample"
40)
41
42# sort by seq_counts
43metadata <- metadata[order(metadata$seq_counts),]
44# add Index column
45metadata$Index <- seq(nrow(metadata))
46
47# check metadata
48head(metadata)
49
50
51### check summary statistics
52summary_stats <- metadata %>%
53group_by(sample_type) %>%
54summarise(
55 n_samples = n(),
56 total_sequences = sum(seq_counts),
57 mean_sequences = mean(seq_counts),
58 .groups = 'drop'
59)
60print(summary_stats)
61
62#-----------------------#
63### find contaminants ###
64#-----------------------#
65# ensure sample order matches
66metadata <- metadata[match(colnames(OTU_table), metadata$sample_id), ]
67
68# define control samples for decontam
69is.neg <- metadata$sample_type == "control_sample"
70
71# run decontam with default settings
72# note that t(OTU_table) is used to transpose the OTU table for decontam
73contam_result <- isContaminant(t(OTU_table),
74 method = "prevalence",
75 threshold = 0.1,
76 neg = is.neg)
77
78# check how many OTUs are considered as contaminants
79table(contam_result$contaminant)
80
81#------------------------#
82### check contaminants ###
83#------------------------#
84# transform OTU table presence/absence data
85otu_pa <- OTU_table
86otu_pa[otu_pa > 0] <- 1
87
88# make dataframe for plotting
89pos_sample_ids <- na.omit(metadata$sample_id[metadata$sample_type == "true_sample"])
90neg_sample_ids <- na.omit(metadata$sample_id[metadata$sample_type == "control_sample"])
91otu_pa_pos <- otu_pa[, pos_sample_ids]
92otu_pa_neg <- otu_pa[, neg_sample_ids]
93pa_neg <- rowSums(otu_pa_neg)
94pa_pos <- rowSums(otu_pa_pos)
95df.pa <- data.frame(
96pa.neg = pa_neg,
97pa.pos = pa_pos,
98contaminant = contam_result$contaminant
99)
100
101# visualize contaminant prevalence
102ggplot(data = df.pa, aes(x = pa.neg,
103 y = pa.pos,
104 color = contaminant)) +
105geom_point() +
106xlab("Prevalence of negative controls") +
107ylab("Prevalence of true samples") +
108theme_minimal()
109
110#-------------------------#
111### remove contaminants ###
112#-------------------------#
113# get contaminant OTU names
114contaminants <- rownames(contam_result)[contam_result$contaminant == TRUE]
115
116# filter out contaminant OTUs
117OTU_table_decontam <- OTU_table[!rownames(OTU_table) %in% contaminants, ]
118
119#----------------------------#
120### remove control samples ###
121#----------------------------#
122# Get true sample IDs
123true_samples <- metadata %>%
124filter(sample_type == "true_sample") %>%
125pull(sample_id)
126
127# filter OTU table to include only true_samples
128OTU_table_decontam <- OTU_table_decontam[, colnames(OTU_table_decontam) %in% true_samples]
129
130# Remove OTUs with zero sequences (if any)
131OTU_table_decontam <- OTU_table_decontam[rowSums(OTU_table_decontam) > 0, ]
132
133# Summary
134initial_number_of_OTUs = nrow(OTU_table)
135number_of_OTUs_after_excluding_controls = nrow(OTU_table_decontam)
136cat("Removed", initial_number_of_OTUs-number_of_OTUs_after_excluding_controls,
137 "contaminant OTUs\n")
138cat("OTUs left in the table:", number_of_OTUs_after_excluding_controls)
139
140### Write decontaminated OTU table to Excel file
141library(openxlsx)
142write.xlsx(OTU_table_decontam,
143 "OTU_table_decontam.xlsx",
144 rowNames = TRUE)
145
146# write the list of contaminant OTUs to a file;
147# this will be used to filter also the taxonomy table
148write.xlsx(as.data.frame(contaminants),
149 "contaminants.xlsx",
150 rowNames = FALSE)
The decontaminated OTU table is written to an Excel file (in your output directory).
Now, the OTUs that were considered as contaminants should be removed also from the taxonomy table and fasta file.
Example of an input taxonomy table:
Order |
Family |
Genus |
species |
|
OTU_01 |
Lepidoptera |
Nymphalidae |
Aglais |
Aglais urticae |
OTU_02 |
Lepidoptera |
Nymphalidae |
Genus_0022 |
Genus_0022_sp |
OTU_03 |
Carnivora |
Felidae |
Felis |
Felis catus |
OTU_04 |
Sarcoptiformes |
Pyroglyphidae |
Dermatophagoides |
Dermatophagoides_sp |
… |
… |
… |
… |
… |
In the below script, this table is represented as tab-delimited taxonomy_table.txt file.
Example of an input fasta file:
fasta file
>OTU_01
ACTTTATATTTTATTTTTGGAATTTGAGCAGGAATAGTAGGAACTTCTCTTAGTTTATTAATTCG...
>OTU_02
ATAGTAGGAACATCTCTTAGTTTATTAATTCGAACTGAACTAGGAAATCCAGGTTCACTTATTGG...
>OTU_03
TCTTTACCTTTTATTCGGTGCCTGAGCTGGCATGGTGGGGACTGCTCTTAGTCTTCTAATCCGGG...
>OTU_04
TACTTTGTATTTTGTTTTTGGGGTGTGATCTGGTATGTTGGGGACTAGGTTCAGAAGACTAATTC...
...
In the below script, this file is represented as OTUs.fasta.
1#!/usr/bin/Rscript
2
3# load taxonomy table
4taxonomy = read.table("taxonomy_table.txt",
5 header = TRUE,
6 sep = "\t",
7 row.names = 1)
8
9# load list of contaminant OTUs
10library(openxlsx)
11contaminants = read.xlsx("contaminants.xlsx",
12 rowNames = FALSE,
13 colNames = TRUE)
14
15# extract contaminant OTU names (assuming first column is header)
16contaminants = contaminants[, 1]
17
18#---------------------------#
19### Filter taxonomy table ###
20#---------------------------#
21# filter out contaminant OTUs from taxonomy table
22taxonomy_filtered = taxonomy[!rownames(taxonomy) %in% contaminants, ]
23
24cat("Taxonomy table - Initial OTUs:", nrow(taxonomy), "\n")
25cat("Taxonomy table - After filtering:", nrow(taxonomy_filtered), "\n")
26cat("Removed", nrow(taxonomy) - nrow(taxonomy_filtered),
27 "contaminant OTUs from taxonomy table\n")
28
29# write filtered taxonomy table
30write.table(taxonomy_filtered,
31 "taxonomy_table_decontam.txt",
32 sep = "\t",
33 quote = FALSE,
34 row.names = TRUE,
35 col.names = NA)
36
37#-----------------------#
38### Filter fasta file ###
39#-----------------------#
40library(seqinr)
41
42# load fasta file
43OTUs_fasta = seqinr::read.fasta("OTUs.fasta")
44
45# filter out contaminant OTUs from fasta file
46OTUs_fasta_filtered = OTUs_fasta[!names(OTUs_fasta) %in% contaminants]
47
48cat("\nFasta file - Initial sequences:", length(OTUs_fasta), "\n")
49cat("Fasta file - After filtering:", length(OTUs_fasta_filtered), "\n")
50cat("Removed", length(OTUs_fasta) - length(OTUs_fasta_filtered),
51 "contaminant sequences from fasta file\n")
52
53# write filtered fasta file
54write.fasta(sequences = OTUs_fasta_filtered,
55 names = names(OTUs_fasta_filtered),
56 file.out = "OTUs_decontam.fasta",
57 nbchar = 999)
output
The final output data is OTU_table_decontam.txt,
taxonomy_table_decontam.txt and OTUs_decontam.fasta.
Additionally, the list of contaminant OTUs is written to contaminants.xlsx.



