Alternative text

Identify and remove contaminants

Here, we apply the decontam R package to identify and remove contaminants from the metabarcoding feature table (based on the decontam vignette).

Alongside with the biological samples, the metabarcoding workflow needs to include also control samples to assess the potential contaminations due to the sensitivity of the DNA-based methods. The control sample can include field controls (e.g., a sample that should contain no target DNA and is exposed to the same field conditions and handling as real samples), blank DNA extractions (i.e., no sample substrate added to DNA extraction tubes) and blank PCR reactions (i.e., no reaction with template DNA).

In many cases, even if the control samples appear to be clean, they may still contain some sequences in the final feature table. For smaller projects, including only few samples, the decontamination process may be easy via visual inspections. However, for larger projects, including many samples, the decontamination process may be more challenging, and here tools such as decontam can be very helpful.


Input data

The input data for the decontamination process include a metabarcoding feature table (ASV or OTU table, with control samples included) and sample metadata. Then, later, the identified contaminated features can be removed also from the taxonomy table and fasta file.

  • The metabarcoding feature table (per sequencing run/per batch processed simultaneously) with sequence abundances (example):

sample1

sample2

sample3

s_ExtrBlank

s_FiledBlank

s_pcrBlank

OTU_01

4290

40550

3902

0

0

0

OTU_02

550

34501

2

0

0

0

OTU_03

0

0

0

136

7

201

OTU_04

3061

0

3465

1

0

0

control samples

Samples s_ExtrBlank, s_FiledBlank and s_pcrBlank are control samples for DNA extraction, field blank and PCR blank, respectively. They are used to assess the potential contaminations.

In the below script, this table is represented as tab-delimited OTU_table.txt file.

  • The sample metadata (example):

DNA_concentration

sample_type

seq_counts

sample1

19.3

true_sample

133902

sample2

20.1

true_sample

155423

sample3

20

true_sample

250432

s_ExtrBlank

0.12

control_sample

247

s_FiledBlank

1.3

control_sample

34

s_pcrBlank

0.19

control_sample

325

The DNA_concentration is optional and can be included if available. This represents the DNA concentration of the sample prior pooling the samples into a single library for sequencing. The sample_type indicates the type of the sample: true_sample or control_sample. The seq_counts represents the number of sequences assigned to the sample.

Note

sample_type and seq_counts columns can be generated from the metabarcoding feature table (see below). So, essentially, no specific metadata file is required prior decontam if DNA concentration is not available.


Decontamination process

The decontamination process is performed using the decontam R package. Here, we do not use DNA concentration data, but only sequence counts and sample type (true_sample or control_sample). The procedure follows “prevalence” method: the presence/absence of each feature in true samples is compared to the presence/absence in control samples to identify contaminants. More information about that here: decontam vignette and in the decontam paper.

Note

Below workflow is for a single sequencing run/per batch processed simultaneously. If there are multiple sequencing runs with corresponding control samples, then those sequencing runs should be decontaminated independently.

decontamination
  1#!/usr/bin/Rscript
  2
  3# install decontam package if not already installed
  4if (!requireNamespace("decontam", quietly = TRUE)) {
  5BiocManager::install("decontam")
  6} else {
  7cat("decontam is already installed, version",
  8     as.character(packageVersion("decontam")), "\n")
  9}
 10
 11# load packages
 12library(decontam)
 13library(dplyr)
 14library(ggplot2)
 15set.seed(1)
 16
 17# read OTU table
 18# tab-delimted txt file, samples as columns, OTUs in rows
 19OTU_table = read.table("OTU_table.txt",
 20                     header = TRUE,
 21                     sep = "\t",
 22                     row.names = 1)
 23
 24#------------------------#
 25### Make metadata file ###
 26#------------------------#
 27# add seq_counts columns per sample
 28seq_counts <- colSums(OTU_table)
 29
 30metadata <- data.frame(
 31sample_id = names(seq_counts),
 32seq_counts = seq_counts
 33)
 34
 35# add sample_type column
 36metadata$sample_type <- ifelse(
 37grepl("^s_(ExtrBlank|FiledBlank|pcrBlank)", metadata$sample_id),
 38"control_sample",
 39"true_sample"
 40)
 41
 42# sort by seq_counts
 43metadata <- metadata[order(metadata$seq_counts),]
 44# add Index column
 45metadata$Index <- seq(nrow(metadata))
 46
 47# check metadata
 48head(metadata)
 49
 50
 51### check summary statistics
 52summary_stats <- metadata %>%
 53group_by(sample_type) %>%
 54summarise(
 55     n_samples = n(),
 56     total_sequences = sum(seq_counts),
 57     mean_sequences = mean(seq_counts),
 58     .groups = 'drop'
 59)
 60print(summary_stats)
 61
 62#-----------------------#
 63### find contaminants ###
 64#-----------------------#
 65# ensure sample order matches
 66metadata <- metadata[match(colnames(OTU_table), metadata$sample_id), ]
 67
 68# define control samples for decontam
 69is.neg <- metadata$sample_type == "control_sample"
 70
 71# run decontam with default settings
 72# note that t(OTU_table) is used to transpose the OTU table for decontam
 73contam_result <- isContaminant(t(OTU_table),
 74                             method = "prevalence",
 75                             threshold = 0.1,
 76                             neg = is.neg)
 77
 78# check how many OTUs are considered as contaminants
 79table(contam_result$contaminant)
 80
 81#------------------------#
 82### check contaminants ###
 83#------------------------#
 84# transform OTU table presence/absence data
 85otu_pa <- OTU_table
 86otu_pa[otu_pa > 0] <- 1
 87
 88# make dataframe for plotting
 89pos_sample_ids <- na.omit(metadata$sample_id[metadata$sample_type == "true_sample"])
 90neg_sample_ids <- na.omit(metadata$sample_id[metadata$sample_type == "control_sample"])
 91otu_pa_pos <- otu_pa[, pos_sample_ids]
 92otu_pa_neg <- otu_pa[, neg_sample_ids]
 93pa_neg <- rowSums(otu_pa_neg)
 94pa_pos <- rowSums(otu_pa_pos)
 95df.pa <- data.frame(
 96pa.neg = pa_neg,
 97pa.pos = pa_pos,
 98contaminant = contam_result$contaminant
 99)
100
101# visualize contaminant prevalence
102ggplot(data = df.pa, aes(x = pa.neg,
103                         y = pa.pos,
104                         color = contaminant)) +
105geom_point() +
106xlab("Prevalence of negative controls") +
107ylab("Prevalence of true samples") +
108theme_minimal()
109
110#-------------------------#
111### remove contaminants ###
112#-------------------------#
113# get contaminant OTU names
114contaminants <- rownames(contam_result)[contam_result$contaminant == TRUE]
115
116# filter out contaminant OTUs
117OTU_table_decontam <- OTU_table[!rownames(OTU_table) %in% contaminants, ]
118
119#----------------------------#
120### remove control samples ###
121#----------------------------#
122# Get true sample IDs
123true_samples <- metadata %>%
124filter(sample_type == "true_sample") %>%
125pull(sample_id)
126
127# filter OTU table to include only true_samples
128OTU_table_decontam <- OTU_table_decontam[, colnames(OTU_table_decontam) %in% true_samples]
129
130# Remove OTUs with zero sequences (if any)
131OTU_table_decontam <- OTU_table_decontam[rowSums(OTU_table_decontam) > 0, ]
132
133# Summary
134initial_number_of_OTUs = nrow(OTU_table)
135number_of_OTUs_after_excluding_controls = nrow(OTU_table_decontam)
136cat("Removed", initial_number_of_OTUs-number_of_OTUs_after_excluding_controls,
137     "contaminant OTUs\n")
138cat("OTUs left in the table:", number_of_OTUs_after_excluding_controls)
139
140### Write decontaminated OTU table to Excel file
141library(openxlsx)
142write.xlsx(OTU_table_decontam,
143         "OTU_table_decontam.xlsx",
144         rowNames = TRUE)
145
146# write the list of contaminant OTUs to a file;
147# this will be used to filter also the taxonomy table
148write.xlsx(as.data.frame(contaminants),
149         "contaminants.xlsx",
150         rowNames = FALSE)

The decontaminated OTU table is written to an Excel file (in your output directory).


Now, the OTUs that were considered as contaminants should be removed also from the taxonomy table and fasta file.

  • Example of an input taxonomy table:

Order

Family

Genus

species

OTU_01

Lepidoptera

Nymphalidae

Aglais

Aglais urticae

OTU_02

Lepidoptera

Nymphalidae

Genus_0022

Genus_0022_sp

OTU_03

Carnivora

Felidae

Felis

Felis catus

OTU_04

Sarcoptiformes

Pyroglyphidae

Dermatophagoides

Dermatophagoides_sp

In the below script, this table is represented as tab-delimited taxonomy_table.txt file.

  • Example of an input fasta file:

fasta file

>OTU_01
ACTTTATATTTTATTTTTGGAATTTGAGCAGGAATAGTAGGAACTTCTCTTAGTTTATTAATTCG...
>OTU_02
ATAGTAGGAACATCTCTTAGTTTATTAATTCGAACTGAACTAGGAAATCCAGGTTCACTTATTGG...
>OTU_03
TCTTTACCTTTTATTCGGTGCCTGAGCTGGCATGGTGGGGACTGCTCTTAGTCTTCTAATCCGGG...
>OTU_04
TACTTTGTATTTTGTTTTTGGGGTGTGATCTGGTATGTTGGGGACTAGGTTCAGAAGACTAATTC...
...

In the below script, this file is represented as OTUs.fasta.

remove contaminants from the taxonomy table
 1#!/usr/bin/Rscript
 2
 3# load taxonomy table
 4taxonomy = read.table("taxonomy_table.txt",
 5                     header = TRUE,
 6                     sep = "\t",
 7                     row.names = 1)
 8
 9# load list of contaminant OTUs
10library(openxlsx)
11contaminants = read.xlsx("contaminants.xlsx",
12                         rowNames = FALSE,
13                         colNames = TRUE)
14
15# extract contaminant OTU names (assuming first column is header)
16contaminants = contaminants[, 1]
17
18#---------------------------#
19### Filter taxonomy table ###
20#---------------------------#
21# filter out contaminant OTUs from taxonomy table
22taxonomy_filtered = taxonomy[!rownames(taxonomy) %in% contaminants, ]
23
24cat("Taxonomy table - Initial OTUs:", nrow(taxonomy), "\n")
25cat("Taxonomy table - After filtering:", nrow(taxonomy_filtered), "\n")
26cat("Removed", nrow(taxonomy) - nrow(taxonomy_filtered),
27     "contaminant OTUs from taxonomy table\n")
28
29# write filtered taxonomy table
30write.table(taxonomy_filtered,
31             "taxonomy_table_decontam.txt",
32             sep = "\t",
33             quote = FALSE,
34             row.names = TRUE,
35             col.names = NA)
36
37#-----------------------#
38### Filter fasta file ###
39#-----------------------#
40library(seqinr)
41
42# load fasta file
43OTUs_fasta = seqinr::read.fasta("OTUs.fasta")
44
45# filter out contaminant OTUs from fasta file
46OTUs_fasta_filtered = OTUs_fasta[!names(OTUs_fasta) %in% contaminants]
47
48cat("\nFasta file - Initial sequences:", length(OTUs_fasta), "\n")
49cat("Fasta file - After filtering:", length(OTUs_fasta_filtered), "\n")
50cat("Removed", length(OTUs_fasta) - length(OTUs_fasta_filtered),
51               "contaminant sequences from fasta file\n")
52
53# write filtered fasta file
54write.fasta(sequences = OTUs_fasta_filtered,
55             names = names(OTUs_fasta_filtered),
56             file.out = "OTUs_decontam.fasta",
57             nbchar = 999)

output

The final output data is OTU_table_decontam.txt, taxonomy_table_decontam.txt and OTUs_decontam.fasta. Additionally, the list of contaminant OTUs is written to contaminants.xlsx.


Alternative text Alternative text Alternative text Alternative text