Identify and remove contaminants

Here, we apply the decontam R package to identify and remove contaminants from the metabarcoding feature table (based on the decontam vignette).

Alongside with the biological samples, the metabarcoding workflow needs to include also control samples to assess the potential contaminations due to the sensitivity of the DNA-based methods. The control sample can include field controls (e.g., a sample that should contain no target DNA and is exposed to the same field conditions and handling as real samples), blank DNA extractions (i.e., no sample substrate added to DNA extraction tubes) and blank PCR reactions (i.e., no reaction with template DNA).

In many cases, even if the control samples appear to be clean, they may still contain some sequences in the final feature table. For smaller projects, including only few samples, the decontamination process may be easy via visual inspections. However, for larger projects, including many samples, the decontamination process may be more challenging, and here tools such as decontam can be very helpful.

Input data

The input data for the decontamination process include a metabarcoding feature table (ASV or OTU table, with control samples included) and sample metadata. Then, later, the identified contaminated features can be removed also from the taxonomy table and fasta file.

The metabarcoding feature table (per sequencing run/per batch processed simultaneously) with sequence abundances (example):

	sample1	sample2	sample3	s_ExtrBlank	s_FiledBlank	s_pcrBlank
OTU_01	4290	40550	3902	0	0	0
OTU_02	550	34501	2	0	0	0
OTU_03	0	0	0	136	7	201
OTU_04	3061	0	3465	1	0	0
…	…	…	…	…	…	…

control samples

Samples s_ExtrBlank, s_FiledBlank and s_pcrBlank are control samples for DNA extraction, field blank and PCR blank, respectively. They are used to assess the potential contaminations.

In the below script, this table is represented as tab-delimited OTU_table.txt file.

The sample metadata (example):

	DNA_concentration	sample_type	seq_counts
sample1	19.3	true_sample	133902
sample2	20.1	true_sample	155423
sample3	20	true_sample	250432
s_ExtrBlank	0.12	control_sample	247
s_FiledBlank	1.3	control_sample	34
s_pcrBlank	0.19	control_sample	325
…	…	…	…

The DNA_concentration is optional and can be included if available. This represents the DNA concentration of the sample prior pooling the samples into a single library for sequencing. The sample_type indicates the type of the sample: true_sample or control_sample. The seq_counts represents the number of sequences assigned to the sample.

Note

sample_type and seq_counts columns can be generated from the metabarcoding feature table (see below). So, essentially, no specific metadata file is required prior decontam if DNA concentration is not available.

Decontamination process

The decontamination process is performed using the decontam R package. Here, we do not use DNA concentration data, but only sequence counts and sample type (true_sample or control_sample). The procedure follows “prevalence” method: the presence/absence of each feature in true samples is compared to the presence/absence in control samples to identify contaminants. More information about that here: decontam vignette and in the decontam paper.

Note

Below workflow is for a single sequencing run/per batch processed simultaneously. If there are multiple sequencing runs with corresponding control samples, then those sequencing runs should be decontaminated independently.

decontamination

#!/usr/bin/Rscript

# install decontam package if not already installed
if (!requireNamespace("decontam", quietly = TRUE)) {
BiocManager::install("decontam")
} else {
cat("decontam is already installed, version",
     as.character(packageVersion("decontam")), "\n")
}

# load packages
library(decontam)
library(dplyr)
library(ggplot2)
set.seed(1)

# read OTU table
# tab-delimted txt file, samples as columns, OTUs in rows
OTU_table = read.table("OTU_table.txt",
                     header = TRUE,
                     sep = "\t",
                     row.names = 1)

#------------------------#
### Make metadata file ###
#------------------------#
# add seq_counts columns per sample
seq_counts <- colSums(OTU_table)

metadata <- data.frame(
sample_id = names(seq_counts),
seq_counts = seq_counts
)

# add sample_type column
metadata$sample_type <- ifelse(
grepl("^s_(ExtrBlank|FiledBlank|pcrBlank)", metadata$sample_id),
"control_sample",
"true_sample"
)

# sort by seq_counts
metadata <- metadata[order(metadata$seq_counts),]
# add Index column
metadata$Index <- seq(nrow(metadata))

# check metadata
head(metadata)


### check summary statistics
summary_stats <- metadata %>%
group_by(sample_type) %>%
summarise(
     n_samples = n(),
     total_sequences = sum(seq_counts),
     mean_sequences = mean(seq_counts),
     .groups = 'drop'
)
print(summary_stats)

#-----------------------#
### find contaminants ###
#-----------------------#
# ensure sample order matches
metadata <- metadata[match(colnames(OTU_table), metadata$sample_id), ]

# define control samples for decontam
is.neg <- metadata$sample_type == "control_sample"

# run decontam with default settings
# note that t(OTU_table) is used to transpose the OTU table for decontam
contam_result <- isContaminant(t(OTU_table),
                             method = "prevalence",
                             threshold = 0.1,
                             neg = is.neg)

# check how many OTUs are considered as contaminants
table(contam_result$contaminant)

#------------------------#
### check contaminants ###
#------------------------#
# transform OTU table presence/absence data
otu_pa <- OTU_table
otu_pa[otu_pa > 0] <- 1

# make dataframe for plotting
pos_sample_ids <- na.omit(metadata$sample_id[metadata$sample_type == "true_sample"])
neg_sample_ids <- na.omit(metadata$sample_id[metadata$sample_type == "control_sample"])
otu_pa_pos <- otu_pa[, pos_sample_ids]
otu_pa_neg <- otu_pa[, neg_sample_ids]
pa_neg <- rowSums(otu_pa_neg)
pa_pos <- rowSums(otu_pa_pos)
df.pa <- data.frame(
pa.neg = pa_neg,
pa.pos = pa_pos,
contaminant = contam_result$contaminant
)

# visualize contaminant prevalence
ggplot(data = df.pa, aes(x = pa.neg,
                         y = pa.pos,
                         color = contaminant)) +
geom_point() +
xlab("Prevalence of negative controls") +
ylab("Prevalence of true samples") +
theme_minimal()

#-------------------------#
### remove contaminants ###
#-------------------------#
# get contaminant OTU names
contaminants <- rownames(contam_result)[contam_result$contaminant == TRUE]

# filter out contaminant OTUs
OTU_table_decontam <- OTU_table[!rownames(OTU_table) %in% contaminants, ]

#----------------------------#
### remove control samples ###
#----------------------------#
# Get true sample IDs
true_samples <- metadata %>%
filter(sample_type == "true_sample") %>%
pull(sample_id)

# filter OTU table to include only true_samples
OTU_table_decontam <- OTU_table_decontam[, colnames(OTU_table_decontam) %in% true_samples]

# Remove OTUs with zero sequences (if any)
OTU_table_decontam <- OTU_table_decontam[rowSums(OTU_table_decontam) > 0, ]

# Summary
initial_number_of_OTUs = nrow(OTU_table)
number_of_OTUs_after_excluding_controls = nrow(OTU_table_decontam)
cat("Removed", initial_number_of_OTUs-number_of_OTUs_after_excluding_controls,
     "contaminant OTUs\n")
cat("OTUs left in the table:", number_of_OTUs_after_excluding_controls)

### Write decontaminated OTU table to Excel file
library(openxlsx)
write.xlsx(OTU_table_decontam,
         "OTU_table_decontam.xlsx",
         rowNames = TRUE)

# write the list of contaminant OTUs to a file;
# this will be used to filter also the taxonomy table
write.xlsx(as.data.frame(contaminants),
         "contaminants.xlsx",
         rowNames = FALSE)

The decontaminated OTU table is written to an Excel file (in your output directory).

Now, the OTUs that were considered as contaminants should be removed also from the taxonomy table and fasta file.

Example of an input taxonomy table:

	Order	Family	Genus	species
OTU_01	Lepidoptera	Nymphalidae	Aglais	Aglais urticae
OTU_02	Lepidoptera	Nymphalidae	Genus_0022	Genus_0022_sp
OTU_03	Carnivora	Felidae	Felis	Felis catus
OTU_04	Sarcoptiformes	Pyroglyphidae	Dermatophagoides	Dermatophagoides_sp
…	…	…	…	…

In the below script, this table is represented as tab-delimited taxonomy_table.txt file.

Example of an input fasta file:

fasta file

>OTU_01
ACTTTATATTTTATTTTTGGAATTTGAGCAGGAATAGTAGGAACTTCTCTTAGTTTATTAATTCG...
>OTU_02
ATAGTAGGAACATCTCTTAGTTTATTAATTCGAACTGAACTAGGAAATCCAGGTTCACTTATTGG...
>OTU_03
TCTTTACCTTTTATTCGGTGCCTGAGCTGGCATGGTGGGGACTGCTCTTAGTCTTCTAATCCGGG...
>OTU_04
TACTTTGTATTTTGTTTTTGGGGTGTGATCTGGTATGTTGGGGACTAGGTTCAGAAGACTAATTC...
...

In the below script, this file is represented as OTUs.fasta.

remove contaminants from the taxonomy table

#!/usr/bin/Rscript

# load taxonomy table
taxonomy = read.table("taxonomy_table.txt",
                     header = TRUE,
                     sep = "\t",
                     row.names = 1)

# load list of contaminant OTUs
library(openxlsx)
contaminants = read.xlsx("contaminants.xlsx",
                         rowNames = FALSE,
                         colNames = TRUE)

# extract contaminant OTU names (assuming first column is header)
contaminants = contaminants[, 1]

#---------------------------#
### Filter taxonomy table ###
#---------------------------#
# filter out contaminant OTUs from taxonomy table
taxonomy_filtered = taxonomy[!rownames(taxonomy) %in% contaminants, ]

cat("Taxonomy table - Initial OTUs:", nrow(taxonomy), "\n")
cat("Taxonomy table - After filtering:", nrow(taxonomy_filtered), "\n")
cat("Removed", nrow(taxonomy) - nrow(taxonomy_filtered),
     "contaminant OTUs from taxonomy table\n")

# write filtered taxonomy table
write.table(taxonomy_filtered,
             "taxonomy_table_decontam.txt",
             sep = "\t",
             quote = FALSE,
             row.names = TRUE,
             col.names = NA)

#-----------------------#
### Filter fasta file ###
#-----------------------#
library(seqinr)

# load fasta file
OTUs_fasta = seqinr::read.fasta("OTUs.fasta")

# filter out contaminant OTUs from fasta file
OTUs_fasta_filtered = OTUs_fasta[!names(OTUs_fasta) %in% contaminants]

cat("\nFasta file - Initial sequences:", length(OTUs_fasta), "\n")
cat("Fasta file - After filtering:", length(OTUs_fasta_filtered), "\n")
cat("Removed", length(OTUs_fasta) - length(OTUs_fasta_filtered),
               "contaminant sequences from fasta file\n")

# write filtered fasta file
write.fasta(sequences = OTUs_fasta_filtered,
             names = names(OTUs_fasta_filtered),
             file.out = "OTUs_decontam.fasta",
             nbchar = 999)

output

The final output data is OTU_table_decontam.txt, taxonomy_table_decontam.txt and OTUs_decontam.fasta. Additionally, the list of contaminant OTUs is written to contaminants.xlsx.