.. |logo_BGE_alpha| image:: _static/logo_BGE_alpha.png
:width: 300
:alt: Alternative text
:target: https://biodiversitygenomics.eu/
.. |eufund| image:: _static/eu_co-funded.png
:width: 200
:alt: Alternative text
.. |chfund| image:: _static/ch-logo-200x50.png
:width: 210
:alt: Alternative text
.. |ukrifund| image:: _static/ukri-logo-200x59.png
:width: 150
:alt: Alternative text
.. |logo_BGE_small| image:: _static/logo_BGE_alpha.png
:width: 120
:alt: Alternative text
:target: https://biodiversitygenomics.eu/
.. raw:: html
.. role:: red
.. raw:: html
.. role:: yellow-background
|logo_BGE_alpha|
.. _remove_contaminants:
Identify and remove contaminants
********************************
Here, we apply the `decontam `_ R package to identify and remove contaminants from the metabarcoding feature table
(based on the `decontam vignette `_).
Alongside with the biological samples, the metabarcoding workflow needs to include also **control samples to assess the potential contaminations**
due to the sensitivity of the DNA-based methods.
The control sample can include field controls (e.g., a sample that should contain no target DNA and is exposed to the same field conditions and handling as real samples),
blank DNA extractions (i.e., no sample substrate added to DNA extraction tubes) and
blank PCR reactions (i.e., no reaction with template DNA).
In many cases, even if the control samples appear to be clean, they may still contain some sequences in the final feature table.
For smaller projects, including only few samples, the decontamination process may be easy via visual inspections.
However, **for larger projects, including many samples, the decontamination process may be more challenging, and here tools such as decontam can be very helpful**.
___________________________________________________
Input data
==========
The input data for the decontamination process include a **metabarcoding feature table** (ASV or OTU table, with control samples included) and **sample metadata**.
Then, later, the identified contaminated features can be removed also from the :ref:`taxonomy table and fasta file `.
- The metabarcoding feature table (per sequencing run/per batch processed simultaneously) with sequence abundances (example):
+------------+---------+---------+---------+--------------------+---------------------+-------------------+
| | sample1 | sample2 | sample3 | :red:`s_ExtrBlank` | :red:`s_FiledBlank` | :red:`s_pcrBlank` |
+------------+---------+---------+---------+--------------------+---------------------+-------------------+
| **OTU_01** | 4290 | 40550 | 3902 | 0 | 0 | 0 |
+------------+---------+---------+---------+--------------------+---------------------+-------------------+
| **OTU_02** | 550 | 34501 | 2 | 0 | 0 | 0 |
+------------+---------+---------+---------+--------------------+---------------------+-------------------+
| **OTU_03** | 0 | 0 | 0 | 136 | 7 | 201 |
+------------+---------+---------+---------+--------------------+---------------------+-------------------+
| **OTU_04** | 3061 | 0 | 3465 | 1 | 0 | 0 |
+------------+---------+---------+---------+--------------------+---------------------+-------------------+
| ... | ... | ... | ... | ... | ... | ... |
+------------+---------+---------+---------+--------------------+---------------------+-------------------+
.. admonition:: control samples
Samples **s_ExtrBlank**, **s_FiledBlank** and **s_pcrBlank** are control samples for DNA extraction, field blank and PCR blank, respectively.
They are used to assess the potential contaminations.
In the below script, this table is represented as **tab-delimited** ``OTU_table.txt`` file.
- The sample metadata (example):
+------------------+-------------------+----------------+------------+
| | DNA_concentration | sample_type | seq_counts |
+------------------+-------------------+----------------+------------+
| **sample1** | 19.3 | true_sample | 133902 |
+------------------+-------------------+----------------+------------+
| **sample2** | 20.1 | true_sample | 155423 |
+------------------+-------------------+----------------+------------+
| **sample3** | 20 | true_sample | 250432 |
+------------------+-------------------+----------------+------------+
| **s_ExtrBlank** | 0.12 | control_sample | 247 |
+------------------+-------------------+----------------+------------+
| **s_FiledBlank** | 1.3 | control_sample | 34 |
+------------------+-------------------+----------------+------------+
| **s_pcrBlank** | 0.19 | control_sample | 325 |
+------------------+-------------------+----------------+------------+
| ... | ... | ... | ... |
+------------------+-------------------+----------------+------------+
The ``DNA_concentration`` is **optional** and can be included if available. This represents the DNA concentration of the sample
prior pooling the samples into a single library for sequencing.
The ``sample_type`` indicates the type of the sample: true_sample or control_sample.
The ``seq_counts`` represents the number of sequences assigned to the sample.
.. note::
``sample_type`` and ``seq_counts`` columns can be generated from the metabarcoding feature table (see below).
**So, essentially, no specific metadata file is required prior** *decontam* **if DNA concentration is not available.**
___________________________________________________
Decontamination process
=======================
The decontamination process is performed using the *decontam* R package.
Here, we do not use DNA concentration data, but only sequence counts and sample type (true_sample or control_sample).
**The procedure follows "prevalence" method:**
the presence/absence of each feature in true samples is compared to the presence/absence in control samples to identify contaminants.
More information about that here: `decontam vignette `_
and in the `decontam paper `_.
.. note::
Below workflow is for a single sequencing run/per batch processed simultaneously.
**If there are multiple sequencing runs with corresponding control samples, then those sequencing runs should be decontaminated independently.**
.. code-block:: R
:caption: decontamination
:linenos:
#!/usr/bin/Rscript
# install decontam package if not already installed
if (!requireNamespace("decontam", quietly = TRUE)) {
BiocManager::install("decontam")
} else {
cat("decontam is already installed, version",
as.character(packageVersion("decontam")), "\n")
}
# load packages
library(decontam)
library(dplyr)
library(ggplot2)
set.seed(1)
# read OTU table
# tab-delimted txt file, samples as columns, OTUs in rows
OTU_table = read.table("OTU_table.txt",
header = TRUE,
sep = "\t",
row.names = 1)
#------------------------#
### Make metadata file ###
#------------------------#
# add seq_counts columns per sample
seq_counts <- colSums(OTU_table)
metadata <- data.frame(
sample_id = names(seq_counts),
seq_counts = seq_counts
)
# add sample_type column
metadata$sample_type <- ifelse(
grepl("^s_(ExtrBlank|FiledBlank|pcrBlank)", metadata$sample_id),
"control_sample",
"true_sample"
)
# sort by seq_counts
metadata <- metadata[order(metadata$seq_counts),]
# add Index column
metadata$Index <- seq(nrow(metadata))
# check metadata
head(metadata)
### check summary statistics
summary_stats <- metadata %>%
group_by(sample_type) %>%
summarise(
n_samples = n(),
total_sequences = sum(seq_counts),
mean_sequences = mean(seq_counts),
.groups = 'drop'
)
print(summary_stats)
#-----------------------#
### find contaminants ###
#-----------------------#
# ensure sample order matches
metadata <- metadata[match(colnames(OTU_table), metadata$sample_id), ]
# define control samples for decontam
is.neg <- metadata$sample_type == "control_sample"
# run decontam with default settings
# note that t(OTU_table) is used to transpose the OTU table for decontam
contam_result <- isContaminant(t(OTU_table),
method = "prevalence",
threshold = 0.1,
neg = is.neg)
# check how many OTUs are considered as contaminants
table(contam_result$contaminant)
#------------------------#
### check contaminants ###
#------------------------#
# transform OTU table presence/absence data
otu_pa <- OTU_table
otu_pa[otu_pa > 0] <- 1
# make dataframe for plotting
pos_sample_ids <- na.omit(metadata$sample_id[metadata$sample_type == "true_sample"])
neg_sample_ids <- na.omit(metadata$sample_id[metadata$sample_type == "control_sample"])
otu_pa_pos <- otu_pa[, pos_sample_ids]
otu_pa_neg <- otu_pa[, neg_sample_ids]
pa_neg <- rowSums(otu_pa_neg)
pa_pos <- rowSums(otu_pa_pos)
df.pa <- data.frame(
pa.neg = pa_neg,
pa.pos = pa_pos,
contaminant = contam_result$contaminant
)
# visualize contaminant prevalence
ggplot(data = df.pa, aes(x = pa.neg,
y = pa.pos,
color = contaminant)) +
geom_point() +
xlab("Prevalence of negative controls") +
ylab("Prevalence of true samples") +
theme_minimal()
#-------------------------#
### remove contaminants ###
#-------------------------#
# get contaminant OTU names
contaminants <- rownames(contam_result)[contam_result$contaminant == TRUE]
# filter out contaminant OTUs
OTU_table_decontam <- OTU_table[!rownames(OTU_table) %in% contaminants, ]
#----------------------------#
### remove control samples ###
#----------------------------#
# Get true sample IDs
true_samples <- metadata %>%
filter(sample_type == "true_sample") %>%
pull(sample_id)
# filter OTU table to include only true_samples
OTU_table_decontam <- OTU_table_decontam[, colnames(OTU_table_decontam) %in% true_samples]
# Remove OTUs with zero sequences (if any)
OTU_table_decontam <- OTU_table_decontam[rowSums(OTU_table_decontam) > 0, ]
# Summary
initial_number_of_OTUs = nrow(OTU_table)
number_of_OTUs_after_excluding_controls = nrow(OTU_table_decontam)
cat("Removed", initial_number_of_OTUs-number_of_OTUs_after_excluding_controls,
"contaminant OTUs\n")
cat("OTUs left in the table:", number_of_OTUs_after_excluding_controls)
### Write decontaminated OTU table to Excel file
library(openxlsx)
write.xlsx(OTU_table_decontam,
"OTU_table_decontam.xlsx",
rowNames = TRUE)
# write the list of contaminant OTUs to a file;
# this will be used to filter also the taxonomy table
write.xlsx(as.data.frame(contaminants),
"contaminants.xlsx",
rowNames = FALSE)
**The decontaminated OTU table is written to an Excel file (in your output directory).**
.. _taxonomy_table:
___________________________________________________
Now, the OTUs that were considered as **contaminants should be removed also from the taxonomy table** and **fasta file**.
- Example of an input taxonomy table:
+------------+----------------+---------------+------------------+---------------------+
| | Order | Family | Genus | species |
+------------+----------------+---------------+------------------+---------------------+
| **OTU_01** | Lepidoptera | Nymphalidae | Aglais | Aglais urticae |
+------------+----------------+---------------+------------------+---------------------+
| **OTU_02** | Lepidoptera | Nymphalidae | Genus_0022 | Genus_0022_sp |
+------------+----------------+---------------+------------------+---------------------+
| **OTU_03** | Carnivora | Felidae | Felis | Felis catus |
+------------+----------------+---------------+------------------+---------------------+
| **OTU_04** | Sarcoptiformes | Pyroglyphidae | Dermatophagoides | Dermatophagoides_sp |
+------------+----------------+---------------+------------------+---------------------+
| ... | ... | ... | ... | ... |
+------------+----------------+---------------+------------------+---------------------+
In the below script, this table is represented as **tab-delimited** ``taxonomy_table.txt`` file.
- Example of an input fasta file:
.. admonition:: fasta file
.. code-block:: text
:class: small-font
>OTU_01
ACTTTATATTTTATTTTTGGAATTTGAGCAGGAATAGTAGGAACTTCTCTTAGTTTATTAATTCG...
>OTU_02
ATAGTAGGAACATCTCTTAGTTTATTAATTCGAACTGAACTAGGAAATCCAGGTTCACTTATTGG...
>OTU_03
TCTTTACCTTTTATTCGGTGCCTGAGCTGGCATGGTGGGGACTGCTCTTAGTCTTCTAATCCGGG...
>OTU_04
TACTTTGTATTTTGTTTTTGGGGTGTGATCTGGTATGTTGGGGACTAGGTTCAGAAGACTAATTC...
...
In the below script, this file is represented as ``OTUs.fasta``.
.. code-block:: R
:caption: remove contaminants from the taxonomy table
:linenos:
#!/usr/bin/Rscript
# load taxonomy table
taxonomy = read.table("taxonomy_table.txt",
header = TRUE,
sep = "\t",
row.names = 1)
# load list of contaminant OTUs
library(openxlsx)
contaminants = read.xlsx("contaminants.xlsx",
rowNames = FALSE,
colNames = TRUE)
# extract contaminant OTU names (assuming first column is header)
contaminants = contaminants[, 1]
#---------------------------#
### Filter taxonomy table ###
#---------------------------#
# filter out contaminant OTUs from taxonomy table
taxonomy_filtered = taxonomy[!rownames(taxonomy) %in% contaminants, ]
cat("Taxonomy table - Initial OTUs:", nrow(taxonomy), "\n")
cat("Taxonomy table - After filtering:", nrow(taxonomy_filtered), "\n")
cat("Removed", nrow(taxonomy) - nrow(taxonomy_filtered),
"contaminant OTUs from taxonomy table\n")
# write filtered taxonomy table
write.table(taxonomy_filtered,
"taxonomy_table_decontam.txt",
sep = "\t",
quote = FALSE,
row.names = TRUE,
col.names = NA)
#-----------------------#
### Filter fasta file ###
#-----------------------#
library(seqinr)
# load fasta file
OTUs_fasta = seqinr::read.fasta("OTUs.fasta")
# filter out contaminant OTUs from fasta file
OTUs_fasta_filtered = OTUs_fasta[!names(OTUs_fasta) %in% contaminants]
cat("\nFasta file - Initial sequences:", length(OTUs_fasta), "\n")
cat("Fasta file - After filtering:", length(OTUs_fasta_filtered), "\n")
cat("Removed", length(OTUs_fasta) - length(OTUs_fasta_filtered),
"contaminant sequences from fasta file\n")
# write filtered fasta file
write.fasta(sequences = OTUs_fasta_filtered,
names = names(OTUs_fasta_filtered),
file.out = "OTUs_decontam.fasta",
nbchar = 999)
.. admonition:: output
The final output data is ``OTU_table_decontam.txt``,
``taxonomy_table_decontam.txt`` and ``OTUs_decontam.fasta``.
Additionally, the list of contaminant OTUs is written to ``contaminants.xlsx``.
____________________________________________________
|logo_BGE_small| |eufund| |chfund| |ukrifund|