.. |eufund| image:: _static/eu_co-funded.png
:width: 200
:alt: Alternative text
.. |chfund| image:: _static/ch-logo-200x50.png
:width: 210
:alt: Alternative text
.. |ukrifund| image:: _static/ukri-logo-200x59.png
:width: 150
:alt: Alternative text
.. |logo_BGE_small| image:: _static/logo_BGE_alpha.png
:width: 120
:alt: Alternative text
:target: https://biodiversitygenomics.eu/
.. raw:: html
.. role:: red
Examine outputs
***************
The outputs of a metabarcoding workflow and their basic reading are described
in the :ref:`outputs section ` of :doc:`"What is a metabarcoding workflow?" <0what_is_metabarcoding>`.
In this section, we will examine the outputs of a metabarcoding workflow in relation to user
collected environmental **metadata**
to inspect basic patterns of species (metabarcoding features)
richness and composition across different environments.
Here, we illustrate the calculation of **species (OTU) richness**
and **composition** using the metabarcoding outputs from the
`BGE case study of high mountain systems `_
dataset, restricted here to **samples from Norway**, as an example.
The data are **Malaise trap** samples collected **weekly in 2023** over
**20 weeks** across **5 altitudes**.
We examine:
1. Was sequencing depth sufficient?
2. OTU richness patterns across sampled **altitudinal gradient** and **sampling season**.
___________________________________________________
Input data
==========
1. OTU table
2. Sample metadata
.. admonition:: example of an OTU table
+-------------+---------+---------+---------+---------+-----+
| | sample1 | sample2 | sample3 | sample4 | ... |
+-------------+---------+---------+---------+---------+-----+
| **OTU_01** | 579 | 0 | 0 | 0 | ... |
+-------------+---------+---------+---------+---------+-----+
| **OTU_02** | 405 | 345 | 449 | 430 | ... |
+-------------+---------+---------+---------+---------+-----+
| **OTU_03** | 0 | 0 | 231 | 69 | ... |
+-------------+---------+---------+---------+---------+-----+
| **OTU_04** | 0 | 62 | 345 | 0 | ... |
+-------------+---------+---------+---------+---------+-----+
.. admonition:: sample metadata
+-------------+---------+--------------+
| | site | altitude_(m) |
+-------------+---------+--------------+
| **sample1** | site_01 | 500 |
+-------------+---------+--------------+
| **sample2** | site_01 | 500 |
+-------------+---------+--------------+
| **sample3** | site_02 | 1000 |
+-------------+---------+--------------+
| **sample4** | site_02 | 1200 |
+-------------+---------+--------------+
.. code-block:: R
:caption: Load data
:linenos:
#!/usr/bin/Rscript
# Working directory and results directory paths
wd = "path/to/your/working/directory"
results = file.path(wd, "results")
# OTU table
OTU.table = file.path(wd, "OTU_table.csv")
# Sample metadata
sampleMetadata = file.path(wd, "metadata.csv")
_________________
Sequencing depth
================
Sequencing depth is the total number of sequences in a sample.
Although the amplicon library is pooled in equimolar concentrations prior to sequencing,
there can be still considerable variation in sequencing depth between samples.
If so, then for the statistical analyses, it is important to account for this variation.
One way to account for this variation is to normalise sequencing depth to a common value,
which can be achieved through rarefaction analysis.
Another way to account for this variation is to use sequencing depth per sample as a covariate
in the statistical analyses.
By including sequence counts per sample as a predictor,
differences in species richness or composition are evaluated
after statistically adjusting for variation in sequencing depth.
.. note::
With the bioinformatic processing (and post-processing steps),
we are trying to remove noise from the sequencing data,
but this inherently may filter out some true (low abundance) taxa.
Therefore, rarfaction analyses after the bioinformatic processing tries to
answer the question: "was sequencing depth sufficient to capture the taxa that pass our quality filters?",
not "was sequencing depth sufficient to capture all true taxa in the sample".
Here, we are examining if the sequencing depth per samples has been
sufficient, i.e., do the OTU accumulation curves flatten out (reach the asymptote)?
.. code-block:: R
:caption: Sequencing depth per sample
:linenos:
#!/usr/bin/Rscript
# Compute rarefaction curves for each sample
set.seed(1)
library(vegan)
rarecurve_data = rarecurve(
t(OTU_table), # transpose so samples are rows
step = 5000, # step size (seqs) in rarefaction curves
col = "steelblue",
label = FALSE
)
# Convert the list returned by rarecurve() into a tidy data frame
library(purrr)
library(dplyr)
library(tidyr)
rarecurve_data_to_ggplot =
map_dfr(rarecurve_data, bind_rows) %>%
bind_cols(Group = colnames(OTU_table), .) %>%
tidyr::pivot_longer(-Group) %>%
tidyr::drop_na() %>%
mutate(
n_seqs = as.numeric(stringr::str_replace(name, "N", ""))
) %>%
select(-name)
# Plot rarefaction curves (richness vs sequences) for all samples
p_rarefaction =
ggplot(data = rarecurve_data_to_ggplot,
aes(x = n_seqs, y = value, group = Group)) +
geom_line(alpha = 0.6, color = "steelblue") +
theme_minimal() +
labs(x = "Number of sequences",
y = "Number of OTUs",
title = "Rarefaction curves per sample") +
theme(
legend.text = element_text(size = 12),
legend.title = element_text(size = 14),
axis.text = element_text(size = 11),
axis.title = element_text(size = 12),
plot.title = element_text(size = 16)
)
# Save plot
ggsave(
filename = file.path(results, "rarefaction_curves.png"),
plot = p_rarefaction,
width = 8, height = 6, dpi = 300)
.. image:: _static/rarefaction_curves.png
:width: 600
:align: center
The resulting plot shows the OTU accumulation curves for each sample.
Most curves rise steeply at low read numbers and then level off,
indicating that additional sequencing yields
few (or no) new OTUs and that sequencing depth is generally
sufficient for capturing the diversity in these samples.
Now, let's confirm this by
calculating the slope of the rarefaction curve for each sample.
A **small slope** means that adding many more sequences yields
very few new OTUs (curve is flat),
so sequencing depth for that sample is likely sufficient;
a **larger slope** indicates the curve is still rising
and the sample may be under-sequenced.
.. code-block:: R
:caption: Check if samples reached the asymptote
:linenos:
#!/usr/bin/Rscript
### Calculate the slope of the rarefaction curve
# The slope between the last two points of each sample's curve indicates
# whether the curve has flattened (asymptote ≈ slope close to 0).
asymptote_info = rarecurve_data_to_ggplot %>%
group_by(Group) %>%
arrange(n_seqs) %>%
summarise(
last_depth = last(n_seqs),
last_richness = last(value),
prev_depth = nth(n_seqs, -2),
prev_richness = nth(value, -2),
slope = (last_richness - prev_richness) / (last_depth - prev_depth)
)
## Define a slope threshold for asymptote
threshold = 0.001 # ≈ 5 new OTUs per 5,000 additional sequences
# threshold = 0.0002 # more conservative: ≈ 1 new OTU per 5,000 sequences
# Mark samples whose curves have effectively reached asymptote
asymptote_info = asymptote_info %>%
mutate(
at_asymptote = abs(slope) < threshold # or use (last_richness - prev_richness) < 5
)
# How many samples reached asymptote?
sum(asymptote_info$at_asymptote)
#-> 88 samples reached to asymptote according to our threshold
# Show samples that did NOT reach asymptote
not_at_asymptote = asymptote_info %>%
filter(!at_asymptote) %>%
select(Group, last_depth, last_richness, slope)
not_at_asymptote
#-> Group last_depth last_richness slope
# BGE-HMS1295 8196 249 0.00108
# Summary statistics of sequences per sample
summary(asymptote_info$last_depth)
#-> Min. 1st Qu. Median Mean 3rd Qu. Max.
# 8196 575683 675578 716605 835014 1435769
Based on our threshold
(5 new OTUs per 5,000 additional sequences),
sample ``BGE-HMS1295`` fails to reach the asymptote.
Indeed, we see that this sample has relatively low sequencing
depth (8,196 sequences)
compared to the median (675,578 sequences) and mean (716,605 sequences).
Because, the sample ``BGE-HMS1295`` has cleary failed,
we will remove it from the analysis.
.. code-block:: R
:caption: Discard poor samples
:linenos:
#!/usr/bin/Rscript
# Remove sample BGE-HMS1295 from the analysis
remove_samples = c("BGE-HMS1295")
OTU_table = OTU_table[, -which(colnames(OTU_table) %in% remove_samples)]
# Check the dimensions of the OTU table
dim(OTU_table)
# -> 5839 OTUs, 88 samples (after removing sample BGE-HMS1295)
# Remove OTUs with zero sequences
OTU_table = OTU_table[rowSums(OTU_table) > 0, ]
dim(OTU_table)
# -> 5834 OTUs (after removing OTUs with zero sequences)
## Beause we removed a sample and OTUs with zero sequences,
## we need to update the sample metadata
sample_metadata = sample_metadata[colnames(OTU_table), ]
#-------------------------------#
## Sequence summary statistics ##
# Assess variability in sequencing depth per sample
sequences_per_sample <- colSums(OTU_table)
mean_depth <- mean(sequences_per_sample)
sd <- sd(sequences_per_sample)
cv <- sd_depth / mean_depth # coefficient of variation
range <- range(sequences_per_sample)
cat("\nMean depth:", round(mean_depth), "\n")
#-> Mean depth: 724655
cat("Standard deviation:", round(sd), "\n")
#-> Standard deviation: 218800
cat("Coefficient of variation:", round(cv, 3), "\n")
#-> Coefficient of variation: 0.302
cat("Range (min, max):", paste(range, collapse = " - "), "\n")
#-> Range (min, max): 378741 - 1435769
We have removed the sample with low sequencing depth.
The remaining samples have a mean sequencing depth of 724,655 sequences.
The number of sequences per sample range from 378,741 to 1,435,769 sequences,
and the coefficient of variation is 0.302. Latter means
that the standard deviation of read depth is about 30%
of the mean sequencing depth.
**Sequencing effort differs moderately between samples**, thus
it would be reasonable to **include sequencing depth as a covariate**
in the statistical analyses.
Below, we plot the sequencing depth per sample by sampling week.
We see that the **sequencing depth decreases with the sampling week**
(towards autumn).
.. code-block:: R
:caption: Sequencing depth per sample by sampling week
:linenos:
#!/usr/bin/Rscript
# Build sample_richness dataframe for downstream analyses
library(dplyr)
sample_richness <- tibble(
sample = colnames(OTU_table),
richness = colSums(OTU_table > 0)
) %>%
left_join(
sample_metadata %>%
as.data.frame() %>%
tibble::rownames_to_column("sample"),
by = "sample"
) %>%
mutate(sequencing_depth = as.numeric(colSums(OTU_table)[sample]))
## Plot sequencing depth per sample by sampling week
p = ggplot(sample_richness,
aes(x = week,
y = sequencing_depth)) +
geom_point(aes(color = sequencing_depth),
size = 2, alpha = 0.8) +
scale_color_viridis_c(option = "plasma",
name = "Sequencing depth") +
geom_smooth(method = "lm",
formula = y ~ poly(x, 2),
aes(group = 1),
color = "darkgreen") +
labs(
title = "Sequencing depth per sample by sampling week",
x = "Week",
y = "Number of sequences"
) +
theme_bw()
.. image:: _static/seqDepth_by_week.png
:width: 600
:align: center
If we would like to test if the
OTU richness differs between sampling seasons (spring, summer, autumn),
then we would need to **include sequencing depth as a covariate**
in the statistical analyses to
**separate the effect of sequencing depth
from the effect of sampling season** — even though all samples showed
sufficient sequencing depth (rarefaction curves reached asymptote).
If we would not include sequencing depth as a covariate,
then the effect of sampling season would be confounded by
the effect of sequencing depth (even when all samples reached asymptote,
samples with higher depth can still have slightly higher observed OTU richness,
so when depth differs between seasons the raw season effect on richness
mixes biology with that technical effect).
_________________
Richness
=========
Let's further examine the number of OTUs across
the altitudinal gradient.
.. code-block:: R
:caption: Total OTU richness per altitude
:linenos:
#!/usr/bin/Rscript
library(tidyverse)
sample_names = colnames(OTU_table)
# OTU richness for a given altitude
richness_for_alt = function(alt) {
these_samples = rownames(sample_metadata)[sample_metadata$altitude == alt]
these_samples = intersect(these_samples, sample_names)
if (length(these_samples) == 0) return(0L)
mat = as.matrix(OTU_table[, these_samples, drop = FALSE])
sum(rowSums(mat > 0) > 0)
}
# unique altitudes in sample metadata, sorted for plotting
alts = sort(unique(sample_metadata$altitude))
# create a dataframe with altitude and OTU richness
richness_by_altitude = tibble(
altitude = alts,
n_species = vapply(alts, richness_for_alt, integer(1))
)
# plot total richness vs altitude
p_tot = ggplot(richness_by_altitude,
aes(x = altitude, y = n_species, group = 1)) +
geom_line(aes(color = "Linear trend (lm)"), linewidth = 1) +
geom_smooth(aes(color = "Linear trend (lm)"),
method = "lm", se = TRUE,
linewidth = 0.9, show.legend = FALSE) +
geom_point(aes(color = "Altitude-specific richness"), size = 4) +
geom_text(aes(label = n_species),
vjust = -0.8, size = 3, color = "black") +
scale_color_manual(
name = "",
values = c(
"Altitude-specific richness" = "black",
"Linear trend (lm)" = "lightblue"
)
) +
labs(
title = "Total OTU richness across altitude",
x = "Altitude (m)",
y = "OTU richness"
) +
theme_bw(base_size = 12) +
theme(
plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank(),
legend.position = "bottom"
)
# Save plot
ggsave(
filename = file.path(results, "richness_by_altitude.png"),
plot = p_tot,
width = 5, height = 4.5, dpi = 300)
.. image:: _static/richness_by_altitude.png
:width: 500
:align: center
The resulting pattern suggests that total OTU richness
tends to decline towards higher elevations.
This is an expected pattern,
as environmental stress increases with altitude, especially in the
higher latitude regions (such as Norway in that example dataset).
Let's test this relationship more formally.
.. code-block:: R
:caption: Richness by altitude with linear and quadratic fits
:linenos:
#!/usr/bin/Rscript
# Use Generalized Linear Model (GLM) with negative binomial
# distribution to model the richness data.
# Negative binomial distribution is a common choice for count data
# with overdispersion (variance > mean).
library(MASS)
# log10 transform the sequencing_depth
sample_richness$log_sequencing_depth =
log10(sample_richness$sequencing_depth)
# Fit linear and quadratic models (with altitude and sequencing depth)
glm_richness_linear = glm.nb(
richness ~ altitude + log_sequencing_depth,
data = sample_richness
)
glm_richness_quad = glm.nb(
richness ~ poly(altitude, 2) + log_sequencing_depth,
data = sample_richness
)
# Compare: is quadratic better than linear?
anova(glm_richness_linear, glm_richness_quad, test = "Chisq")
# -> Pr(Chi = 0.0006887357
# Quadratic model summary (coefficients on log(mean richness) scale)
summary(glm_richness_quad)
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 0.9717 1.9283 0.504 0.614320
# poly(altitude, 2)1 -0.8119 0.3778 -2.149 0.031645 *
# poly(altitude, 2)2 -1.3591 0.3765 -3.610 0.000307 ***
# log_sequencing_depth 0.8727 0.3300 2.645 0.008175 **
## Deviance-based pseudo-R2
pseudo_R2_richness_quad =
1 - (glm_richness_quad$deviance / glm_richness_quad$null.deviance)
cat("Pseudo-R2 (glm_richness_quad): ", round(pseudo_R2_richness_quad, 4), "\n")
#-> Pseudo-R2 (glm_richness_quad): 0.2002
# Plot richness vs altitude with linear and quadratic fits
library(ggplot2)
p = ggplot(sample_richness,
aes(x = altitude, y = richness)) +
geom_point(size = 3, alpha = 0.8) +
geom_smooth(method = "lm", aes(linetype = "Linear"),
color = "gray", se = F) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2),
aes(linetype = "Quadratic"),
color = "darkgreen", se = TRUE) +
scale_linetype_manual(name = "Trend",
values = c(Linear = "dashed",
Quadratic = "solid")) +
labs(
title = "OTU richness by altitude",
x = "Altitude (m)",
y = "OTU richness"
) +
theme_bw()
ggsave(
filename = file.path(results, "richness_by_altitude2.png"),
plot = p,
width = 6, height = 4.5, dpi = 300
)
.. image:: _static/richness_by_altitude2.png
:width: 600
:align: center
A negative-binomial GLM of OTU richness on altitude and sequencing depth
was significantly improved by adding a quadratic
altitude term (likelihood-ratio test *p* ≈ 0.0007).
That is, **OTU richness changes significantly across the altitudinal gradient**,
in a curved rather than linear way.
The model explains about 20% of the null deviance in richness
(pseudo-R2 = 0.20),
so other potential drivers
(e.g. habitat, season) are not captured in this model.
As we can notice from the plot above (and below), the per altitude richness
is highly variable, and that originates from the sampling
across 20 weeks at each altitude.
Thus, let's include seasonality into the model.
.. image:: _static/seasonal_richness_by_altitude.png
:width: 600
:align: center
.. code-block:: R
:caption: Model with altitude, sequencing depth, and seasonality
:linenos:
#!/usr/bin/Rscript
library(MASS)
# GLM for OTU richness with altitude, sequencing depth, and seasonality
glm_season = glm.nb(
richness ~ poly(altitude, 2) +
log_sequencing_depth +
cos(2 * pi * week / 52) + # annual cycle as a cosine function
sin(2 * pi * week / 52), # annual cycle as a sin function
data = sample_richness
)
summary(glm_season)
#->
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.11097 1.45721 1.449 0.147439
# poly(altitude, 2)1 -1.04241 0.27319 -3.816 0.000136 ***
# poly(altitude, 2)2 -1.30751 0.26214 -4.988 6.11e-07 ***
# log_sequencing_depth 0.47899 0.25160 1.904 0.056946 .
# cos(2 * pi * week/52) -1.18471 0.11981 -9.888 < 2e-16 ***
# sin(2 * pi * week/52) -0.78676 0.09008 -8.734 < 2e-16 ***
#
## Compare glm_season to model without seasonal terms
# Deviance-based pseudo-R2:
pseudoR2_quad = 1 - (glm_richness_quad$deviance / glm_richness_quad$null.deviance)
pseudoR2_season = 1 - (glm_season$deviance / glm_season$null.deviance)
cat("Pseudo-R2 (quad):", pseudoR2_quad)
#-> Pseudo-R2 (quad): 0.2001701
cat("Pseudo-R2 (season):", pseudoR2_season)
#-> Pseudo-R2 (season): 0.6192453
## AIC comparison
AIC_quad = AIC(glm_richness_quad)
AIC_season = AIC(glm_season)
cat("AIC (quad):", AIC_quad)
#-> AIC (quad): 1147.558
cat("AIC (season):", AIC_season)
#-> AIC (season): 1085.174
## Compare using anova test (Likelihood Ratio Test)
anova(glm_richness_quad, glm_season, test = "Chisq")
#-> Pr(Chi) = 3.885781e-15
In addition to altitude,
the **seasonality** (cosine and sine of week) is a significant driver
of insect OTU richness (*p* < 0.0001).
Comapred with model that did not have
seasonality terms included, the model with seasonality
had much higher pseudo-R2 (0.619 vs 0.200); indicating a substantial
**increase in the model's explanatory power**.
This increase was not simply due to overfitting, as the likelihood-ratio test
shows that the model with seasonality fits significantly better than
the model without it (*p* < 0.0001), had lower AIC (1085 vs 1148).
_________________
|logo_BGE_small| |eufund| |chfund| |ukrifund|