Arthropods/COI

This is executable step-by-step pipeline for COI amplicon data from Illumina sequencing machine. The bioinformatic workflow results in amplicon sequence variants (ASVs) and well as operational taxonomic units (OTUs).

Note

The full bioinformatics workflow can be automatically run through PipeCraft2, a software package which represents a graphical user interface wrapper for all the bioinformatic steps below. See the example workflow for COI.

Citation of the pipeline

When using this pipeline, please cite as: Anslan, S, Bahram, M, Hiiesalu, I, Tedersoo, L. PipeCraft: Flexible open-source toolkit for bioinformatics analysis of custom high-throughput amplicon sequencing data. Mol Ecol Resour. 2017; 17: e234-e240. https://doi.org/10.1111/1755-0998.12692

Also, please cite the original resources of the wrapped software.

Dependencies

Process	Software	Version
Remove primers	cutadapt	4.9
Quality filtering	DADA2	1.30
Denoise	DADA2	1.30
Merge paired-end reads	DADA2	1.30
Chimera filtering	DADA2	1.30
Remove tag-jumps	UNCROSS2
Merge sequencing runs*	DADA2	1.30
Taxonomy assignment	vsearch	2.28.1
Get target taxa	R
Remove NUMTs	metaMATE	0.4.3
Clustering ASVs to OTUs	vsearch	2.28.1
Post-clusteringlustering	BLAST	2.12.0+
Post-clusteringlustering	LULU	0.1.0+

*only applicable when there are multiple sequencing runs per study.

Note

All the dependencies/software of the pipeline are available on a Docker image.

Download Docker for windows
Download Docker for Mac
Install Docker for Linux - follow the guidelines under appropriate Linux distribution

get the Docker image

docker pull pipecraft/bioscanflow:2

example of running the pipeline via Docker image

# run docker
 # specify the files location with -v flag  ($PWD = the current working directory)
docker run -i --tty -v $PWD/:/Files pipecraft/bioscanflow:2

# inside the container, the files are accessible in the /Files dir
cd Files

# checking if cutadapt is available
cutadapt -h

# ready to run the pipe as below ...
 ## make sure that via the shared folder (-v) path you have access also to the reference databases.

convert docker image to apptainer (singularity) image for running the pipeline in HPC

 # pull the docker image and save it as a singularity image
 apptainer pull --name bioscanflow_2.sif docker://pipecraft/bioscanflow:2

 # test if eg cutadapt is available
 apptainer exec bioscanflow_2.sif cutadapt -h

 # run the script via singularity image
 apptainer exec --bind $PWD:/Files bioscanflow_2.sif /Files/run_pipe.sh

Data structure

Multiple sequencing runs

Important

When aiming to combine samples from multiple sequencing runs, then follow the below directory structure

Directory structure:

/multiRunDir (directory names can be changed)
├── /sequencing_set01
│   ├── sample1.R1.fastq
│   ├── sample1.R2.fastq
│   ├── sample2.R1.fastq
│   ├── sample2.R2.fastq
│   ├── …
├── /sequencing_set02
│   ├── sampleA.R1.fastq
│   ├── sampleA.R2.fastq
│   ├── sampleB.R1.fastq
│   ├── sampleB.R2.fastq
│   ├── …
└── /sequencing_set03
├── sample11.R1.fastq
├── sample11.R2.fastq
├── sample12.R1.fastq
├── sample12.R2.fastq
├── …

Note

Fastq files with the same name will be considered as the same sample and will be merged in the “Merge sequencing runs” step.

Single sequencing run

When working with a single directory that hosts your fastq files, then
ignore (do not execute) the script lines in yellow.

Remove primers

Remove primer strings from paired-end data.

When working with a single directory that hosts your fastq files, then
ignore (do not execute) the script lines in yellow.

Note

Here, assuming that all sequences are in 5’-3’ orientation! (3’-5’ orient sequences will be discarded with this workflow)

Important

- Paired-end files must contain “R1” and “R2” strings (not just _1 and _2)!

- Sample names must not contain “R1” and “R2” strings (i.e. not FR123_001_R1.fastq/FR123_001_R2.fastq)

remove primers with cutadapt

 #!/bin/bash
 ## workflow to remove primers with cutadapt

 # specify the identifier string for the R1 files
 read_R1="_R1"

 # specify primers
 fwd_primer=$"CCHGAYATRGCHTTYCCHCG" # this is the forward primer BF3
 rev_primer=$"CDGGRTGNCCRAARAAYCA"  # this is the reverse primer BR2

 # specify primers
 #fwd_primer=$"GGWACWRGWTGRACWITITAYCCYCC" # this is the forward primer mICOIintF-XT
 #rev_primer=$"TAIACYTCIGGRTGICCRAARAAYCA"  # this is the reverse primer jgHCO2198

 # specify primers
 #fwd_primer=$"GGDACWGGWTGAACWGTWTAYCCHCC" # this is the forward primer FwhF2
 #rev_primer=$"GTRATWGCHCCDGCTARWACWGG"  # this is the reverse primer FwhR2n

 # edit primer trimming settings
 mismatches="2"    # Numer of allowed mismatches in primer string search;
                     # if set as 1, then allow 1 mismatch;
                     # if set as 0.1, then allow mismatch in 10% of the bases.
 overlap="22"      # The minimum overlap length. Keep it nearly as high
                     # as the primer length to avoid short random matches.
 pair_filter="any" # Option 'any' discards a read pair if primers are not found in
                     # either of the read pairs (R1 and R2).
                     # Option 'both' keeps the read pair if a primer is found in
                     # at least one of the read pairs.
 ##
 # get the reverse complementary of the primers
     # needed when the amplicon length is shorter than the sequencing cycle
 fwd_primer_rc=$(echo $fwd_primer | rev | tr "ACGTRYKMBDHV" "TGCAYRMKVHDB")
 rev_primer_rc=$(echo $rev_primer | rev | tr "ACGTRYKMBDHV" "TGCAYRMKVHDB")

 # get directory names if working with multiple sequencing runs
     # in that case, my working folder = /multiRunDir (see dir structure above)
 DIRS=$(ls -d *) # -> sequencing_set01 sequencing_set02 sequencing_set03

 for sequencing_run in $DIRS; do
     printf "\nWorking with $sequencing_run \n"
     cd $sequencing_run
     #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
     # make output dirs
     mkdir -p primersCut_out
     mkdir -p primersCut_out/untrimmed

     ### Clip primers with cutadapt
     for inputR1 in *$read_R1*; do
         inputR2=$(echo $inputR1 | sed -e 's/R1/R2/')
         cutadapt --quiet \
         -e $mismatches \
         --minimum-length 32 \
         --overlap $overlap \
         --no-indels \
         --cores=0 \
         --untrimmed-output primersCut_out/untrimmed/$inputR1 \
         --untrimmed-paired-output primersCut_out/untrimmed/$inputR2 \
         --pair-filter=$pair_filter \
         -g $fwd_primer \
         -a $fwd_primer...$rev_primer_rc";optional" \
         -G $rev_primer \
         -A $rev_primer...$fwd_primer_rc";optional" \
         -o primersCut_out/$inputR1 \
         -p primersCut_out/$inputR2 \
         $inputR1 $inputR2
     done
     #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
     cd ..
 done

Quality filtering

Quality filtering of the fastq files based on the allowed maximum error rate per sequence (as in DADA2).

When working with a single directory that hosts your fastq files, then
ignore (do not execute) the script lines in yellow.

quality filtering in DADA2 (in R)

 #!/usr/bin/Rscript
 ## workflow to perform quality filtering within DADA2

 #load dada2 library
 library('dada2')

 # specify the identifier string for the R1 files
 read_R1 = "_R1"

 # get the identifier string for the R2 files
 read_R2 = gsub("R1", "R2", read_R1)

 # capturing the directory structure when working with multiple runs
 wd = getwd() # -> wd is "~/multiRunDir"
 dirs = list.dirs(recursive = FALSE)
 for (i in 1:length(dirs)) {
     if(length(dirs) > 1) {
         setwd(dirs[i])
         print(paste0("Working with ", dirs[i]))
         #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
         # output path
         path_results = "qualFiltered_out"
         # input and output file paths
         R1s = sort(list.files("primersCut_out", pattern = read_R1, full.names = TRUE))
         R2s = sort(list.files("primersCut_out", pattern = read_R2, full.names = TRUE))
         #sample names
         sample_names = sapply(strsplit(basename(R1s), read_R1), `[`, 1)

         # filtered files path
         filtR1 = file.path(path_results, paste0(sample_names, ".R1.", "fastq.gz"))
         filtR2 = file.path(path_results, paste0(sample_names, ".R2.", "fastq.gz"))
         names(filtR1) = sample_names
         names(filtR2) = sample_names

         #quality filtering
         qfilt = filterAndTrim(R1s, filtR1, R2s, filtR2,
                             maxN = 0,            # max number of allowed N bases.
                             maxEE = c(2, 2),     # max error rate per R1 and R2 read, respectively.
                             truncQ = 2,          # truncate reads at the first instance of a quality score less than or equal to specified value.
                             truncLen = c(0, 0),  # truncate reads after specified length for R1 and R2 reads, respectively.
                             maxLen = 600,        # discard reads longer than specified.
                             minLen = 100,        # discard reads shorter than specified.
                             minQ = 2,            # discard reads (after truncation) that contain a quality score below specified value.
                             matchIDs = TRUE,     # output paired-end reads with matching IDs (for merging).
                             compress = TRUE,     # gzip the output
                             multithread = TRUE)  # use multiple threads
         saveRDS(qfilt, file.path(path_results, "qfilt_reads.rds"))

         # make sequence count report
         seq_count = cbind(qfilt)
         colnames(seq_count) = c("input", "qualFiltered")
         seq_count = as.data.frame(seq_count)
         seq_count$sample = sample_names
         # reorder columns
         seq_count = seq_count[, c("sample", "input", "qualFiltered")]
         write.csv(seq_count, file.path(path_results, "seq_count_summary.csv"),
                             row.names = FALSE, quote = FALSE)

         # save filtered R objects for denoising and merging (below)
         filtR1 = sort(list.files(path_results, pattern = ".R1.fastq.gz", full.names = TRUE))
         filtR2 = sort(list.files(path_results, pattern = ".R2.fastq.gz", full.names = TRUE))
         sample_names = sapply(strsplit(basename(filtR1), ".R1.fastq.gz"), `[`, 1)
         saveRDS(filtR1, file.path(path_results, "filtR1.rds"))
         saveRDS(filtR2, file.path(path_results, "filtR2.rds"))
         saveRDS(sample_names, file.path(path_results, "sample_names.rds"))
         #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
         #set working directory back to "/multiRunDir"
         setwd(wd)
     i = i + 1
     }
 }

Denoise and merge paired-end reads

Denoise and merge paired-end Illumina reads as in DADA2.

When working with a single directory that hosts your fastq files, then
ignore (do not execute) the script lines in yellow.

denoise and merge paired-end reads in DADA2

 #!/usr/bin/Rscript
 ## workflow to perform DADA2 denoising and merging

 # load dada2 library
 library('dada2')

 # capturing the directory structure when working with multiple runs
 wd = getwd() # -> wd is "~/multiRunDir"
 dirs = list.dirs(recursive = FALSE)
 for (i in 1:length(dirs)) {
     if(length(dirs) > 1) {
         setwd(dirs[i])
         print(paste0("Working with ", dirs[i]))
         #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
         #load quality filtered files
         filtR1 = readRDS("qualFiltered_out/filtR1.rds")
         filtR2 = readRDS("qualFiltered_out/filtR2.rds")
         qfilt = readRDS("qualFiltered_out/qfilt_reads.rds")
         sample_names = readRDS("qualFiltered_out/sample_names.rds")
         cat("\n sample names = ", sample_names, "\n ")
         names(filtR1) = sample_names
         names(filtR2) = sample_names

         # create output dir
         path_results = "denoised_merged"
         dir.create(path_results, showWarnings = FALSE)

         print("# Denoising ...")
         # learn the error rates
         errF = learnErrors(filtR1, multithread = TRUE)
         errR = learnErrors(filtR2, multithread = TRUE)

         # make error rate figures
         pdf(file.path(path_results, "Error_rates_R1.pdf"))
           print( plotErrors(errF) )
         dev.off()
         pdf(file.path(path_results, "Error_rates_R2.pdf"))
           print( plotErrors(errR) )
         dev.off()

         # Sample inference and merger of paired-end reads
         mergers = vector("list", length(sample_names))
         names(mergers) = sample_names
         for(sample in sample_names) {
           cat("\n -- Processing:", sample, "\n")
           derepF = derepFastq(filtR1[[sample]])
           ddF = dada(derepF, err = errF, multithread = TRUE)
           derepR = derepFastq(filtR2[[sample]])
           ddR = dada(derepR, err = errR, multithread = TRUE)
           merger = mergePairs(ddF, derepF, ddR, derepR)
           mergers[[sample]] = merger
         }
         rm(derepF); rm(derepR)
         gc()
         saveRDS(mergers, (file.path(path_results, "mergers.rds")))

         # make sequence table
         ASV_tab = makeSequenceTable(mergers)
         #write RDS object
         saveRDS(ASV_tab, (file.path(path_results, "rawASV_table.rds")))

         # make sequence count report
         getN = function(x) sum(getUniques(x))
         #remove 0 seqs samples from qfilt statistics
         row_sub = apply(qfilt, 1, function(row) all(row !=0 ))
         qfilt = qfilt[row_sub, ]
         seq_count = cbind(qfilt, sapply(mergers, getN))
         colnames(seq_count) = c("input", "qualFiltered", "denoised_and_merged")
         rownames(seq_count) = sample_names
         write.csv(seq_count, file.path(path_results, "seq_count_summary.csv"),
                                 row.names = TRUE, quote = FALSE)
         #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
         print("--------")
         setwd(wd)
     i = i + 1
     }
 }

Chimera filtering

Remove putative chimeras with DADA2 ‘consensus’ mode.

When working with a single directory that hosts your fastq files, then
ignore (do not execute) the script lines in yellow.

remove chimeras in DADA2

 #!/usr/bin/Rscript
 ## workflow to perform chimera filtering within DADA2

 # load libraries
 library('dada2')
 library('openssl')

 # chimera filtering method
 method = "consensus"

 # collapse ASVs that have no mismatshes or internal indels (identical up to shifts and/or length)
 collapseNoMismatch = "true"  #true/false

 # capturing the directory structure when working with multiple runs
 wd = getwd() # -> wd is "~/multiRunDir"
 dirs = list.dirs(recursive = FALSE)
 for (i in 1:length(dirs)) {
     if(length(dirs) > 1) {
         setwd(dirs[i])
         print(paste0("Working with ", dirs[i]))
         #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
         # load denoised and merged ASVs
         rawASV_table = readRDS("denoised_merged/rawASV_table.rds")
         # create output dir
         path_results="ASV_table"
         dir.create(path_results, showWarnings = FALSE)
         # Remove chimeras
         print("Removing chimeric ASVs ...")
         chim_filt = removeBimeraDenovo(
                             rawASV_table, method = method,
                             multithread = TRUE,
                             verbose = TRUE)
         saveRDS(chim_filt, "ASV_table/chim_filt.rds")

         ### format and save ASV table and ASVs.fasta files
         # sequence headers
         asv_seqs = colnames(chim_filt)
         asv_headers = openssl::sha1(asv_seqs)
         # transpose sequence table
         tchim_filt = t(chim_filt)
         # add sequences to 1st column
         tchim_filt = cbind(row.names(tchim_filt), tchim_filt)
         colnames(tchim_filt)[1] = "Sequence"
         # row names as sequence headers
         row.names(tchim_filt) = asv_headers
         # write ASVs.fasta to path_results
         asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs))
         write(asv_fasta, file.path(path_results, "ASVs.fasta"))
         # write ASVs table to path_results
         write.table(tchim_filt, file.path(path_results, "ASV_table.txt"),
                                 sep = "\t", col.names = NA,
                                 row.names = TRUE, quote = FALSE)

         ### collapse ASVs that have no mismatshes or internal indels
                             # (identical up to shifts and/or length)
         if (collapseNoMismatch == "true") {
             print("Collapsing identical ASVs ...")
             ASV_tab_collapsed = collapseNoMismatch(chim_filt,
                                 minOverlap = 20, orderBy = "abundance",
                                 identicalOnly = FALSE, vec = TRUE,
                                 band = -1, verbose = TRUE)
             saveRDS(ASV_tab_collapsed, file.path(path_results, "ASV_table_collapsed.rds"))

             ### format and save ASV table and ASVs.fasta files
             # sequence headers
             asv_seqs = colnames(ASV_tab_collapsed)
             asv_headers = openssl::sha1(asv_seqs)
             # transpose sequence table
             tASV_tab_collapsed = t(ASV_tab_collapsed)
             # add sequences to 1st column
             tASV_tab_collapsed = cbind(row.names(tASV_tab_collapsed), tASV_tab_collapsed)
             colnames(tASV_tab_collapsed)[1] = "Sequence"
             #row names as sequence headers
             row.names(tASV_tab_collapsed) = asv_headers
             # write ASVs.fasta to path_results
             asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs))
             write(asv_fasta, file.path(path_results, "ASVs_collapsed.fasta"))
             # write ASVs table to path_results
             write.table(tASV_tab_collapsed, file.path(path_results, "ASVs_table_collapsed.txt"),
                                     sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE)

             # print summary
             print(paste0("Output = ", length(colnames(ASV_tab_collapsed)),
                             " chimera filtered (+collapsed) ASVs, with a total of ",
                             sum(rowSums(ASV_tab_collapsed)),
                             " sequences."))
             print("--------")
         } else {
             # print summary
             print(paste0("Output = ", length(colnames(chim_filt)),
                             " chimera filtered ASVs, with a total of ",
                             sum(rowSums(chim_filt)),
                             " sequences."))
             print("--------")
         }
                 #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
         setwd(wd)
     i = i + 1
     }
 }

Remove tag-jumps

Remove putative tag-jumps with UNCROSS2.

When working with a single directory that hosts your fastq files, then
ignore (do not execute) the script lines in yellow.

removing putative tag-jumps with UNCROSS2 method

#!/usr/bin/Rscript
## Script to perform tag-jump removal; (C) Vladimir Mikryukov,
                                          # edit, Sten Anslan

 # load libraries
 library(data.table)

 # set parameters
 set_f = 0.03 # f-parameter of UNCROSS (e.g., 0.03)
 set_p = 1    # p-parameter (e.g., 1.0)

 # output dir
 path_results="ASV_table"

 # capturing the directory structure when working with multiple runs
 wd = getwd() # -> wd is "~/multiRunDir"
 dirs = list.dirs(recursive = FALSE)
 for (i in 1:length(dirs)) {
     if(length(dirs) > 1) {
         setwd(dirs[i])
         print(paste0("Working with ", dirs[i]))
         #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
         # load ASV table
         # loading ASV_table_collapsed if collapseNoMismatch was "true" (above)
         if (file.exists("ASV_table/ASV_table_collapsed.rds") == TRUE) {
             tab = readRDS("ASV_table/ASV_table_collapsed.rds")
             cat("input table = ASV_table/ASV_table_collapsed.rds\n")
         } else { # loading chimera filtered ASV table
           tab = readRDS("ASV_table/chim_filt.rds")
           cat("input table = ASV_table/chim_filt.rds\n")
         }

         # format ASV table
         ASVTABW = as.data.table(t(tab), keep.rownames = TRUE)
         colnames(ASVTABW)[1] = "ASV"
         # convert to long format
         ASVTAB = melt(data = ASVTABW, id.vars = "ASV",
         variable.name = "SampleID", value.name = "Abundance")
         # remove zero-OTUs
         ASVTAB = ASVTAB[ Abundance > 0 ]
         # estimate total abundance of sequence per plate
         ASVTAB[ , Total := sum(Abundance, na.rm = TRUE), by = "ASV" ]

         ## UNCROSS score
         uncross_score = function(x, N, n, f = 0.01, tmin = 0.1, p = 1){
           z = f * N / n               # Expected treshold
           sc = 2 / (1 + exp(x/z)^p)   # t-score
           res = data.table(Score = sc, TagJump = sc >= tmin)
           return(res)
         }

         # esimate UNCROSS score
         cat(" estimating UNCROSS score\n")
         ASVTAB = cbind(
           ASVTAB,
           uncross_score(
             x = ASVTAB$Abundance,
             N = ASVTAB$Total,
             n = length(unique(ASVTAB$SampleID)),
             f = as.numeric(set_f),
             p = as.numeric(set_p)
             )
           )
         cat(" number of tag-jumps: ", sum(ASVTAB$TagJump, na.rm = TRUE), "\n")

         # tag-jump stats
         TJ = data.table(
             Total_reads = sum(ASVTAB$Abundance),
             Number_of_TagJump_Events = sum(ASVTAB$TagJump),
             TagJump_reads = sum(ASVTAB[ TagJump == TRUE ]$Abundance, na.rm = T))

         TJ$ReadPercent_removed = with(TJ, (TagJump_reads / Total_reads * 100))
         fwrite(x = TJ, file = "ASV_table/TagJump_stats.txt", sep = "\t")

         # prepare ASV tables, remove tag-jumps
         ASVTAB = ASVTAB[ TagJump == FALSE ]
         # convert to wide format
         RES = dcast(data = ASVTAB,
           formula = ASV ~ SampleID,
           value.var = "Abundance", fill = 0)
         # sort rows (by total abundance)
         clz = colnames(RES)[-1]
         otu_sums = rowSums(RES[, ..clz], na.rm = TRUE)
         RES = RES[ order(otu_sums, decreasing = TRUE) ]

         # output table that is compadible with dada2
         output = as.matrix(RES, sep = "\t", header = TRUE, rownames = 1,
                                 check.names = FALSE, quote = FALSE)
         output = t(output)
         saveRDS(output, ("ASV_table/ASV_table_TagJumpFiltered.rds"))

         ### format and save ASV table and ASVs.fasta files
         # sequence headers
         asv_seqs = colnames(output)
         asv_headers = openssl::sha1(asv_seqs)
         # transpose sequence table
         toutput = t(output)
         # add sequences to 1st column
         toutput = cbind(row.names(toutput), toutput)
         colnames(toutput)[1] = "Sequence"
         #row names as sequence headers
         row.names(toutput) = asv_headers
         # write ASVs.fasta to path_results
         asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs))
         write(asv_fasta, file.path(path_results, "ASVs_TagJumpFiltered.fasta"))
         # write ASVs table to path_results
         write.table(toutput, file.path(path_results, "ASV_table_TagJumpFiltered.txt"),
                                 sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE)

         # print summary
         print(paste0("Output = ", length(colnames(output)), " ASVs, with a total of ",
                                     sum(rowSums(output)), " sequences."))

         #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#
         print("--------")
         setwd(wd)
     i = i + 1
     }
 }

Merge sequencing runs

If previous processing was applied on multiple sequencing runs , then here,
merge those sequenceing runs to form a single, unified ASV table.
Assuming that tag-jump filtering was performed (inputs = ASV_table_TagJumpFiltered.rds)

merge ASV tables from multiple sequencing runs

 #!/usr/bin/Rscript
 ## Merge sequencing runs, if working with multiple ones

 # load libraries
 library('dada2')

 # after merging multiple ASV tables ...
     # collapse ASVs that have no mismatshes or internal indels
 collapseNoMismatch = "true"  #true/false

 # capturing the directory structure when working with multiple runs
 wd = getwd() # -> wd is "~/multiRunDir"
 dirs = list.dirs(recursive = FALSE)
 tables = c()
 # load tables from multiple sequencing runs (dirs)
 for (i in 1:length(dirs)) {
     if(length(dirs) > 1) {
         setwd(dirs[i])
         tables = append(tables, print(file.path(paste0(dirs[i], "/ASV_table"),
                                             "ASV_table_TagJumpFiltered.rds")))
         setwd(wd)
     i = i + 1
     }
 }

 # Merge multiple ASV tables
 print("# Merging multiple ASV tables ...")
 ASV_tables = lapply(tables, readRDS)
 merged_table = mergeSequenceTables(tables = ASV_tables, repeats = "sum", tryRC = FALSE)

 ### collapse ASVs that have no mismatshes or internal indels
 if (collapseNoMismatch == "true") {
     print("# Collapsing identical ASVs ...")
     merged_table_collapsed = collapseNoMismatch(merged_table,
                             minOverlap = 20, orderBy = "abundance",
                             identicalOnly = FALSE, vec = TRUE,
                             band = -1, verbose = TRUE)
     saveRDS(merged_table_collapsed, "merged_table_collapsed.rds")

     ### format and save ASV table and ASVs.fasta files
     # sequence headers
     asv_seqs = colnames(merged_table_collapsed)
     asv_headers = openssl::sha1(asv_seqs)
     # transpose sequence table
     tmerged_table_collapsed = t(merged_table_collapsed)
     # add sequences to 1st column
     tmerged_table_collapsed = cbind(row.names(tmerged_table_collapsed), tmerged_table_collapsed)
     colnames(tmerged_table_collapsed)[1] = "Sequence"
     #row names as sequence headers
     row.names(tmerged_table_collapsed) = asv_headers
     # write ASVs.fasta
     asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs))
     write(asv_fasta, "ASVs.merged_collapsed.fasta")
     # write ASVs table
     write.table(tmerged_table_collapsed, "ASV_table.merged_collapsed.txt",
                             sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE)

     # print summary
     print(paste0("Output = ", length(colnames(merged_table_collapsed)),
                     " ASVs, with a total of ",
                     sum(rowSums(merged_table_collapsed)),
                     " sequences."))
 } else {
     saveRDS(merged_table, "merged_table.rds")
     ### format and save ASV table and ASVs.fasta files
     # sequence headers
     asv_seqs = colnames(merged_table)
     asv_headers = openssl::sha1(asv_seqs)
     # transpose sequence table
     tmerged_table = t(merged_table)
     # add sequences to 1st column
     tmerged_table = cbind(row.names(tmerged_table), tmerged_table)
     colnames(tmerged_table)[1] = "Sequence"
     #row names as sequence headers
     row.names(tmerged_table) = asv_headers
     # write ASVs.fasta to path_results
     asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs))
     write(asv_fasta, "ASVs.merged.fasta")
     # write ASVs table to path_results
     write.table(tmerged_table, "ASV_table.merged.txt",
                             sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE)

     # print summary
     print(paste0("Output = ", length(colnames(merged_table)),
                     " ASVs, with a total of ",
                     sum(rowSums(merged_table)),
                     " sequences."))
 }

Taxonomy assignment

Taxonomy assignment with SINTAX (vsearch) against BOLDistilled database.

— Download the latest BOLDistilled database for SINTAX here —

assign taxonomy with SINTAX

 #!/bin/bash

 # download the BOLDistilled reference database
 wget \
   "https://us-sea-1.linodeobjects.com/boldistilled/sintax.zip"
 # unzip the database
 unzip sintax.zip

 # specify reference database for SINTAX
 reference_database="sintax/BOLDistilled_COI_Oct2025_SEQUENCES_sintax.fasta"
 reference_database=$(realpath $reference_database) # get database names with full path

 # specify input fasta file
 cd ASV_table
 ASV_fasta="ASVs_TagJumpFiltered.fasta"
 ASV_fasta_tmp="ASVs_TagJumpFiltered_minmax.fasta"

 # select by size to only retain ASVs that are within the range of expected variation. In this case we set it to 400 up to 430 bps
 vsearch --fastx_filter $ASV_fasta \
         --fastq_minlen 400 \
         --fastq_maxlen 430 \
         --fastaout $ASV_fasta_tmp

 mv $ASV_fasta_tmp $ASV_fasta

 # Run SINTAX classification
 time vsearch --sintax $ASV_fasta \
     --db $reference_database \
     --tabbedout SINTAX.taxonomy.txt \
     --sintax_cutoff 0.8 \
     --threads 16

Get target taxa

This part filters the ASV dataset to include only target taxonomic group for the following analyses.
For example, if you are interested in Hymenoptera, then discard all ASVs that do not match to the target taxon based on the user defined threshold (default = 0.8).

See below if no pre-selection is preferred

get only target taxon annotations

 #!/usr/bin/env Rscript
 ### Filter dataset based on SINTAX results to include target taxa

 # specify taxon and threshold
 taxon="Animalia"  # target taxonomic group(s);
                      # multiple groups should be from the same taxonomic level
                      # separator is "," (e.g., "Hymenoptera, Lepidoptera")
 tax_level="kingdom"  # allowed levels: kingdom | phylum | class | order | family | genus
 threshold="0.8"      # threshold for considering an ASV as a target taxon
 class_threshold = 0.8  # threshold for class level identification

 # specify the ASV table and ASVs.fasta file that would be filtered to include only target taxa
 ASV_fasta = "ASVs_TagJumpFiltered.fasta"
 ASV_table = "ASV_table_TagJumpFiltered.txt"

 # specify the SINTAX-classifier output file (taxonomy file)
 taxtab="SINTAX.taxonomy.txt"

 #--------------------------------------#
 library(stringr)
 library(dplyr)
 library(Biostrings)

 # Function to parse SINTAX taxonomy format from vsearch output
 parse_sintax = function(tax_string) {
 # Initialize result with NAs
 result = list(
     kingdom = NA, kingdom_conf = 0,
     phylum = NA, phylum_conf = 0,
     class = NA, class_conf = 0,
     order = NA, order_conf = 0,
     family = NA, family_conf = 0,
     genus = NA, genus_conf = 0,
     species = NA, species_conf = 0
 )

 if (is.na(tax_string) || tax_string == "" || tax_string == "*") {
     return(result)
 }

 # Split by comma
 ranks = strsplit(tax_string, ",")[[1]]

 for (rank in ranks) {
     # Extract rank prefix (d:, k:, p:, c:, o:, f:, g:, s:)
     if (grepl("^d:", rank)) {
     # Domain (skip, not used)
     next
     } else if (grepl("^k:", rank)) {
     # Kingdom
     match = regmatches(rank, regexec("k:([^(]+)\\(([0-9.]+)\\)", rank))[[1]]
     if (length(match) == 3) {
         result$kingdom = match[2]
         result$kingdom_conf = as.numeric(match[3])
     }
     } else if (grepl("^p:", rank)) {
     # Phylum
     match = regmatches(rank, regexec("p:([^(]+)\\(([0-9.]+)\\)", rank))[[1]]
     if (length(match) == 3) {
         result$phylum = match[2]
         result$phylum_conf = as.numeric(match[3])
     }
     } else if (grepl("^c:", rank)) {
     # Class
     match = regmatches(rank, regexec("c:([^(]+)\\(([0-9.]+)\\)", rank))[[1]]
     if (length(match) == 3) {
         result$class = match[2]
         result$class_conf = as.numeric(match[3])
     }
     } else if (grepl("^o:", rank)) {
     # Order
     match = regmatches(rank, regexec("o:([^(]+)\\(([0-9.]+)\\)", rank))[[1]]
     if (length(match) == 3) {
         result$order = match[2]
         result$order_conf = as.numeric(match[3])
     }
     } else if (grepl("^f:", rank)) {
     # Family
     match = regmatches(rank, regexec("f:([^(]+)\\(([0-9.]+)\\)", rank))[[1]]
     if (length(match) == 3) {
         result$family = match[2]
         result$family_conf = as.numeric(match[3])
     }
     } else if (grepl("^g:", rank)) {
     # Genus
     match = regmatches(rank, regexec("g:([^(]+)\\(([0-9.]+)\\)", rank))[[1]]
     if (length(match) == 3) {
         result$genus = match[2]
         result$genus_conf = as.numeric(match[3])
     }
     } else if (grepl("^s:", rank)) {
     # Species
     match = regmatches(rank, regexec("s:([^(]+)\\(([0-9.]+)\\)", rank))[[1]]
     if (length(match) == 3) {
         result$species = match[2]
         result$species_conf = as.numeric(match[3])
     }
     }
 }

 return(result)
 }

 # read ASV table
 table = read.table(ASV_table, sep = "\t", check.names = F, header = T, row.names = 1)

 # read SINTAX taxonomy table (vsearch --sintax output format)
 # Format: ASV_ID \t taxonomy_string \t strand \t other_columns
 tax_raw = read.table(taxtab, sep = "\t", check.names = F, header = F,
                     stringsAsFactors = F, quote = "", comment.char = "", fill = TRUE)

 # Take first two columns only (ASV_ID and taxonomy)
 tax_raw = tax_raw[, 1:2]
 colnames(tax_raw) = c("ASV", "taxonomy")
 rownames(tax_raw) = tax_raw$ASV

 cat("\n Input =", nrow(tax_raw), "features.\n")

 # Parse SINTAX taxonomy strings
 tax_list = lapply(tax_raw$taxonomy, parse_sintax)
 tax = do.call(rbind, lapply(tax_list, as.data.frame))
 rownames(tax) = tax_raw$ASV

 # taxon list
 taxon_list = strsplit(taxon, ", ")[[1]]

 ### extract only target-taxon ASVs from the 'raw' SINTAX results
 tax_filtered = tax %>%
     filter(.data[[tax_level]] %in% taxon_list)

 cat("\n Found", nrow(tax_filtered), "ASVs matching", taxon, "at", tax_level, "level.\n")

 ### Apply additional filter: class must be identified (confidence >= class_threshold)
 tax_filtered = tax_filtered %>%
     filter(class_conf >= class_threshold & !is.na(class))

 cat(" After filtering for class identification (threshold >=", class_threshold, "):",
     nrow(tax_filtered), "ASVs retained.\n")

 ### change all tax ranks to "unclassified_*" when
     # the confidence values is less than the specified threshold
 # kingdom
 tax_filtered = tax_filtered %>%
     mutate(kingdom = ifelse(kingdom_conf < threshold | is.na(kingdom),
                             "unclassified_root", as.character(kingdom)))

 # phylum
 tax_filtered = tax_filtered %>%
     mutate(phylum = ifelse(phylum_conf < threshold | is.na(phylum),
                         paste0("unclassified_", kingdom), as.character(phylum)))
 tax_filtered$phylum = stringr::str_replace(tax_filtered$phylum, "unclassified_unclassified_",
                                         "unclassified_")

 # class
 tax_filtered = tax_filtered %>%
     mutate(class = ifelse(class_conf < threshold | is.na(class),
                         paste0("unclassified_", phylum), as.character(class)))
 tax_filtered$class = stringr::str_replace(tax_filtered$class, "unclassified_unclassified_",
                                         "unclassified_")

 # order
 tax_filtered = tax_filtered %>%
     mutate(order = ifelse(order_conf < threshold | is.na(order),
                         paste0("unclassified_", class), as.character(order)))
 tax_filtered$order = stringr::str_replace(tax_filtered$order, "unclassified_unclassified_",
                                         "unclassified_")

 # family
 tax_filtered = tax_filtered %>%
     mutate(family = ifelse(family_conf < threshold | is.na(family),
                         paste0("unclassified_", order), as.character(family)))
 tax_filtered$family = stringr::str_replace(tax_filtered$family, "unclassified_unclassified_",
                                         "unclassified_")

 # genus
 tax_filtered = tax_filtered %>%
     mutate(genus = ifelse(genus_conf < threshold | is.na(genus),
                         paste0("unclassified_", family), as.character(genus)))
 tax_filtered$genus = stringr::str_replace(tax_filtered$genus, "unclassified_unclassified_",
                                         "unclassified_")

 # species to genus_sp when the confidence values is < 0.9
 tax_filtered = tax_filtered %>%
     mutate(species = ifelse(species_conf < 0.9 | is.na(species),
                             paste0(genus, "_sp"), as.character(species)))

 ### count occurrences of each taxon in df (SINTAX results)
 count_taxa = function(df, taxa) {
 sapply(taxa, function(taxon) sum(apply(df, 1, function(row) any(row == taxon))))
 }
 taxon_counts = count_taxa(tax_filtered, taxon_list)

 # Check the counts
 if (all(taxon_counts == 0)) {
     print("ERROR: None of the specified taxa are present in the SINTAX results.")
 } else {
     if (any(taxon_counts == 0)) {
         warning("One or more of the specified taxa are not present in the SINTAX results.")
     }
     cat("\n Taxon counts:\n")
     print(taxon_counts)
 }

 ### extract only target-taxon ASVs from the 'threshold filtered' SINTAX results
 tax_filtered_thresh = tax_filtered %>%
     filter(.data[[tax_level]] %in% taxon_list)

 # Remove confidence columns for output
 tax_filtered_output = tax_filtered_thresh %>%
     select(kingdom, phylum, class, order, family, genus, species)

 # write filtered SINTAX taxonomy table
 tax_filtered_output = cbind(ASV = rownames(tax_filtered_output), tax_filtered_output)
 write.table(tax_filtered_output,
             file = "SINTAX.taxonomy.filt.txt",
             quote = F,
             row.names = F,
             sep = "\t")

 ### filter the ASV table to match ASVs that were kept in the tax_filtered table
 table_filt = table[rownames(table) %in% rownames(tax_filtered_thresh), ]

 ### check ASV table; if 1st col is sequence, then remove it for metaMATE
 if (colnames(table_filt)[1] == "Sequence") {
     cat("\n;; 1st column was 'Sequence', removing this ... \n")
     table_filt = table_filt[, -1]
 }

 # write filtered table
 table_filt = cbind(ASV = rownames(table_filt), table_filt)
 write.table(table_filt,
             file = paste0(sub("\\.[^.]*$", "_tax_filt.txt", ASV_table)),
             quote = F,
             row.names = F,
             sep = "\t")

 # filter ASV_fasta
 fasta = readDNAStringSet(ASV_fasta)
 fasta.tax_filt = fasta[names(fasta) %in% rownames(table_filt)]

 # write filtered ASV_fasta
 writeXStringSet(fasta.tax_filt,
                 paste0(sub("\\.[^.]*$", "_tax_filt.fasta", ASV_fasta)),
                 width = max(width(fasta.tax_filt)))

If no pre-selection is preferred, then just remove “Sequence” column from the ASV table

remove “Sequence” column from the ASV table

 # read ASV table
 ASV_table = "ASV_table.txt"
 table = read.table(ASV_table, sep = "\t", check.names = F, header = T, row.names = 1)

 # check ASV table; if 1st col is sequence, then remove it for metaMATE
 if (colnames(table)[1] == "Sequence") {
     cat("## removing 'Sequence' column ... \n")
     table = table[, -1]

     # write filtered table
     table_filt = cbind(ASV = rownames(table), table)
     write.table(table_filt,
             file = paste0(sub("\\.[^.]*$", ".noSeq.txt", ASV_table)),
             quote = F,
             row.names = F,
             sep = "\t")

 } else {
     cat("## there was no 'Sequence' column; proceed with the current table ... \n")
 }

Remove NUMTs

Remove putative NUMTs with metaMATE.

This follows the workflow to automatically filter the ASVs by retaining maximum of 5% of estimated non-authentic-ASVs (verifiednonauthentic_retained_p < 0.05).

Important

1. metaMATE expects specifications file that states the filtering strategies. See more info here. Here, we will be using the metaMATE’s default specifications.txt file.

metaMATE requires a reference COI database to determine verified-authentic ASVs. Herein using BOLDistilled database.

— Download the latest BOLDistilled database here (click) —

If you have your own set of reference sequences, then use those; or merge those with the other databases (such as the above one) to extend the ref. database.

Check standard genetic codes here for genetic_code setting below.

get required specifications file and ref database

#!/bin/bash

 # download the default specifications file,
   # using this in metaMATE-find
 wget "https://raw.githubusercontent.com/tjcreedy/metamate/main/specifications.txt"
 # specify specifications file for metaMATE
 specifications="specifications.txt"
 specifications=$(realpath $specifications) # get full directory path


 # download the BOLDistilled reference databse
 wget \
  "https://us-sea-1.linodeobjects.com/boldistilled/sintax.zip"
 # unzip the database and edit name
 unzip sintax.zip

 # specify reference database for metaMATE
 reference_database="sintax/BOLDistilled_COI_Oct2025_SEQUENCES_sintax.fasta"
 reference_database=$(realpath $reference_database) # get full directory path

cluster ASVs at a 90% similarity threshold for abundance filtering

 #!/bin/bash
 ## run metaMATE-find

 # cluster with 10% threshold
 vsearch --cluster_fast ASVs_TagJumpFiltered_tax_filt.fasta --id 0.9 \
 --uc ASVs_TagJumpFiltered_tax_filt_clustered.uc

 # select only H & S
 cat ASVs_TagJumpFiltered_tax_filt_clustered.uc | grep -v "^C" \
 > ASVs_TagJumpFiltered_tax_filt_clustered_onlyHS.uc

 # now extract the information to match the input requirements from metamate
 awk -F'\t' 'BEGIN {OFS=","} {print $9, $2}' ASVs_TagJumpFiltered_tax_filt_clustered_onlyHS.uc \
 > ASV_to_cluster_map.csv

run metaMATE-find

 #!/bin/bash
 ## run metaMATE-find

 ## go to the directory that hosts your ASVs.fasta and ASV table files.

 # specify input ASVs table and fasta
 ASV_table="ASV_table_TagJumpFiltered_tax_filt.txt" # make sure that the 2nd col is not "Sequence"
 ASV_fasta="ASVs_TagJumpFiltered_tax_filt.fasta"    # specify ASVs fasta file
 taxgroups="ASV_to_cluster_map.csv"                 # comment out or change filename if sequence binning is done in another way

 # specify variables
 genetic_code="5"        # the standard genetic code. 5 is invertebrate mitochondrial code
 length="418"            # the expected length of an amplicon
 basesvariation="9"      # allowed length variation (bp) from the expected length of an amplicon
 taxgroups="undefined"   # (optional); if sequence binning is to be performed on
                            # a per-taxon basis (as in specifications file)
                            # then specify the taxon grouping file
 NA_abund_thresh="0.05"  # verifiednonauthentic_retained_p < 0.05 (value from mateMATE results);
                            # the allowed abundance threshold of
                            # non-validated OTUs/ASVs in the filtered dataset.
 abundance_filt="TRUE"  # TRUE/FALSE; if FALSE, then NA_abund_thresh is ineffective,
                            # and no filtering is done based on the ASV abundances,
                            # i.e., filter only based on length, basesvariation and genetic_code.
                            # FALSE may be used when the seq-depth for the target taxa is low.
                            # If TRUE, then NA_abund_thresh will be applied.

 ##

 # check if taxgroups is specified, if not then this var is empty.
 if [[ $taxgroups != "undefined" ]]; then
     taxgroups=$"--taxgroups $taxgroups"
 else
     taxgroups=$""
 fi

 #output dir
 output_dir=$"metamate_out"
 echo "output_dir = $output_dir"
 # remove old $output_dir if exists
 if [[ -d $output_dir ]]; then
     rm -rf $output_dir
 fi

 # if perfoming clade binning, then WARNING when processing more than 65,536 ASVs
 ASVcount=$(grep -c "^>" $ASV_fasta)
 if (( $ASVcount > 65536 )); then
     printf '%s\n' "WARNING]: clade binning NOT performed,
      because the input ASVs limit is 65,536 for that.
      Current input has $ASVcount ASVs."
 fi

 # check abundance_filt;
  # if FALSE then make new specifications file, that excludes abundance filtering
 if [[ $abundance_filt == "FALSE" ]]; then
     printf '%s\n' "[library; n; 0-1/2]" > specifications0.txt
     specifications=$(realpath specifications0.txt)
 fi

 # quick check of the specifications file, has to contain "library" | "total" | "clade" | "taxon"
 if ! grep -q -e "library" -e "total" -e "clade" -e "taxon" $specifications; then
     printf '%s\n' "ERROR]: specifications file seems to be wrong.
      Does not contain any of the terms (library, total, clade, taxon)."
 fi

 ### metaMATE-find
 printf "# Running metaMATE-find\n"
 metamate find \
     --asvs $ASV_fasta \
     --readmap $ASV_table \
     --specification $specifications \
     --references $reference_database \
     --expectedlength $length \
     --basesvariation $basesvariation \
     --table $genetic_code \
     --threads 8 \
     --output $output_dir \
     --overwrite $taxgroups \
     --realign

 # check for the presence of "metamate_out" dir and "resultcache" file (did metaMATE-find finish)
 if [[ -d $output_dir ]] && [[ -e $output_dir/resultcache ]] && [[ -e $output_dir/results.csv ]]; then
     printf '\n%s\n\n' "metaMATE-find finished, proceed"
     # export variables for below script (Rscript)
     if [[ $abundance_filt != "FALSE" ]]; then
         printf '%s\n' "exporting NA_abund_thresh of $NA_abund_thresh for metaMATE-dump"
         export NA_abund_thresh
     else
         export output_dir
         export abundance_filt
     fi
 else
     printf '%s\n' "ERROR]: cannot find the $output_dir (metaMATE-find output)
      to start metaMATE-dump OR no authentic ASVs found??"
 fi

get the results_index from the metamate_out/results.csv file

 #!/usr/bin/env Rscript

 ## read results.csv
 output_dir = Sys.getenv('output_dir') # = "metamate_out" as specified above
 find_results = read.csv(file.path(output_dir, "results.csv"))

 # get variables
 abundance_filt = Sys.getenv('abundance_filt')

 ## filter results if abundance_filt is FALSE
 if (abundance_filt == "FALSE"){
     result_index = "0" # get first result_index (library_n = 0)
     write(result_index, file.path(output_dir, "selected_result_index.txt"))
 }

 ## filter results based on NA_abund_thresh
 if (abundance_filt != "FALSE"){
     NA_abund_thresh = as.numeric(Sys.getenv('NA_abund_thresh'))
     filtered_data = find_results[
                     find_results$verifiednonauthentic_retained_p <= NA_abund_thresh, ]

     # if no results correspond with the NA_abund_thresh, then get the next best
         # else, just select the result_index that corresponds to
             # NA_abund_thresh with highest accuracy_score
     if (nrow(filtered_data) == 0) {
         cat(
         "\n no results correspond with the NA_abund_thresh of", NA_abund_thresh, ";
         getting the next best setting\n"
         )
         next_best = min(find_results$verifiednonauthentic_retained_p)
         filtered_data = find_results[
                         find_results$verifiednonauthentic_retained_p <= next_best, ]
         # sort based on accuracy_score
         sorted_filtered = filtered_data[order(-filtered_data$accuracy_score), ]
         # get the result with the highest accuracy_score
         metamate_selected_threshold = sorted_filtered[1,]
         write.csv(metamate_selected_threshold, file.path(output_dir, "next_best_set.csv"),
                                             quote = F)
         # the result_index of the NA_abund_thresh with the highest accuracy_score
         result_index = metamate_selected_threshold[,1]
         write(result_index, file.path(output_dir, "selected_result_index.txt"))
     } else {
         # sort based on accuracy_score
         sorted_filtered = filtered_data[order(-filtered_data$accuracy_score), ]
         # get the result with the highest accuracy_score
         metamate_selected_threshold = sorted_filtered[1,]
         # the result_index of the NA_abund_thresh with the highest accuracy_score
         result_index = metamate_selected_threshold[,1]
         write(result_index, file.path(output_dir, "selected_result_index.txt"))
     }
 }

run metaMATE-dump to discard putative artefact ASVs

 #!/bin/bash

 ## metaMATE-dump
 ASV_fasta=$(basename $ASV_fasta)

 # read result_index
 read -r result_index < $output_dir/selected_result_index.txt
 printf '%s\n' " - selected result_index = $result_index"

 # run metaMATE-dump
 printf '%s\n' "# Running metaMATE-dump"
 metamate dump \
 --asvs $ASV_fasta \
 --resultcache $output_dir/resultcache \
 --output $output_dir/${ASV_fasta%.*}_metaMATE.filt \
 --overwrite \
 --resultindex $result_index

 # generate a list of ASV IDs
 seqkit seq -n $output_dir/${ASV_fasta%.*}_metaMATE.filt.fasta > \
                     $output_dir/${ASV_fasta%.*}_metaMATE.filt.list

 # filter the ASV table; include only the ASVs that are in ${ASV_fasta%.*}_metaMATE.filt.list
 awk -v var="$output_dir/${ASV_fasta%.*}" 'NR==1; NR>1 {print $0 | \
         "grep -Fwf "var"_metaMATE.filt.list"}' $ASV_table > \
          $output_dir/${ASV_table%.*}_metaMATE.filt.txt

 # filter the sintax.taxonomy.filt.txt file to include only ASVs retained by metaMATE
 awk -v var="$output_dir/${ASV_fasta%.*}" 'NR==1; NR>1 {print $0 | \
         "grep -Fwf "var"_metaMATE.filt.list"}' sintax.taxonomy.filt.txt > \
         $output_dir/sintax.taxonomy.metaMATE.filt.txt


 # write discarded ASVs list
 comm -3 <(sort <(seqkit seq -n $ASV_fasta)) \
         <(sort $output_dir/${ASV_fasta%.*}_metaMATE.filt.list) \
         > $output_dir/metaMATE.discarded.list

optionally rescue discarded ASVs that have GENUS level bootstrap value > 0.9

 #!/bin/bash

 # get discarded ASVs (sintax taxonomy list)
 grep -Fwf $output_dir/metaMATE.discarded.list sintax.taxonomy.filt.txt \
 > $output_dir/metaMATE.discarded.sintax.taxonomy.txt

 # get the rescued ASVs that have GENUS level bootstrap value >= 0.9
 awk -F'\t' '$26 >= 0.9' $output_dir/metaMATE.discarded.sintax.taxonomy.txt \
 > $output_dir/rescued.txt

 # check if rescued.txt exists and is not empty
 if [[ -s $output_dir/rescued.txt ]]; then
     # add the rescued ASVs to $output_dir/sintax.taxonomy.metaMATE.filt.txt
     cat $output_dir/rescued.txt >> $output_dir/sintax.taxonomy.metaMATE.filt.txt

     # add the rescued ASVs to $output_dir/${ASV_fasta%.*}_metaMATE.filt.fasta
     seqkit grep -w 0 -f <(awk -F'\t' '{print $1}' $output_dir/rescued.txt) $ASV_fasta \
         >> $output_dir/${ASV_fasta%.*}_metaMATE.filt.fasta

     # add the rescued ASVs to $output_dir/${ASV_table%.*}_metaMATE.filt.txt
     grep -wf <(awk -F'\t' '{print $1}' $output_dir/rescued.txt) $ASV_table \
         >> $output_dir/${ASV_table%.*}_metaMATE.filt.txt

     printf '%s\n' "Rescued $(wc -l < $output_dir/rescued.txt) ASVs"
 else
     printf '%s\n' "No ASVs to rescue"
 fi

Note

Herein case, the final filtered data is ASV_table_tax_filt_metaMATE.filt.txt and ASVs_tax_filt_metaMATE.filt.fasta in the metamate_out directory. The filtered SINTAX-classifier results (matching the ASVs in the latter files) is sintax.taxonomy.metaMATE.filt.txt in the metamate_out dir.

If deemed relevant, then you may proceed with the below workflow below that includes clustering ASVs to OTUs.

Clustering ASVs to OTUs

This step clusters ASVs to OTUs with vsearch.

get the size of ASVs

 #!/usr/bin/env Rscript

 # specify input ASVs table and fasta
 ASV_table="ASV_table_tax_filt_metaMATE.filt.txt" # specify ASV table file
 ASV_fasta="ASVs_tax_filt_metaMATE.filt.fasta"    # specify ASVs fasta file

 ################################
 library(Biostrings)
 # Read the ASV table
 ASV_table = read.table(ASV_table, sep = "\t", check.names = F,
                             header = T, row.names = 1)

 # add 'sum' column
 ASV_table$sum = rowSums(ASV_table)
 # make ASV_sums object
 ASV_sums = setNames(ASV_table$sum, rownames(ASV_table))

 # Read the FASTA file
 ASV_fasta = readDNAStringSet(ASV_fasta)

 # add ";size=*" to ASV_fasta
 names(ASV_fasta) = sapply(names(ASV_fasta), function(header) {
     paste0(header, ";size=", ASV_sums[header])
 })
 # write fasta file
 writeXStringSet(ASV_fasta, "ASVs.size.fasta",
                         width = max(width(ASV_fasta)))

clustering with vsearch

 #!/bin/bash

 # specify the clustering threshold
 clustering_thresh="0.97"

 # make output dir.
 output_dir="OTU_table"
 mkdir -p $output_dir
 export output_dir

 ### cluster ASVs using vsearch.
 vsearch --cluster_fast ASVs.size.fasta \
     --id $clustering_thresh \
     --iddef 2 \
     --sizein \
     --xsize \
     --fasta_width 0 \
     --centroids $output_dir/OTUs.fasta \
     --uc $output_dir/OTUs.uc

generate an OTU table based on the clustered ASVs (.uc file).

 #!/usr/bin/Rscript

 # specify input ASV table (the same one as for 'get the size of ASVs')
 ASV_table="ASV_table_tax_filt_metaMATE.filt.txt"

 # read output dir
 output_dir = Sys.getenv('output_dir')

 # read output from vsearch clustering (-uc OTU.uc)
 inp_UC = file.path(output_dir, "OTUs.uc")
 ################################
 library(data.table)
 # load input data - ASV table
 ASV_table = fread(file = ASV_table, header = TRUE, sep = "\t")

 ## Load input data - UC mapping file
 UC = fread(file = inp_UC, header = FALSE, sep = "\t")
 UC = UC[ V1 != "S" ]
 UC[, ASV := tstrsplit(V9, ";", keep = 1) ]
 UC[, OTU := tstrsplit(V10, ";", keep = 1) ]
 UC[V1 == "C", OTU := ASV ]
 UC = UC[, .(ASV, OTU)]

 # convert ASV table to long format
 ASV = melt(data = ASV_table,
     id.vars = colnames(ASV_table)[1],
     variable.name = "SampleID", value.name = "Abundance")
 ASV = ASV[ Abundance > 0 ]
  # add colnames, to make sure 1st is 'ASV'
 colnames(ASV) = c("ASV", "SampleID", "Abundance")

 # add OTU IDs
 ASV = merge(x = ASV, y = UC, by = "ASV", all.x = TRUE)
 # summarize
 OTU = ASV[ , .(Abundance = sum(Abundance, na.rm = TRUE)),
                             by = c("SampleID", "OTU")]

 # reshape OTU table to wide format
 OTU_table = dcast(data = ASV,
     formula = OTU ~ SampleID,
     value.var = "Abundance",
     fun.aggregate = sum, fill = 0)

 # write OTU table
  # OTU names correspond to most abundant ASV in an OTU
 fwrite(x = OTU_table, file = file.path(output_dir,
                                 "OTU_table.txt"), sep = "\t")

Post-clustering

Post-cluster OTUs with LULU to merge consistently co-occurring ‘daughter-OTUs’.

generate match list for post-clustering

 #!/bin/bash

 # go to directrory that contains OTUs
 cd $output_dir # 'OTU_table' in this case

 # make blast database for post-clustering
 makeblastdb -in OTUs.fasta -parse_seqids -dbtype nucl

 # generate match list for post-clustering
 blastn -db OTUs.fasta \
     -outfmt '6 qseqid sseqid pident' \
     -out match_list.txt \
     -qcov_hsp_perc 75 \
     -perc_identity 90 \
     -query OTUs.fasta \
     -num_threads 20

run LULU post-clustering

 #!/usr/bin/Rscript

 # specify minimum threshold of sequence similarity considering any OTU as an error of another
 min_match = "90"

 # specify OTU table
 OTU_table="OTU_table.txt"

 ################################
 library(devtools)
 # load OTU table and match list
 otutable = read.table(OTU_table, header = T, row.names = 1, sep = "\t")
 matchlist = read.table("match_list.txt")

 curated_result = lulu::lulu(otutable, matchlist,
     minimum_match = min_match)

 # write post-clustered OTU table to file
 curated_table = curated_result$curated_table
 curated_table = cbind(OTU = rownames(curated_table), curated_table)
 write.table(curated_table, file ="OTU_table_LULU.txt",
             sep = "\t", row.names = F, quote = FALSE)
 write.table(curated_result$discarded_otus,
             file ="merged_units.lulu", col.names = FALSE, quote = FALSE)

Note

Note that if the sample names start with a number, then the output OTU table may contain “X” prefix in the sample names.

match OTUs.fasta and taxonomy table with post-clustered table (OTU_table_LULU)

 #!/bin/bash

 # specify post-clustered table
 OTU_table="OTU_table_LULU.txt"
 # specify pre post-clustered OTUs fasta file
 OTUs_fasta="OTUs.fasta"

 # get matching OTUs
 awk 'NR>1{print $1}' $OTU_table > OTUs_LULU.list
 cat $OTUs_fasta | \
   seqkit grep -w 0 -f OTUs_LULU.list > OTUs_LULU.fasta

 # get matching sintax taxonomy results
 head -n 1 ../sintax.taxonomy.metaMATE.filt.txt > sintax.taxonomy.txt
 cat ../sintax.taxonomy.metaMATE.filt.txt | \
   grep -wf OTUs_LULU.list >> sintax.taxonomy.txt

 # remove unnecessary files
 rm OTUs.fasta.n*

 # move OTU_table two directories down
 cd ..
 mv $output_dir ../..

Note

The final OTUs data is OTU_table_LULU.txt and OTUs_LULU.fasta in the OTU_table directory.

The matching SINTAX taxonomy files are sintax.taxonomy.txt in the OTU_table directory.

We acknowledge CSC - IT Center for Science, Finland, for computational resources while building and testing this workflows.