.. |logo_BGE_alpha| image:: _static/logo_BGE_alpha.png :width: 300 :target: https://biodiversitygenomics.eu/ .. |eufund| image:: _static/eu_co-funded.png :width: 200 .. |chfund| image:: _static/ch-logo-200x50.png :width: 210 .. |ukrifund| image:: _static/ukri-logo-200x59.png :width: 150 .. |logo_BGE_small| image:: _static/logo_BGE_alpha.png :width: 120 :target: https://biodiversitygenomics.eu/ .. |main_interface| image:: _static/main_interface.png :width: 2000 .. raw:: html .. role:: red .. raw:: html .. role:: yellow-background |logo_BGE_alpha| Arthropods/COI ************** This is executable step-by-step pipeline for **COI** amplicon data from Illumina sequencing machine. The bioinformatic workflow results in amplicon sequence variants (ASVs) and well as operational taxonomic units (OTUs). .. note:: The **full bioinformatics workflow can be automatically run through** `PipeCraft2 `_, a software package which represents a graphical user interface wrapper for all the bioinformatic steps below. `See the example workflow for COI `_. |main_interface| .. admonition:: Citation of the pipeline When using this pipeline, please cite as: Anslan, S, Bahram, M, Hiiesalu, I, Tedersoo, L. PipeCraft: Flexible open-source toolkit for bioinformatics analysis of custom high-throughput amplicon sequencing data. Mol Ecol Resour. 2017; 17: e234-e240. https://doi.org/10.1111/1755-0998.12692 Also, please **cite the original resources of the wrapped software**. Dependencies ~~~~~~~~~~~~ +-----------------------------------------------------+----------+---------+ | Process | Software | Version | +=====================================================+==========+=========+ | :ref:`Remove primers ` | cutadapt | 4.9 | +-----------------------------------------------------+----------+---------+ | :ref:`Quality filtering ` | DADA2 | 1.30 | +-----------------------------------------------------+----------+---------+ | :ref:`Denoise ` | DADA2 | 1.30 | +-----------------------------------------------------+----------+---------+ | :ref:`Merge paired-end reads ` | DADA2 | 1.30 | +-----------------------------------------------------+----------+---------+ | :ref:`Chimera filtering ` | DADA2 | 1.30 | +-----------------------------------------------------+----------+---------+ | :ref:`Remove tag-jumps ` | UNCROSS2 | | +-----------------------------------------------------+----------+---------+ | :ref:`Merge sequencing runs* ` | DADA2 | 1.30 | +-----------------------------------------------------+----------+---------+ | :ref:`Taxonomy assignment ` | vsearch | 2.28.1 | +-----------------------------------------------------+----------+---------+ | :ref:`Get target taxa ` | R | | +-----------------------------------------------------+----------+---------+ | :ref:`Remove NUMTs ` | metaMATE | 0.4.3 | +-----------------------------------------------------+----------+---------+ | :ref:`Clustering ASVs to OTUs ` | vsearch | 2.28.1 | +-----------------------------------------------------+----------+---------+ | :ref:`Post-clusteringlustering ` | BLAST | 2.12.0+ | +-----------------------------------------------------+----------+---------+ | :ref:`Post-clusteringlustering ` | LULU | 0.1.0+ | +-----------------------------------------------------+----------+---------+ \*only applicable when there are multiple sequencing runs per study. .. note:: All the dependencies/software of the pipeline are available on a `Docker image `_. | Download `Docker for windows `_ | Download `Docker for Mac `_ | Install Docker for Linux - `follow the guidelines under appropriate Linux distribution `_ .. code-block:: bash :caption: get the Docker image docker pull pipecraft/bioscanflow:2 .. code-block:: bash :caption: example of running the pipeline via Docker image # run docker # specify the files location with -v flag ($PWD = the current working directory) docker run -i --tty -v $PWD/:/Files pipecraft/bioscanflow:2 # inside the container, the files are accessible in the /Files dir cd Files # checking if cutadapt is available cutadapt -h # ready to run the pipe as below ... ## make sure that via the shared folder (-v) path you have access also to the reference databases. .. code-block:: bash :caption: convert docker image to apptainer (singularity) image for running the pipeline in HPC # pull the docker image and save it as a singularity image apptainer pull --name bioscanflow_2.sif docker://pipecraft/bioscanflow:2 # test if eg cutadapt is available apptainer exec bioscanflow_2.sif cutadapt -h # run the script via singularity image apptainer exec --bind $PWD:/Files bioscanflow_2.sif /Files/run_pipe.sh Data structure ~~~~~~~~~~~~~~ .. _multiRunDirCOI: Multiple sequencing runs ------------------------ .. important:: When aiming to combine samples from multiple sequencing runs, then follow the below directory structure **Directory structure:** | **/multiRunDir** *(directory names can be changed)* | ├── **/sequencing_set01** | │ ├── *sample1.R1.fastq* | │ ├── *sample1.R2.fastq* | │ ├── *sample2.R1.fastq* | │ ├── *sample2.R2.fastq* | │ ├── ... | ├── **/sequencing_set02** | │ ├── *sampleA.R1.fastq* | │ ├── *sampleA.R2.fastq* | │ ├── *sampleB.R1.fastq* | │ ├── *sampleB.R2.fastq* | │ ├── ... | └── **/sequencing_set03** | ├── *sample11.R1.fastq* | ├── *sample11.R2.fastq* | ├── *sample12.R1.fastq* | ├── *sample12.R2.fastq* | ├── ... .. note:: Fastq files with the **same name** will be considered as the same sample and will be merged in the "Merge sequencing runs" step. Single sequencing run --------------------- | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` | ____________________________________________________ .. _remove_primersCOI: Remove primers ~~~~~~~~~~~~~~ | Remove primer strings from paired-end data. | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. note:: Here, assuming that all sequences are in 5'-3' orientation! *(3'-5' orient sequences will be discarded with this workflow)* .. important:: | - Paired-end files must contain "R1" and "R2" strings (not just _1 and _2)! | - Sample names must not contain "R1" and "R2" strings (i.e. not FR123_001_R1.fastq/FR123_001_R2.fastq) .. code-block:: bash :caption: remove primers with cutadapt :emphasize-lines: 27-33, 60-61 :linenos: #!/bin/bash ## workflow to remove primers with cutadapt # specify the identifier string for the R1 files read_R1="_R1" # specify primers fwd_primer=$"CCHGAYATRGCHTTYCCHCG" # this is the forward primer BF3 rev_primer=$"CDGGRTGNCCRAARAAYCA" # this is the reverse primer BR2 # specify primers #fwd_primer=$"GGWACWRGWTGRACWITITAYCCYCC" # this is the forward primer mICOIintF-XT #rev_primer=$"TAIACYTCIGGRTGICCRAARAAYCA" # this is the reverse primer jgHCO2198 # specify primers #fwd_primer=$"GGDACWGGWTGAACWGTWTAYCCHCC" # this is the forward primer FwhF2 #rev_primer=$"GTRATWGCHCCDGCTARWACWGG" # this is the reverse primer FwhR2n # edit primer trimming settings mismatches="2" # Numer of allowed mismatches in primer string search; # if set as 1, then allow 1 mismatch; # if set as 0.1, then allow mismatch in 10% of the bases. overlap="22" # The minimum overlap length. Keep it nearly as high # as the primer length to avoid short random matches. pair_filter="any" # Option 'any' discards a read pair if primers are not found in # either of the read pairs (R1 and R2). # Option 'both' keeps the read pair if a primer is found in # at least one of the read pairs. ## # get the reverse complementary of the primers # needed when the amplicon length is shorter than the sequencing cycle fwd_primer_rc=$(echo $fwd_primer | rev | tr "ACGTRYKMBDHV" "TGCAYRMKVHDB") rev_primer_rc=$(echo $rev_primer | rev | tr "ACGTRYKMBDHV" "TGCAYRMKVHDB") # get directory names if working with multiple sequencing runs # in that case, my working folder = /multiRunDir (see dir structure above) DIRS=$(ls -d *) # -> sequencing_set01 sequencing_set02 sequencing_set03 for sequencing_run in $DIRS; do printf "\nWorking with $sequencing_run \n" cd $sequencing_run #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# # make output dirs mkdir -p primersCut_out mkdir -p primersCut_out/untrimmed ### Clip primers with cutadapt for inputR1 in *$read_R1*; do inputR2=$(echo $inputR1 | sed -e 's/R1/R2/') cutadapt --quiet \ -e $mismatches \ --minimum-length 32 \ --overlap $overlap \ --no-indels \ --cores=0 \ --untrimmed-output primersCut_out/untrimmed/$inputR1 \ --untrimmed-paired-output primersCut_out/untrimmed/$inputR2 \ --pair-filter=$pair_filter \ -g $fwd_primer \ -a $fwd_primer...$rev_primer_rc";optional" \ -G $rev_primer \ -A $rev_primer...$fwd_primer_rc";optional" \ -o primersCut_out/$inputR1 \ -p primersCut_out/$inputR2 \ $inputR1 $inputR2 done #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# cd .. done ____________________________________________________ .. _quality_filteringCOI: Quality filtering ~~~~~~~~~~~~~~~~~ | Quality filtering of the fastq files based on the allowed maximum error rate per sequence (as in DADA2). | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. code-block:: R :caption: quality filtering in DADA2 (in R) :emphasize-lines: 13-19, 67-71 :linenos: #!/usr/bin/Rscript ## workflow to perform quality filtering within DADA2 #load dada2 library library('dada2') # specify the identifier string for the R1 files read_R1 = "_R1" # get the identifier string for the R2 files read_R2 = gsub("R1", "R2", read_R1) # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) print(paste0("Working with ", dirs[i])) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# # output path path_results = "qualFiltered_out" # input and output file paths R1s = sort(list.files("primersCut_out", pattern = read_R1, full.names = TRUE)) R2s = sort(list.files("primersCut_out", pattern = read_R2, full.names = TRUE)) #sample names sample_names = sapply(strsplit(basename(R1s), read_R1), `[`, 1) # filtered files path filtR1 = file.path(path_results, paste0(sample_names, ".R1.", "fastq.gz")) filtR2 = file.path(path_results, paste0(sample_names, ".R2.", "fastq.gz")) names(filtR1) = sample_names names(filtR2) = sample_names #quality filtering qfilt = filterAndTrim(R1s, filtR1, R2s, filtR2, maxN = 0, # max number of allowed N bases. maxEE = c(2, 2), # max error rate per R1 and R2 read, respectively. truncQ = 2, # truncate reads at the first instance of a quality score less than or equal to specified value. truncLen = c(0, 0), # truncate reads after specified length for R1 and R2 reads, respectively. maxLen = 600, # discard reads longer than specified. minLen = 100, # discard reads shorter than specified. minQ = 2, # discard reads (after truncation) that contain a quality score below specified value. matchIDs = TRUE, # output paired-end reads with matching IDs (for merging). compress = TRUE, # gzip the output multithread = TRUE) # use multiple threads saveRDS(qfilt, file.path(path_results, "qfilt_reads.rds")) # make sequence count report seq_count = cbind(qfilt) colnames(seq_count) = c("input", "qualFiltered") seq_count = as.data.frame(seq_count) seq_count$sample = sample_names # reorder columns seq_count = seq_count[, c("sample", "input", "qualFiltered")] write.csv(seq_count, file.path(path_results, "seq_count_summary.csv"), row.names = FALSE, quote = FALSE) # save filtered R objects for denoising and merging (below) filtR1 = sort(list.files(path_results, pattern = ".R1.fastq.gz", full.names = TRUE)) filtR2 = sort(list.files(path_results, pattern = ".R2.fastq.gz", full.names = TRUE)) sample_names = sapply(strsplit(basename(filtR1), ".R1.fastq.gz"), `[`, 1) saveRDS(filtR1, file.path(path_results, "filtR1.rds")) saveRDS(filtR2, file.path(path_results, "filtR2.rds")) saveRDS(sample_names, file.path(path_results, "sample_names.rds")) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# #set working directory back to "/multiRunDir" setwd(wd) i = i + 1 } } ____________________________________________________ .. _denoiseCOI: Denoise and merge paired-end reads ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | Denoise and merge paired-end Illumina reads as in DADA2. | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. code-block:: R :caption: denoise and merge paired-end reads in DADA2 :emphasize-lines: 7-13, 73-77 :linenos: #!/usr/bin/Rscript ## workflow to perform DADA2 denoising and merging # load dada2 library library('dada2') # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) print(paste0("Working with ", dirs[i])) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# #load quality filtered files filtR1 = readRDS("qualFiltered_out/filtR1.rds") filtR2 = readRDS("qualFiltered_out/filtR2.rds") qfilt = readRDS("qualFiltered_out/qfilt_reads.rds") sample_names = readRDS("qualFiltered_out/sample_names.rds") cat("\n sample names = ", sample_names, "\n ") names(filtR1) = sample_names names(filtR2) = sample_names # create output dir path_results = "denoised_merged" dir.create(path_results, showWarnings = FALSE) print("# Denoising ...") # learn the error rates errF = learnErrors(filtR1, multithread = TRUE) errR = learnErrors(filtR2, multithread = TRUE) # make error rate figures pdf(file.path(path_results, "Error_rates_R1.pdf")) print( plotErrors(errF) ) dev.off() pdf(file.path(path_results, "Error_rates_R2.pdf")) print( plotErrors(errR) ) dev.off() # Sample inference and merger of paired-end reads mergers = vector("list", length(sample_names)) names(mergers) = sample_names for(sample in sample_names) { cat("\n -- Processing:", sample, "\n") derepF = derepFastq(filtR1[[sample]]) ddF = dada(derepF, err = errF, multithread = TRUE) derepR = derepFastq(filtR2[[sample]]) ddR = dada(derepR, err = errR, multithread = TRUE) merger = mergePairs(ddF, derepF, ddR, derepR) mergers[[sample]] = merger } rm(derepF); rm(derepR) gc() saveRDS(mergers, (file.path(path_results, "mergers.rds"))) # make sequence table ASV_tab = makeSequenceTable(mergers) #write RDS object saveRDS(ASV_tab, (file.path(path_results, "rawASV_table.rds"))) # make sequence count report getN = function(x) sum(getUniques(x)) #remove 0 seqs samples from qfilt statistics row_sub = apply(qfilt, 1, function(row) all(row !=0 )) qfilt = qfilt[row_sub, ] seq_count = cbind(qfilt, sapply(mergers, getN)) colnames(seq_count) = c("input", "qualFiltered", "denoised_and_merged") rownames(seq_count) = sample_names write.csv(seq_count, file.path(path_results, "seq_count_summary.csv"), row.names = TRUE, quote = FALSE) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# print("--------") setwd(wd) i = i + 1 } } ____________________________________________________ .. _remove_chimerasCOI: Chimera filtering ~~~~~~~~~~~~~~~~~ | Remove putative chimeras with DADA2 'consensus' mode. | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. code-block:: R :caption: remove chimeras in DADA2 :emphasize-lines: 14-20, 97-100 :linenos: #!/usr/bin/Rscript ## workflow to perform chimera filtering within DADA2 # load libraries library('dada2') library('openssl') # chimera filtering method method = "consensus" # collapse ASVs that have no mismatshes or internal indels (identical up to shifts and/or length) collapseNoMismatch = "true" #true/false # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) print(paste0("Working with ", dirs[i])) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# # load denoised and merged ASVs rawASV_table = readRDS("denoised_merged/rawASV_table.rds") # create output dir path_results="ASV_table" dir.create(path_results, showWarnings = FALSE) # Remove chimeras print("Removing chimeric ASVs ...") chim_filt = removeBimeraDenovo( rawASV_table, method = method, multithread = TRUE, verbose = TRUE) saveRDS(chim_filt, "ASV_table/chim_filt.rds") ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(chim_filt) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table tchim_filt = t(chim_filt) # add sequences to 1st column tchim_filt = cbind(row.names(tchim_filt), tchim_filt) colnames(tchim_filt)[1] = "Sequence" # row names as sequence headers row.names(tchim_filt) = asv_headers # write ASVs.fasta to path_results asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, file.path(path_results, "ASVs.fasta")) # write ASVs table to path_results write.table(tchim_filt, file.path(path_results, "ASV_table.txt"), sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) ### collapse ASVs that have no mismatshes or internal indels # (identical up to shifts and/or length) if (collapseNoMismatch == "true") { print("Collapsing identical ASVs ...") ASV_tab_collapsed = collapseNoMismatch(chim_filt, minOverlap = 20, orderBy = "abundance", identicalOnly = FALSE, vec = TRUE, band = -1, verbose = TRUE) saveRDS(ASV_tab_collapsed, file.path(path_results, "ASV_table_collapsed.rds")) ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(ASV_tab_collapsed) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table tASV_tab_collapsed = t(ASV_tab_collapsed) # add sequences to 1st column tASV_tab_collapsed = cbind(row.names(tASV_tab_collapsed), tASV_tab_collapsed) colnames(tASV_tab_collapsed)[1] = "Sequence" #row names as sequence headers row.names(tASV_tab_collapsed) = asv_headers # write ASVs.fasta to path_results asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, file.path(path_results, "ASVs_collapsed.fasta")) # write ASVs table to path_results write.table(tASV_tab_collapsed, file.path(path_results, "ASVs_table_collapsed.txt"), sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) # print summary print(paste0("Output = ", length(colnames(ASV_tab_collapsed)), " chimera filtered (+collapsed) ASVs, with a total of ", sum(rowSums(ASV_tab_collapsed)), " sequences.")) print("--------") } else { # print summary print(paste0("Output = ", length(colnames(chim_filt)), " chimera filtered ASVs, with a total of ", sum(rowSums(chim_filt)), " sequences.")) print("--------") } #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# setwd(wd) i = i + 1 } } ____________________________________________________ .. _tagjumpsCOI: Remove tag-jumps ~~~~~~~~~~~~~~~~ | Remove putative tag-jumps with UNCROSS2. | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. code-block:: R :caption: removing putative tag-jumps with UNCROSS2 method :emphasize-lines: 15-21, 115-119 :linenos: #!/usr/bin/Rscript ## Script to perform tag-jump removal; (C) Vladimir Mikryukov, # edit, Sten Anslan # load libraries library(data.table) # set parameters set_f = 0.03 # f-parameter of UNCROSS (e.g., 0.03) set_p = 1 # p-parameter (e.g., 1.0) # output dir path_results="ASV_table" # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) print(paste0("Working with ", dirs[i])) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# # load ASV table # loading ASV_table_collapsed if collapseNoMismatch was "true" (above) if (file.exists("ASV_table/ASV_table_collapsed.rds") == TRUE) { tab = readRDS("ASV_table/ASV_table_collapsed.rds") cat("input table = ASV_table/ASV_table_collapsed.rds\n") } else { # loading chimera filtered ASV table tab = readRDS("ASV_table/chim_filt.rds") cat("input table = ASV_table/chim_filt.rds\n") } # format ASV table ASVTABW = as.data.table(t(tab), keep.rownames = TRUE) colnames(ASVTABW)[1] = "ASV" # convert to long format ASVTAB = melt(data = ASVTABW, id.vars = "ASV", variable.name = "SampleID", value.name = "Abundance") # remove zero-OTUs ASVTAB = ASVTAB[ Abundance > 0 ] # estimate total abundance of sequence per plate ASVTAB[ , Total := sum(Abundance, na.rm = TRUE), by = "ASV" ] ## UNCROSS score uncross_score = function(x, N, n, f = 0.01, tmin = 0.1, p = 1){ z = f * N / n # Expected treshold sc = 2 / (1 + exp(x/z)^p) # t-score res = data.table(Score = sc, TagJump = sc >= tmin) return(res) } # esimate UNCROSS score cat(" estimating UNCROSS score\n") ASVTAB = cbind( ASVTAB, uncross_score( x = ASVTAB$Abundance, N = ASVTAB$Total, n = length(unique(ASVTAB$SampleID)), f = as.numeric(set_f), p = as.numeric(set_p) ) ) cat(" number of tag-jumps: ", sum(ASVTAB$TagJump, na.rm = TRUE), "\n") # tag-jump stats TJ = data.table( Total_reads = sum(ASVTAB$Abundance), Number_of_TagJump_Events = sum(ASVTAB$TagJump), TagJump_reads = sum(ASVTAB[ TagJump == TRUE ]$Abundance, na.rm = T)) TJ$ReadPercent_removed = with(TJ, (TagJump_reads / Total_reads * 100)) fwrite(x = TJ, file = "ASV_table/TagJump_stats.txt", sep = "\t") # prepare ASV tables, remove tag-jumps ASVTAB = ASVTAB[ TagJump == FALSE ] # convert to wide format RES = dcast(data = ASVTAB, formula = ASV ~ SampleID, value.var = "Abundance", fill = 0) # sort rows (by total abundance) clz = colnames(RES)[-1] otu_sums = rowSums(RES[, ..clz], na.rm = TRUE) RES = RES[ order(otu_sums, decreasing = TRUE) ] # output table that is compadible with dada2 output = as.matrix(RES, sep = "\t", header = TRUE, rownames = 1, check.names = FALSE, quote = FALSE) output = t(output) saveRDS(output, ("ASV_table/ASV_table_TagJumpFiltered.rds")) ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(output) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table toutput = t(output) # add sequences to 1st column toutput = cbind(row.names(toutput), toutput) colnames(toutput)[1] = "Sequence" #row names as sequence headers row.names(toutput) = asv_headers # write ASVs.fasta to path_results asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, file.path(path_results, "ASVs_TagJumpFiltered.fasta")) # write ASVs table to path_results write.table(toutput, file.path(path_results, "ASV_table_TagJumpFiltered.txt"), sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) # print summary print(paste0("Output = ", length(colnames(output)), " ASVs, with a total of ", sum(rowSums(output)), " sequences.")) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# print("--------") setwd(wd) i = i + 1 } } ____________________________________________________ .. _mergeRunsCOI: Merge sequencing runs ~~~~~~~~~~~~~~~~~~~~~ | If previous processing was applied on :ref:`multiple sequencing runs ` , then here, | merge those sequenceing runs to form a single, unified ASV table. | Assuming that tag-jump filtering was performed (inputs = ASV_table_TagJumpFiltered.rds) .. code-block:: R :caption: merge ASV tables from multiple sequencing runs :emphasize-lines: 1-88 :linenos: #!/usr/bin/Rscript ## Merge sequencing runs, if working with multiple ones # load libraries library('dada2') # after merging multiple ASV tables ... # collapse ASVs that have no mismatshes or internal indels collapseNoMismatch = "true" #true/false # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) tables = c() # load tables from multiple sequencing runs (dirs) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) tables = append(tables, print(file.path(paste0(dirs[i], "/ASV_table"), "ASV_table_TagJumpFiltered.rds"))) setwd(wd) i = i + 1 } } # Merge multiple ASV tables print("# Merging multiple ASV tables ...") ASV_tables = lapply(tables, readRDS) merged_table = mergeSequenceTables(tables = ASV_tables, repeats = "sum", tryRC = FALSE) ### collapse ASVs that have no mismatshes or internal indels if (collapseNoMismatch == "true") { print("# Collapsing identical ASVs ...") merged_table_collapsed = collapseNoMismatch(merged_table, minOverlap = 20, orderBy = "abundance", identicalOnly = FALSE, vec = TRUE, band = -1, verbose = TRUE) saveRDS(merged_table_collapsed, "merged_table_collapsed.rds") ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(merged_table_collapsed) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table tmerged_table_collapsed = t(merged_table_collapsed) # add sequences to 1st column tmerged_table_collapsed = cbind(row.names(tmerged_table_collapsed), tmerged_table_collapsed) colnames(tmerged_table_collapsed)[1] = "Sequence" #row names as sequence headers row.names(tmerged_table_collapsed) = asv_headers # write ASVs.fasta asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, "ASVs.merged_collapsed.fasta") # write ASVs table write.table(tmerged_table_collapsed, "ASV_table.merged_collapsed.txt", sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) # print summary print(paste0("Output = ", length(colnames(merged_table_collapsed)), " ASVs, with a total of ", sum(rowSums(merged_table_collapsed)), " sequences.")) } else { saveRDS(merged_table, "merged_table.rds") ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(merged_table) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table tmerged_table = t(merged_table) # add sequences to 1st column tmerged_table = cbind(row.names(tmerged_table), tmerged_table) colnames(tmerged_table)[1] = "Sequence" #row names as sequence headers row.names(tmerged_table) = asv_headers # write ASVs.fasta to path_results asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, "ASVs.merged.fasta") # write ASVs table to path_results write.table(tmerged_table, "ASV_table.merged.txt", sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) # print summary print(paste0("Output = ", length(colnames(merged_table)), " ASVs, with a total of ", sum(rowSums(merged_table)), " sequences.")) } ____________________________________________________ .. _taxAssignCOI: Taxonomy assignment ~~~~~~~~~~~~~~~~~~~ | Taxonomy assignment with SINTAX (vsearch) against `BOLDistilled database. `_ | **---** `Download the latest BOLDistilled database for SINTAX here `_ **---** .. code-block:: bash :caption: assign taxonomy with SINTAX :linenos: #!/bin/bash # download the BOLDistilled reference database wget \ "https://us-sea-1.linodeobjects.com/boldistilled/sintax.zip" # unzip the database unzip sintax.zip # specify reference database for SINTAX reference_database="sintax/BOLDistilled_COI_Oct2025_SEQUENCES_sintax.fasta" reference_database=$(realpath $reference_database) # get database names with full path # specify input fasta file cd ASV_table ASV_fasta="ASVs_TagJumpFiltered.fasta" ASV_fasta_tmp="ASVs_TagJumpFiltered_minmax.fasta" # select by size to only retain ASVs that are within the range of expected variation. In this case we set it to 400 up to 430 bps vsearch --fastx_filter $ASV_fasta \ --fastq_minlen 400 \ --fastq_maxlen 430 \ --fastaout $ASV_fasta_tmp mv $ASV_fasta_tmp $ASV_fasta # Run SINTAX classification time vsearch --sintax $ASV_fasta \ --db $reference_database \ --tabbedout SINTAX.taxonomy.txt \ --sintax_cutoff 0.8 \ --threads 16 ____________________________________________________ .. _sorttaxaCOI: Get target taxa ~~~~~~~~~~~~~~~ | This part filters the ASV dataset to include only target taxonomic group for the following analyses. | For example, if you are interested in Hymenoptera, then discard all ASVs that do not match to the target taxon based on the user defined threshold (default = 0.8). | | :ref:`See below if no pre-selection is preferred ` .. code-block:: R :caption: get only target taxon annotations :linenos: #!/usr/bin/env Rscript ### Filter dataset based on SINTAX results to include target taxa # specify taxon and threshold taxon="Animalia" # target taxonomic group(s); # multiple groups should be from the same taxonomic level # separator is "," (e.g., "Hymenoptera, Lepidoptera") tax_level="kingdom" # allowed levels: kingdom | phylum | class | order | family | genus threshold="0.8" # threshold for considering an ASV as a target taxon class_threshold = 0.8 # threshold for class level identification # specify the ASV table and ASVs.fasta file that would be filtered to include only target taxa ASV_fasta = "ASVs_TagJumpFiltered.fasta" ASV_table = "ASV_table_TagJumpFiltered.txt" # specify the SINTAX-classifier output file (taxonomy file) taxtab="SINTAX.taxonomy.txt" #--------------------------------------# library(stringr) library(dplyr) library(Biostrings) # Function to parse SINTAX taxonomy format from vsearch output parse_sintax = function(tax_string) { # Initialize result with NAs result = list( kingdom = NA, kingdom_conf = 0, phylum = NA, phylum_conf = 0, class = NA, class_conf = 0, order = NA, order_conf = 0, family = NA, family_conf = 0, genus = NA, genus_conf = 0, species = NA, species_conf = 0 ) if (is.na(tax_string) || tax_string == "" || tax_string == "*") { return(result) } # Split by comma ranks = strsplit(tax_string, ",")[[1]] for (rank in ranks) { # Extract rank prefix (d:, k:, p:, c:, o:, f:, g:, s:) if (grepl("^d:", rank)) { # Domain (skip, not used) next } else if (grepl("^k:", rank)) { # Kingdom match = regmatches(rank, regexec("k:([^(]+)\\(([0-9.]+)\\)", rank))[[1]] if (length(match) == 3) { result$kingdom = match[2] result$kingdom_conf = as.numeric(match[3]) } } else if (grepl("^p:", rank)) { # Phylum match = regmatches(rank, regexec("p:([^(]+)\\(([0-9.]+)\\)", rank))[[1]] if (length(match) == 3) { result$phylum = match[2] result$phylum_conf = as.numeric(match[3]) } } else if (grepl("^c:", rank)) { # Class match = regmatches(rank, regexec("c:([^(]+)\\(([0-9.]+)\\)", rank))[[1]] if (length(match) == 3) { result$class = match[2] result$class_conf = as.numeric(match[3]) } } else if (grepl("^o:", rank)) { # Order match = regmatches(rank, regexec("o:([^(]+)\\(([0-9.]+)\\)", rank))[[1]] if (length(match) == 3) { result$order = match[2] result$order_conf = as.numeric(match[3]) } } else if (grepl("^f:", rank)) { # Family match = regmatches(rank, regexec("f:([^(]+)\\(([0-9.]+)\\)", rank))[[1]] if (length(match) == 3) { result$family = match[2] result$family_conf = as.numeric(match[3]) } } else if (grepl("^g:", rank)) { # Genus match = regmatches(rank, regexec("g:([^(]+)\\(([0-9.]+)\\)", rank))[[1]] if (length(match) == 3) { result$genus = match[2] result$genus_conf = as.numeric(match[3]) } } else if (grepl("^s:", rank)) { # Species match = regmatches(rank, regexec("s:([^(]+)\\(([0-9.]+)\\)", rank))[[1]] if (length(match) == 3) { result$species = match[2] result$species_conf = as.numeric(match[3]) } } } return(result) } # read ASV table table = read.table(ASV_table, sep = "\t", check.names = F, header = T, row.names = 1) # read SINTAX taxonomy table (vsearch --sintax output format) # Format: ASV_ID \t taxonomy_string \t strand \t other_columns tax_raw = read.table(taxtab, sep = "\t", check.names = F, header = F, stringsAsFactors = F, quote = "", comment.char = "", fill = TRUE) # Take first two columns only (ASV_ID and taxonomy) tax_raw = tax_raw[, 1:2] colnames(tax_raw) = c("ASV", "taxonomy") rownames(tax_raw) = tax_raw$ASV cat("\n Input =", nrow(tax_raw), "features.\n") # Parse SINTAX taxonomy strings tax_list = lapply(tax_raw$taxonomy, parse_sintax) tax = do.call(rbind, lapply(tax_list, as.data.frame)) rownames(tax) = tax_raw$ASV # taxon list taxon_list = strsplit(taxon, ", ")[[1]] ### extract only target-taxon ASVs from the 'raw' SINTAX results tax_filtered = tax %>% filter(.data[[tax_level]] %in% taxon_list) cat("\n Found", nrow(tax_filtered), "ASVs matching", taxon, "at", tax_level, "level.\n") ### Apply additional filter: class must be identified (confidence >= class_threshold) tax_filtered = tax_filtered %>% filter(class_conf >= class_threshold & !is.na(class)) cat(" After filtering for class identification (threshold >=", class_threshold, "):", nrow(tax_filtered), "ASVs retained.\n") ### change all tax ranks to "unclassified_*" when # the confidence values is less than the specified threshold # kingdom tax_filtered = tax_filtered %>% mutate(kingdom = ifelse(kingdom_conf < threshold | is.na(kingdom), "unclassified_root", as.character(kingdom))) # phylum tax_filtered = tax_filtered %>% mutate(phylum = ifelse(phylum_conf < threshold | is.na(phylum), paste0("unclassified_", kingdom), as.character(phylum))) tax_filtered$phylum = stringr::str_replace(tax_filtered$phylum, "unclassified_unclassified_", "unclassified_") # class tax_filtered = tax_filtered %>% mutate(class = ifelse(class_conf < threshold | is.na(class), paste0("unclassified_", phylum), as.character(class))) tax_filtered$class = stringr::str_replace(tax_filtered$class, "unclassified_unclassified_", "unclassified_") # order tax_filtered = tax_filtered %>% mutate(order = ifelse(order_conf < threshold | is.na(order), paste0("unclassified_", class), as.character(order))) tax_filtered$order = stringr::str_replace(tax_filtered$order, "unclassified_unclassified_", "unclassified_") # family tax_filtered = tax_filtered %>% mutate(family = ifelse(family_conf < threshold | is.na(family), paste0("unclassified_", order), as.character(family))) tax_filtered$family = stringr::str_replace(tax_filtered$family, "unclassified_unclassified_", "unclassified_") # genus tax_filtered = tax_filtered %>% mutate(genus = ifelse(genus_conf < threshold | is.na(genus), paste0("unclassified_", family), as.character(genus))) tax_filtered$genus = stringr::str_replace(tax_filtered$genus, "unclassified_unclassified_", "unclassified_") # species to genus_sp when the confidence values is < 0.9 tax_filtered = tax_filtered %>% mutate(species = ifelse(species_conf < 0.9 | is.na(species), paste0(genus, "_sp"), as.character(species))) ### count occurrences of each taxon in df (SINTAX results) count_taxa = function(df, taxa) { sapply(taxa, function(taxon) sum(apply(df, 1, function(row) any(row == taxon)))) } taxon_counts = count_taxa(tax_filtered, taxon_list) # Check the counts if (all(taxon_counts == 0)) { print("ERROR: None of the specified taxa are present in the SINTAX results.") } else { if (any(taxon_counts == 0)) { warning("One or more of the specified taxa are not present in the SINTAX results.") } cat("\n Taxon counts:\n") print(taxon_counts) } ### extract only target-taxon ASVs from the 'threshold filtered' SINTAX results tax_filtered_thresh = tax_filtered %>% filter(.data[[tax_level]] %in% taxon_list) # Remove confidence columns for output tax_filtered_output = tax_filtered_thresh %>% select(kingdom, phylum, class, order, family, genus, species) # write filtered SINTAX taxonomy table tax_filtered_output = cbind(ASV = rownames(tax_filtered_output), tax_filtered_output) write.table(tax_filtered_output, file = "SINTAX.taxonomy.filt.txt", quote = F, row.names = F, sep = "\t") ### filter the ASV table to match ASVs that were kept in the tax_filtered table table_filt = table[rownames(table) %in% rownames(tax_filtered_thresh), ] ### check ASV table; if 1st col is sequence, then remove it for metaMATE if (colnames(table_filt)[1] == "Sequence") { cat("\n;; 1st column was 'Sequence', removing this ... \n") table_filt = table_filt[, -1] } # write filtered table table_filt = cbind(ASV = rownames(table_filt), table_filt) write.table(table_filt, file = paste0(sub("\\.[^.]*$", "_tax_filt.txt", ASV_table)), quote = F, row.names = F, sep = "\t") # filter ASV_fasta fasta = readDNAStringSet(ASV_fasta) fasta.tax_filt = fasta[names(fasta) %in% rownames(table_filt)] # write filtered ASV_fasta writeXStringSet(fasta.tax_filt, paste0(sub("\\.[^.]*$", "_tax_filt.fasta", ASV_fasta)), width = max(width(fasta.tax_filt))) ____________________________________________________ .. _donotsorttaxaCOI: **If no pre-selection is preferred, then just remove "Sequence" column from the ASV table** .. code-block:: R :caption: remove "Sequence" column from the ASV table :linenos: # read ASV table ASV_table = "ASV_table.txt" table = read.table(ASV_table, sep = "\t", check.names = F, header = T, row.names = 1) # check ASV table; if 1st col is sequence, then remove it for metaMATE if (colnames(table)[1] == "Sequence") { cat("## removing 'Sequence' column ... \n") table = table[, -1] # write filtered table table_filt = cbind(ASV = rownames(table), table) write.table(table_filt, file = paste0(sub("\\.[^.]*$", ".noSeq.txt", ASV_table)), quote = F, row.names = F, sep = "\t") } else { cat("## there was no 'Sequence' column; proceed with the current table ... \n") } ____________________________________________________ .. _numtsCOI: Remove NUMTs ~~~~~~~~~~~~ | Remove putative NUMTs with metaMATE. | This follows the workflow to automatically filter the ASVs by retaining maximum of 5% of estimated non-authentic-ASVs (verifiednonauthentic_retained_p < 0.05). .. important:: 1. metaMATE expects specifications file that states the filtering strategies. See `more info here. `_ Here, we will be using the metaMATE's `default specifications.txt file. `_ 1. metaMATE requires a reference COI database to determine verified-authentic ASVs. Herein using `BOLDistilled database. `_ --- `Download the latest BOLDistilled database here (click) `_ --- If you have your own set of reference sequences, then use those; or merge those with the other databases (such as the above one) to extend the ref. database. Check `standard genetic codes here `_ for ``genetic_code`` setting below. .. code-block:: bash :caption: get required specifications file and ref database :linenos: #!/bin/bash # download the default specifications file, # using this in metaMATE-find wget "https://raw.githubusercontent.com/tjcreedy/metamate/main/specifications.txt" # specify specifications file for metaMATE specifications="specifications.txt" specifications=$(realpath $specifications) # get full directory path # download the BOLDistilled reference databse wget \ "https://us-sea-1.linodeobjects.com/boldistilled/sintax.zip" # unzip the database and edit name unzip sintax.zip # specify reference database for metaMATE reference_database="sintax/BOLDistilled_COI_Oct2025_SEQUENCES_sintax.fasta" reference_database=$(realpath $reference_database) # get full directory path .. code-block:: bash :caption: cluster ASVs at a 90% similarity threshold for abundance filtering :linenos: #!/bin/bash ## run metaMATE-find # cluster with 10% threshold vsearch --cluster_fast ASVs_TagJumpFiltered_tax_filt.fasta --id 0.9 \ --uc ASVs_TagJumpFiltered_tax_filt_clustered.uc # select only H & S cat ASVs_TagJumpFiltered_tax_filt_clustered.uc | grep -v "^C" \ > ASVs_TagJumpFiltered_tax_filt_clustered_onlyHS.uc # now extract the information to match the input requirements from metamate awk -F'\t' 'BEGIN {OFS=","} {print $9, $2}' ASVs_TagJumpFiltered_tax_filt_clustered_onlyHS.uc \ > ASV_to_cluster_map.csv .. code-block:: bash :caption: run metaMATE-find :linenos: #!/bin/bash ## run metaMATE-find ## go to the directory that hosts your ASVs.fasta and ASV table files. # specify input ASVs table and fasta ASV_table="ASV_table_TagJumpFiltered_tax_filt.txt" # make sure that the 2nd col is not "Sequence" ASV_fasta="ASVs_TagJumpFiltered_tax_filt.fasta" # specify ASVs fasta file taxgroups="ASV_to_cluster_map.csv" # comment out or change filename if sequence binning is done in another way # specify variables genetic_code="5" # the standard genetic code. 5 is invertebrate mitochondrial code length="418" # the expected length of an amplicon basesvariation="9" # allowed length variation (bp) from the expected length of an amplicon taxgroups="undefined" # (optional); if sequence binning is to be performed on # a per-taxon basis (as in specifications file) # then specify the taxon grouping file NA_abund_thresh="0.05" # verifiednonauthentic_retained_p < 0.05 (value from mateMATE results); # the allowed abundance threshold of # non-validated OTUs/ASVs in the filtered dataset. abundance_filt="TRUE" # TRUE/FALSE; if FALSE, then NA_abund_thresh is ineffective, # and no filtering is done based on the ASV abundances, # i.e., filter only based on length, basesvariation and genetic_code. # FALSE may be used when the seq-depth for the target taxa is low. # If TRUE, then NA_abund_thresh will be applied. ## # check if taxgroups is specified, if not then this var is empty. if [[ $taxgroups != "undefined" ]]; then taxgroups=$"--taxgroups $taxgroups" else taxgroups=$"" fi #output dir output_dir=$"metamate_out" echo "output_dir = $output_dir" # remove old $output_dir if exists if [[ -d $output_dir ]]; then rm -rf $output_dir fi # if perfoming clade binning, then WARNING when processing more than 65,536 ASVs ASVcount=$(grep -c "^>" $ASV_fasta) if (( $ASVcount > 65536 )); then printf '%s\n' "WARNING]: clade binning NOT performed, because the input ASVs limit is 65,536 for that. Current input has $ASVcount ASVs." fi # check abundance_filt; # if FALSE then make new specifications file, that excludes abundance filtering if [[ $abundance_filt == "FALSE" ]]; then printf '%s\n' "[library; n; 0-1/2]" > specifications0.txt specifications=$(realpath specifications0.txt) fi # quick check of the specifications file, has to contain "library" | "total" | "clade" | "taxon" if ! grep -q -e "library" -e "total" -e "clade" -e "taxon" $specifications; then printf '%s\n' "ERROR]: specifications file seems to be wrong. Does not contain any of the terms (library, total, clade, taxon)." fi ### metaMATE-find printf "# Running metaMATE-find\n" metamate find \ --asvs $ASV_fasta \ --readmap $ASV_table \ --specification $specifications \ --references $reference_database \ --expectedlength $length \ --basesvariation $basesvariation \ --table $genetic_code \ --threads 8 \ --output $output_dir \ --overwrite $taxgroups \ --realign # check for the presence of "metamate_out" dir and "resultcache" file (did metaMATE-find finish) if [[ -d $output_dir ]] && [[ -e $output_dir/resultcache ]] && [[ -e $output_dir/results.csv ]]; then printf '\n%s\n\n' "metaMATE-find finished, proceed" # export variables for below script (Rscript) if [[ $abundance_filt != "FALSE" ]]; then printf '%s\n' "exporting NA_abund_thresh of $NA_abund_thresh for metaMATE-dump" export NA_abund_thresh else export output_dir export abundance_filt fi else printf '%s\n' "ERROR]: cannot find the $output_dir (metaMATE-find output) to start metaMATE-dump OR no authentic ASVs found??" fi .. code-block:: bash :caption: get the results_index from the metamate_out/results.csv file :linenos: #!/usr/bin/env Rscript ## read results.csv output_dir = Sys.getenv('output_dir') # = "metamate_out" as specified above find_results = read.csv(file.path(output_dir, "results.csv")) # get variables abundance_filt = Sys.getenv('abundance_filt') ## filter results if abundance_filt is FALSE if (abundance_filt == "FALSE"){ result_index = "0" # get first result_index (library_n = 0) write(result_index, file.path(output_dir, "selected_result_index.txt")) } ## filter results based on NA_abund_thresh if (abundance_filt != "FALSE"){ NA_abund_thresh = as.numeric(Sys.getenv('NA_abund_thresh')) filtered_data = find_results[ find_results$verifiednonauthentic_retained_p <= NA_abund_thresh, ] # if no results correspond with the NA_abund_thresh, then get the next best # else, just select the result_index that corresponds to # NA_abund_thresh with highest accuracy_score if (nrow(filtered_data) == 0) { cat( "\n no results correspond with the NA_abund_thresh of", NA_abund_thresh, "; getting the next best setting\n" ) next_best = min(find_results$verifiednonauthentic_retained_p) filtered_data = find_results[ find_results$verifiednonauthentic_retained_p <= next_best, ] # sort based on accuracy_score sorted_filtered = filtered_data[order(-filtered_data$accuracy_score), ] # get the result with the highest accuracy_score metamate_selected_threshold = sorted_filtered[1,] write.csv(metamate_selected_threshold, file.path(output_dir, "next_best_set.csv"), quote = F) # the result_index of the NA_abund_thresh with the highest accuracy_score result_index = metamate_selected_threshold[,1] write(result_index, file.path(output_dir, "selected_result_index.txt")) } else { # sort based on accuracy_score sorted_filtered = filtered_data[order(-filtered_data$accuracy_score), ] # get the result with the highest accuracy_score metamate_selected_threshold = sorted_filtered[1,] # the result_index of the NA_abund_thresh with the highest accuracy_score result_index = metamate_selected_threshold[,1] write(result_index, file.path(output_dir, "selected_result_index.txt")) } } .. code-block:: bash :caption: run metaMATE-dump to discard putative artefact ASVs :linenos: #!/bin/bash ## metaMATE-dump ASV_fasta=$(basename $ASV_fasta) # read result_index read -r result_index < $output_dir/selected_result_index.txt printf '%s\n' " - selected result_index = $result_index" # run metaMATE-dump printf '%s\n' "# Running metaMATE-dump" metamate dump \ --asvs $ASV_fasta \ --resultcache $output_dir/resultcache \ --output $output_dir/${ASV_fasta%.*}_metaMATE.filt \ --overwrite \ --resultindex $result_index # generate a list of ASV IDs seqkit seq -n $output_dir/${ASV_fasta%.*}_metaMATE.filt.fasta > \ $output_dir/${ASV_fasta%.*}_metaMATE.filt.list # filter the ASV table; include only the ASVs that are in ${ASV_fasta%.*}_metaMATE.filt.list awk -v var="$output_dir/${ASV_fasta%.*}" 'NR==1; NR>1 {print $0 | \ "grep -Fwf "var"_metaMATE.filt.list"}' $ASV_table > \ $output_dir/${ASV_table%.*}_metaMATE.filt.txt # filter the sintax.taxonomy.filt.txt file to include only ASVs retained by metaMATE awk -v var="$output_dir/${ASV_fasta%.*}" 'NR==1; NR>1 {print $0 | \ "grep -Fwf "var"_metaMATE.filt.list"}' sintax.taxonomy.filt.txt > \ $output_dir/sintax.taxonomy.metaMATE.filt.txt # write discarded ASVs list comm -3 <(sort <(seqkit seq -n $ASV_fasta)) \ <(sort $output_dir/${ASV_fasta%.*}_metaMATE.filt.list) \ > $output_dir/metaMATE.discarded.list .. code-block:: bash :caption: optionally rescue discarded ASVs that have GENUS level bootstrap value > 0.9 :linenos: #!/bin/bash # get discarded ASVs (sintax taxonomy list) grep -Fwf $output_dir/metaMATE.discarded.list sintax.taxonomy.filt.txt \ > $output_dir/metaMATE.discarded.sintax.taxonomy.txt # get the rescued ASVs that have GENUS level bootstrap value >= 0.9 awk -F'\t' '$26 >= 0.9' $output_dir/metaMATE.discarded.sintax.taxonomy.txt \ > $output_dir/rescued.txt # check if rescued.txt exists and is not empty if [[ -s $output_dir/rescued.txt ]]; then # add the rescued ASVs to $output_dir/sintax.taxonomy.metaMATE.filt.txt cat $output_dir/rescued.txt >> $output_dir/sintax.taxonomy.metaMATE.filt.txt # add the rescued ASVs to $output_dir/${ASV_fasta%.*}_metaMATE.filt.fasta seqkit grep -w 0 -f <(awk -F'\t' '{print $1}' $output_dir/rescued.txt) $ASV_fasta \ >> $output_dir/${ASV_fasta%.*}_metaMATE.filt.fasta # add the rescued ASVs to $output_dir/${ASV_table%.*}_metaMATE.filt.txt grep -wf <(awk -F'\t' '{print $1}' $output_dir/rescued.txt) $ASV_table \ >> $output_dir/${ASV_table%.*}_metaMATE.filt.txt printf '%s\n' "Rescued $(wc -l < $output_dir/rescued.txt) ASVs" else printf '%s\n' "No ASVs to rescue" fi .. note:: Herein case, the final filtered data is ``ASV_table_tax_filt_metaMATE.filt.txt`` and ``ASVs_tax_filt_metaMATE.filt.fasta`` in the ``metamate_out`` directory. The filtered SINTAX-classifier results (matching the ASVs in the latter files) is ``sintax.taxonomy.metaMATE.filt.txt`` in the ``metamate_out`` dir. If deemed relevant, then you may proceed with the below workflow below that includes clustering ASVs to OTUs. ____________________________________________________ .. _clusteringCOI: Clustering ASVs to OTUs ~~~~~~~~~~~~~~~~~~~~~~~ | This step clusters ASVs to OTUs with vsearch. .. code-block:: R :caption: get the size of ASVs :linenos: #!/usr/bin/env Rscript # specify input ASVs table and fasta ASV_table="ASV_table_tax_filt_metaMATE.filt.txt" # specify ASV table file ASV_fasta="ASVs_tax_filt_metaMATE.filt.fasta" # specify ASVs fasta file ################################ library(Biostrings) # Read the ASV table ASV_table = read.table(ASV_table, sep = "\t", check.names = F, header = T, row.names = 1) # add 'sum' column ASV_table$sum = rowSums(ASV_table) # make ASV_sums object ASV_sums = setNames(ASV_table$sum, rownames(ASV_table)) # Read the FASTA file ASV_fasta = readDNAStringSet(ASV_fasta) # add ";size=*" to ASV_fasta names(ASV_fasta) = sapply(names(ASV_fasta), function(header) { paste0(header, ";size=", ASV_sums[header]) }) # write fasta file writeXStringSet(ASV_fasta, "ASVs.size.fasta", width = max(width(ASV_fasta))) .. code-block:: bash :caption: clustering with vsearch :linenos: #!/bin/bash # specify the clustering threshold clustering_thresh="0.97" # make output dir. output_dir="OTU_table" mkdir -p $output_dir export output_dir ### cluster ASVs using vsearch. vsearch --cluster_fast ASVs.size.fasta \ --id $clustering_thresh \ --iddef 2 \ --sizein \ --xsize \ --fasta_width 0 \ --centroids $output_dir/OTUs.fasta \ --uc $output_dir/OTUs.uc .. code-block:: R :caption: generate an OTU table based on the clustered ASVs (.uc file). :linenos: #!/usr/bin/Rscript # specify input ASV table (the same one as for 'get the size of ASVs') ASV_table="ASV_table_tax_filt_metaMATE.filt.txt" # read output dir output_dir = Sys.getenv('output_dir') # read output from vsearch clustering (-uc OTU.uc) inp_UC = file.path(output_dir, "OTUs.uc") ################################ library(data.table) # load input data - ASV table ASV_table = fread(file = ASV_table, header = TRUE, sep = "\t") ## Load input data - UC mapping file UC = fread(file = inp_UC, header = FALSE, sep = "\t") UC = UC[ V1 != "S" ] UC[, ASV := tstrsplit(V9, ";", keep = 1) ] UC[, OTU := tstrsplit(V10, ";", keep = 1) ] UC[V1 == "C", OTU := ASV ] UC = UC[, .(ASV, OTU)] # convert ASV table to long format ASV = melt(data = ASV_table, id.vars = colnames(ASV_table)[1], variable.name = "SampleID", value.name = "Abundance") ASV = ASV[ Abundance > 0 ] # add colnames, to make sure 1st is 'ASV' colnames(ASV) = c("ASV", "SampleID", "Abundance") # add OTU IDs ASV = merge(x = ASV, y = UC, by = "ASV", all.x = TRUE) # summarize OTU = ASV[ , .(Abundance = sum(Abundance, na.rm = TRUE)), by = c("SampleID", "OTU")] # reshape OTU table to wide format OTU_table = dcast(data = ASV, formula = OTU ~ SampleID, value.var = "Abundance", fun.aggregate = sum, fill = 0) # write OTU table # OTU names correspond to most abundant ASV in an OTU fwrite(x = OTU_table, file = file.path(output_dir, "OTU_table.txt"), sep = "\t") .. _postclusteringCOI: Post-clustering ~~~~~~~~~~~~~~~ Post-cluster OTUs with LULU to merge consistently co-occurring 'daughter-OTUs'. .. code-block:: bash :caption: generate match list for post-clustering :linenos: #!/bin/bash # go to directrory that contains OTUs cd $output_dir # 'OTU_table' in this case # make blast database for post-clustering makeblastdb -in OTUs.fasta -parse_seqids -dbtype nucl # generate match list for post-clustering blastn -db OTUs.fasta \ -outfmt '6 qseqid sseqid pident' \ -out match_list.txt \ -qcov_hsp_perc 75 \ -perc_identity 90 \ -query OTUs.fasta \ -num_threads 20 .. code-block:: R :caption: run LULU post-clustering :linenos: #!/usr/bin/Rscript # specify minimum threshold of sequence similarity considering any OTU as an error of another min_match = "90" # specify OTU table OTU_table="OTU_table.txt" ################################ library(devtools) # load OTU table and match list otutable = read.table(OTU_table, header = T, row.names = 1, sep = "\t") matchlist = read.table("match_list.txt") curated_result = lulu::lulu(otutable, matchlist, minimum_match = min_match) # write post-clustered OTU table to file curated_table = curated_result$curated_table curated_table = cbind(OTU = rownames(curated_table), curated_table) write.table(curated_table, file ="OTU_table_LULU.txt", sep = "\t", row.names = F, quote = FALSE) write.table(curated_result$discarded_otus, file ="merged_units.lulu", col.names = FALSE, quote = FALSE) .. note:: Note that if the sample names start with a number, then the output OTU table may contain "X" prefix in the sample names. .. code-block:: bash :caption: match OTUs.fasta and taxonomy table with post-clustered table (OTU_table_LULU) :linenos: #!/bin/bash # specify post-clustered table OTU_table="OTU_table_LULU.txt" # specify pre post-clustered OTUs fasta file OTUs_fasta="OTUs.fasta" # get matching OTUs awk 'NR>1{print $1}' $OTU_table > OTUs_LULU.list cat $OTUs_fasta | \ seqkit grep -w 0 -f OTUs_LULU.list > OTUs_LULU.fasta # get matching sintax taxonomy results head -n 1 ../sintax.taxonomy.metaMATE.filt.txt > sintax.taxonomy.txt cat ../sintax.taxonomy.metaMATE.filt.txt | \ grep -wf OTUs_LULU.list >> sintax.taxonomy.txt # remove unnecessary files rm OTUs.fasta.n* # move OTU_table two directories down cd .. mv $output_dir ../.. .. note:: The final OTUs data is ``OTU_table_LULU.txt`` and ``OTUs_LULU.fasta`` in the ``OTU_table`` directory. The matching SINTAX taxonomy files are ``sintax.taxonomy.txt`` in the ``OTU_table`` directory. ____________________________________________________ We acknowledge `CSC - IT Center for Science `_, Finland, for computational resources while building and testing this workflows. ____________________________________________________ |logo_BGE_small| |eufund| |chfund| |ukrifund|