.. |logo_BGE_alpha| image:: _static/logo_BGE_alpha.png :width: 300 :alt: Alternative text :target: https://biodiversitygenomics.eu/ .. |eufund| image:: _static/eu_co-funded.png :width: 200 :alt: Alternative text .. |chfund| image:: _static/ch-logo-200x50.png :width: 210 :alt: Alternative text .. |ukrifund| image:: _static/ukri-logo-200x59.png :width: 150 :alt: Alternative text .. |logo_BGE_small| image:: _static/logo_BGE_alpha.png :width: 120 :alt: Alternative text :target: https://biodiversitygenomics.eu/ .. |main_interface| image:: _static/main_interface.png :width: 2000 :alt: Alternative text .. raw:: html .. role:: red .. raw:: html .. role:: yellow-background |logo_BGE_alpha| Fungi/ITS ********* This is executable step-by-step pipeline for **ITS2** amplicon data from Illumina sequencing machine. The bioinformatic workflow results in amplicon sequence variants (ASVs) and well as operational taxonomic units (OTUs). .. note:: The **full bioinformatics workflow can be automatically run through** `PipeCraft2 `_, a software package which represents a graphical user interface wrapper for all the bioinformatic steps below. `See the example workflow for ITS2 `_. |main_interface| .. admonition:: Citation of the pipeline When using this pipeline, please cite as: Anslan, S, Bahram, M, Hiiesalu, I, Tedersoo, L. PipeCraft: Flexible open-source toolkit for bioinformatics analysis of custom high-throughput amplicon sequencing data. Mol Ecol Resour. 2017; 17: e234-e240. https://doi.org/10.1111/1755-0998.12692 Also, please **cite the original resources of the wrapped software**. Dependencies ~~~~~~~~~~~~ +-----------------------------------------------------+----------+---------+ | Process | Software | Version | +=====================================================+==========+=========+ | :ref:`Remove primers ` | cutadapt | 4.4 | +-----------------------------------------------------+----------+---------+ | :ref:`Quality filtering ` | DADA2 | 2.28 | +-----------------------------------------------------+----------+---------+ | :ref:`Denoise ` | DADA2 | 2.28 | +-----------------------------------------------------+----------+---------+ | :ref:`Merge paired-end reads ` | DADA2 | 2.28 | +-----------------------------------------------------+----------+---------+ | :ref:`Chimera filtering ` | DADA2 | 2.28 | +-----------------------------------------------------+----------+---------+ | :ref:`Remove tag-jumps ` | UNCROSS2 | | +-----------------------------------------------------+----------+---------+ | :ref:`Extract ITS1/2 region ` | ITSx | 1.1.3 | +-----------------------------------------------------+----------+---------+ | :ref:`Merge sequencing runs* ` | | | +-----------------------------------------------------+----------+---------+ | :ref:`Taxonomy assignment ` | | | +-----------------------------------------------------+----------+---------+ | :ref:`Clustering ASVs to OTUs ` | vsearch | 2.28.1 | +-----------------------------------------------------+----------+---------+ | :ref:`Post-clusteringlustering ` | BLAST | 2.12.0+ | +-----------------------------------------------------+----------+---------+ | :ref:`Post-clusteringlustering ` | LULU | 0.1.0+ | +-----------------------------------------------------+----------+---------+ \*only applicable when there are multiple sequencing runs per study. .. note:: All the dependencies/software of the pipeline are available on a `Docker image `_. | Download `Docker for windows `_ | Download `Docker for Mac `_ | Install Docker for Linux - `follow the guidelines under appropriate Linux distribution `_ .. code-block:: bash :caption: get the Docker image docker pull pipecraft/bioscanflow:2 .. code-block:: bash :caption: example of running the pipeline via Docker image # run docker # specify the files location with -v flag ($PWD = the current working directory) docker run -i --tty -v $PWD/:/Files pipecraft/bioscanflow:2 # inside the container, the files are accessible in the /Files dir cd Files # checking if cutadapt is available cutadapt -h # ready to run the pipe as below ... ## make sure that via the shared folder (-v) path you have access also to the reference databases. Data structure ~~~~~~~~~~~~~~ .. _multiRunDirITS: Multiple sequencing runs ------------------------ .. important:: When aiming to combine samples from multiple sequencing runs, then follow the below directory structure **Directory structure:** | **/multiRunDir** *(directory names can be changed)* | ├── **/sequencing_set01** | │ ├── *sample1.R1.fastq* | │ ├── *sample1.R2.fastq* | │ ├── *sample2.R1.fastq* | │ ├── *sample2.R2.fastq* | │ ├── ... | ├── **/sequencing_set02** | │ ├── *sampleA.R1.fastq* | │ ├── *sampleA.R2.fastq* | │ ├── *sampleB.R1.fastq* | │ ├── *sampleB.R2.fastq* | │ ├── ... | └── **/sequencing_set03** | ├── *sample11.R1.fastq* | ├── *sample11.R2.fastq* | ├── *sample12.R1.fastq* | ├── *sample12.R2.fastq* | ├── ... .. note:: Fastq files with the **same name** will be considered as the same sample and will be merged in the "Merge sequencing runs" step. Single sequencing run --------------------- | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` | ____________________________________________________ .. _remove_primersITS: Remove primers ~~~~~~~~~~~~~~ | Remove primer strings from paired-end data. | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. note:: Here, assuming that all sequences are in 5'-3' orientation! *(3'-5' orient sequences will be discarded with this workflow)* .. important:: | - Paired-end files must contain "R1" and "R2" strings (not just _1 and _2)! | - Sample names must not contain "R1" and "R2" strings (i.e. not FR123_001_R1.fastq/FR123_001_R2.fastq) .. code-block:: bash :caption: remove primers with cutadapt :emphasize-lines: 31-36, 61-62 :linenos: #!/bin/bash ## workflow to remove primers via cutadapt # My working folder = /multiRunDir (see dir structure above) # specify the identifier string for the R1 files read_R1="_R1" ## specify primers # ITS2 primers fwd_primer=$"GTGARTCATCRARTYTTTG" #this is primer gITS7ngs rev_primer=$"CCTSCSCTTANTDATATGC" #this is primer ITS4ngsUni # ITS1 primers #fwd_primer=$"CTTGGTCATTTAGAGGAAGTAA" #this is primer ITS1F #rev_primer=$"GCTGCGTTCTTCATCGATGC" #this is primer ITS2 # edit primer trimming settings maximum_error_rate="2" # Maximum error rate in primer string search; # if set as 1, then allow 1 mismatch; # if set as 0.1, then allow mismatch in 10% of the bases, # i.e. if a primer is 20 bp then allowing 2 mismatches. overlap="17" # The minimum overlap length. Keep it nearly as high # as the primer length to avoid short random matches. pair_filter="both" # Option 'any' discards a read pair if primers are not found in # either of the read pairs (R1 and R2). # Option 'both' keeps the read pair if a primer is found in # at least one of the read pairs. # get directory names if working with multiple sequencing runs DIRS=$(ls -d *) # -> sequencing_set01 sequencing_set02 sequencing_set03 for sequencing_run in $DIRS; do printf "\nWorking with $sequencing_run \n" cd $sequencing_run #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# # make output dirs mkdir -p primersCut_out mkdir -p primersCut_out/untrimmed ### Clip primers with cutadapt for inputR1 in *$read_R1*; do inputR2=$(echo $inputR1 | sed -e 's/R1/R2/') cutadapt --quiet \ -e $maximum_error_rate \ --minimum-length 32 \ --overlap $overlap \ --no-indels \ --cores=0 \ --untrimmed-output primersCut_out/untrimmed/$inputR1 \ --untrimmed-paired-output primersCut_out/untrimmed/$inputR2 \ --pair-filter=$pair_filter \ -g $fwd_primer \ -G $rev_primer \ -o primersCut_out/$inputR1 \ -p primersCut_out/$inputR2 \ $inputR1 $inputR2 done #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# cd .. done ____________________________________________________ .. _quality_filteringITS: Quality filtering ~~~~~~~~~~~~~~~~~ | Quality filtering of the fastq files based on the allowed maximum error rate per sequence (as in DADA2). | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. code-block:: R :caption: quality filtering in DADA2 (in R) :emphasize-lines: 13-19, 67-71 :linenos: #!/usr/bin/Rscript ## workflow to perform quality filtering within DADA2 #load dada2 library library('dada2') # specify the identifier string for the R1 files read_R1 = "_R1" # get the identifier string for the R2 files read_R2 = gsub("R1", "R2", read_R1) # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) print(paste0("Working with ", dirs[i])) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# # output path path_results = "qualFiltered_out" # input and output file paths R1s = sort(list.files("primersCut_out", pattern = read_R1, full.names = TRUE)) R2s = sort(list.files("primersCut_out", pattern = read_R2, full.names = TRUE)) #sample names sample_names = sapply(strsplit(basename(R1s), read_R1), `[`, 1) # filtered files path filtR1 = file.path(path_results, paste0(sample_names, ".R1.", "fastq.gz")) filtR2 = file.path(path_results, paste0(sample_names, ".R2.", "fastq.gz")) names(filtR1) = sample_names names(filtR2) = sample_names #quality filtering qfilt = filterAndTrim(R1s, filtR1, R2s, filtR2, maxN = 0, # max number of allowed N bases. maxEE = c(2, 2), # max error rate per R1 and R2 read, respectively. truncQ = 2, # truncate reads at the first instance of a quality score less than or equal to specified value. truncLen = c(0, 0), # truncate reads after specified length for R1 and R2 reads, respectively. maxLen = 600, # discard reads longer than specified. minLen = 100, # discard reads shorter than specified. minQ = 2, # discard reads (after truncation) that contain a quality score below specified value. matchIDs = TRUE, # output paired-end reads with matching IDs (for merging). compress = TRUE, # gzip the output multithread = TRUE) # use multiple threads saveRDS(qfilt, file.path(path_results, "qfilt_reads.rds")) # make sequence count report seq_count = cbind(qfilt) colnames(seq_count) = c("input", "qualFiltered") seq_count = as.data.frame(seq_count) seq_count$sample = sample_names # reorder columns seq_count = seq_count[, c("sample", "input", "qualFiltered")] write.csv(seq_count, file.path(path_results, "seq_count_summary.csv"), row.names = FALSE, quote = FALSE) # save filtered R objects for denoising and merging (below) filtR1 = sort(list.files(path_results, pattern = ".R1.fastq.gz", full.names = TRUE)) filtR2 = sort(list.files(path_results, pattern = ".R2.fastq.gz", full.names = TRUE)) sample_names = sapply(strsplit(basename(filtR1), ".R1.fastq.gz"), `[`, 1) saveRDS(filtR1, file.path(path_results, "filtR1.rds")) saveRDS(filtR2, file.path(path_results, "filtR2.rds")) saveRDS(sample_names, file.path(path_results, "sample_names.rds")) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# #set working directory back to "/multiRunDir" setwd(wd) i = i + 1 } } ____________________________________________________ .. _denoiseITS: Denoise and merge paired-end reads ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | Denoise and merge paired-end Illumina reads as in DADA2. | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. code-block:: R :caption: denoise and merge paired-end reads in DADA2 :emphasize-lines: 7-13, 73-77 :linenos: #!/usr/bin/Rscript ## workflow to perform DADA2 denoising and merging # load dada2 library library('dada2') # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) print(paste0("Working with ", dirs[i])) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# #load quality filtered files filtR1 = readRDS("qualFiltered_out/filtR1.rds") filtR2 = readRDS("qualFiltered_out/filtR2.rds") qfilt = readRDS("qualFiltered_out/qfilt_reads.rds") sample_names = readRDS("qualFiltered_out/sample_names.rds") cat("\n sample names = ", sample_names, "\n ") names(filtR1) = sample_names names(filtR2) = sample_names # create output dir path_results = "denoised_merged" dir.create(path_results, showWarnings = FALSE) print("# Denoising ...") # learn the error rates errF = learnErrors(filtR1, multithread = TRUE) errR = learnErrors(filtR2, multithread = TRUE) # make error rate figures pdf(file.path(path_results, "Error_rates_R1.pdf")) print( plotErrors(errF) ) dev.off() pdf(file.path(path_results, "Error_rates_R2.pdf")) print( plotErrors(errR) ) dev.off() # Sample inference and merger of paired-end reads mergers = vector("list", length(sample_names)) names(mergers) = sample_names for(sample in sample_names) { cat("\n -- Processing:", sample, "\n") derepF = derepFastq(filtR1[[sample]]) ddF = dada(derepF, err = errF, multithread = TRUE) derepR = derepFastq(filtR2[[sample]]) ddR = dada(derepR, err = errR, multithread = TRUE) merger = mergePairs(ddF, derepF, ddR, derepR) mergers[[sample]] = merger } rm(derepF); rm(derepR) gc() saveRDS(mergers, (file.path(path_results, "mergers.rds"))) # make sequence table ASV_tab = makeSequenceTable(mergers) #write RDS object saveRDS(ASV_tab, (file.path(path_results, "rawASV_table.rds"))) # make sequence count report getN = function(x) sum(getUniques(x)) #remove 0 seqs samples from qfilt statistics row_sub = apply(qfilt, 1, function(row) all(row !=0 )) qfilt = qfilt[row_sub, ] seq_count = cbind(qfilt, sapply(mergers, getN)) colnames(seq_count) = c("input", "qualFiltered", "denoised_and_merged") rownames(seq_count) = sample_names write.csv(seq_count, file.path(path_results, "seq_count_summary.csv"), row.names = TRUE, quote = FALSE) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# print("--------") setwd(wd) i = i + 1 } } ____________________________________________________ .. _remove_chimerasITS: Chimera filtering ~~~~~~~~~~~~~~~~~ | Remove putative chimeras with DADA2 'consensus' mode. | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. code-block:: R :caption: remove chimeras in DADA2 :emphasize-lines: 14-20, 97-100 :linenos: #!/usr/bin/Rscript ## workflow to perform chimera filtering within DADA2 # load libraries library('dada2') library('openssl') # chimera filtering method method = "consensus" # collapse ASVs that have no mismatshes or internal indels (identical up to shifts and/or length) collapseNoMismatch = "true" #true/false # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) print(paste0("Working with ", dirs[i])) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# # load denoised and merged ASVs rawASV_table = readRDS("denoised_merged/rawASV_table.rds") # create output dir path_results="ASV_table" dir.create(path_results, showWarnings = FALSE) # Remove chimeras print("Removing chimeric ASVs ...") chim_filt = removeBimeraDenovo( rawASV_table, method = method, multithread = TRUE, verbose = TRUE) saveRDS(chim_filt, "ASV_table/chim_filt.rds") ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(chim_filt) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table tchim_filt = t(chim_filt) # add sequences to 1st column tchim_filt = cbind(row.names(tchim_filt), tchim_filt) colnames(tchim_filt)[1] = "Sequence" # row names as sequence headers row.names(tchim_filt) = asv_headers # write ASVs.fasta to path_results asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, file.path(path_results, "ASVs.fasta")) # write ASVs table to path_results write.table(tchim_filt, file.path(path_results, "ASV_table.txt"), sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) ### collapse ASVs that have no mismatshes or internal indels # (identical up to shifts and/or length) if (collapseNoMismatch == "true") { print("Collapsing identical ASVs ...") ASV_tab_collapsed = collapseNoMismatch(chim_filt, minOverlap = 20, orderBy = "abundance", identicalOnly = FALSE, vec = TRUE, band = -1, verbose = TRUE) saveRDS(ASV_tab_collapsed, file.path(path_results, "ASV_table_collapsed.rds")) ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(ASV_tab_collapsed) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table tASV_tab_collapsed = t(ASV_tab_collapsed) # add sequences to 1st column tASV_tab_collapsed = cbind(row.names(tASV_tab_collapsed), tASV_tab_collapsed) colnames(tASV_tab_collapsed)[1] = "Sequence" #row names as sequence headers row.names(tASV_tab_collapsed) = asv_headers # write ASVs.fasta to path_results asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, file.path(path_results, "ASVs_collapsed.fasta")) # write ASVs table to path_results write.table(tASV_tab_collapsed, file.path(path_results, "ASVs_table_collapsed.txt"), sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) # print summary print(paste0("Output = ", length(colnames(ASV_tab_collapsed)), " chimera filtered (+collapsed) ASVs, with a total of ", sum(rowSums(ASV_tab_collapsed)), " sequences.")) print("--------") } else { # print summary print(paste0("Output = ", length(colnames(chim_filt)), " chimera filtered ASVs, with a total of ", sum(rowSums(chim_filt)), " sequences.")) print("--------") } #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# setwd(wd) i = i + 1 } } .. _tagjumpsITS: ____________________________________________________ Remove tag-jumps ~~~~~~~~~~~~~~~~ | Remove putative tag-jumps with UNCROSS2. | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. code-block:: R :caption: removing putative tag-jumps with UNCROSS2 method :emphasize-lines: 15-21, 115-119 :linenos: #!/usr/bin/Rscript ## Script to perform tag-jump removal; (C) Vladimir Mikryukov, # edit, Sten Anslan # load libraries library(data.table) # set parameters set_f = 0.03 # f-parameter of UNCROSS (e.g., 0.03) set_p = 1 # p-parameter (e.g., 1.0) # output dir path_results="ASV_table" # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) print(paste0("Working with ", dirs[i])) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# # load ASV table # loading ASV_table_collapsed if collapseNoMismatch was "true" (above) if (file.exists("ASV_table/ASV_table_collapsed.rds") == TRUE) { tab = readRDS("ASV_table/ASV_table_collapsed.rds") cat("input table = ASV_table/ASV_table_collapsed.rds\n") } else { # loading chimera filtered ASV table tab = readRDS("ASV_table/chim_filt.rds") cat("input table = ASV_table/chim_filt.rds\n") } # format ASV table ASVTABW = as.data.table(t(tab), keep.rownames = TRUE) colnames(ASVTABW)[1] = "ASV" # convert to long format ASVTAB = melt(data = ASVTABW, id.vars = "ASV", variable.name = "SampleID", value.name = "Abundance") # remove zero-OTUs ASVTAB = ASVTAB[ Abundance > 0 ] # estimate total abundance of sequence per plate ASVTAB[ , Total := sum(Abundance, na.rm = TRUE), by = "ASV" ] ## UNCROSS score uncross_score = function(x, N, n, f = 0.01, tmin = 0.1, p = 1){ z = f * N / n # Expected treshold sc = 2 / (1 + exp(x/z)^p) # t-score res = data.table(Score = sc, TagJump = sc >= tmin) return(res) } # esimate UNCROSS score cat(" estimating UNCROSS score\n") ASVTAB = cbind( ASVTAB, uncross_score( x = ASVTAB$Abundance, N = ASVTAB$Total, n = length(unique(ASVTAB$SampleID)), f = as.numeric(set_f), p = as.numeric(set_p) ) ) cat(" number of tag-jumps: ", sum(ASVTAB$TagJump, na.rm = TRUE), "\n") # tag-jump stats TJ = data.table( Total_reads = sum(ASVTAB$Abundance), Number_of_TagJump_Events = sum(ASVTAB$TagJump), TagJump_reads = sum(ASVTAB[ TagJump == TRUE ]$Abundance, na.rm = T)) TJ$ReadPercent_removed = with(TJ, (TagJump_reads / Total_reads * 100)) fwrite(x = TJ, file = "ASV_table/TagJump_stats.txt", sep = "\t") # prepare ASV tables, remove tag-jumps ASVTAB = ASVTAB[ TagJump == FALSE ] # convert to wide format RES = dcast(data = ASVTAB, formula = ASV ~ SampleID, value.var = "Abundance", fill = 0) # sort rows (by total abundance) clz = colnames(RES)[-1] otu_sums = rowSums(RES[, ..clz], na.rm = TRUE) RES = RES[ order(otu_sums, decreasing = TRUE) ] # output table that is compadible with dada2 output = as.matrix(RES, sep = "\t", header = TRUE, rownames = 1, check.names = FALSE, quote = FALSE) output = t(output) saveRDS(output, ("ASV_table/ASV_table_TagJumpFiltered.rds")) ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(output) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table toutput = t(output) # add sequences to 1st column toutput = cbind(row.names(toutput), toutput) colnames(toutput)[1] = "Sequence" #row names as sequence headers row.names(toutput) = asv_headers # write ASVs.fasta to path_results asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, file.path(path_results, "ASVs_TagJumpFiltered.fasta")) # write ASVs table to path_results write.table(toutput, file.path(path_results, "ASV_table_TagJumpFiltered.txt"), sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) # print summary print(paste0("Output = ", length(colnames(output)), " ASVs, with a total of ", sum(rowSums(output)), " sequences.")) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# print("--------") setwd(wd) i = i + 1 } } ____________________________________________________ .. _mergeRunsITS: Merge sequencing runs ~~~~~~~~~~~~~~~~~~~~~ | If previous processing was applied on :ref:`multiple sequencing runs ` , then here, | merge those sequenceing runs to form a single, unified ASV table. | Assuming that tag-jump filtering was performed (inputs = ASV_table_TagJumpFiltered.rds) .. code-block:: R :caption: merge ASV tables from multiple sequencing runs :emphasize-lines: 1-88 :linenos: #!/usr/bin/Rscript ## Merge sequencing runs, if working with multiple ones # load libraries library('dada2') # after merging multiple ASV tables ... # collapse ASVs that have no mismatshes or internal indels collapseNoMismatch = "true" #true/false # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) tables = c() # load tables from multiple sequencing runs (dirs) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) tables = append(tables, print(file.path(paste0(dirs[i], "/ASV_table"), "ASV_table_TagJumpFiltered.rds"))) setwd(wd) i = i + 1 } } # Merge multiple ASV tables print("# Merging multiple ASV tables ...") ASV_tables = lapply(tables, readRDS) merged_table = mergeSequenceTables(tables = ASV_tables, repeats = "sum", tryRC = FALSE) ### collapse ASVs that have no mismatshes or internal indels if (collapseNoMismatch == "true") { print("# Collapsing identical ASVs ...") merged_table_collapsed = collapseNoMismatch(merged_table, minOverlap = 20, orderBy = "abundance", identicalOnly = FALSE, vec = TRUE, band = -1, verbose = TRUE) saveRDS(merged_table_collapsed, "merged_table_collapsed.rds") ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(merged_table_collapsed) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table tmerged_table_collapsed = t(merged_table_collapsed) # add sequences to 1st column tmerged_table_collapsed = cbind(row.names(tmerged_table_collapsed), tmerged_table_collapsed) colnames(tmerged_table_collapsed)[1] = "Sequence" #row names as sequence headers row.names(tmerged_table_collapsed) = asv_headers # write ASVs.fasta asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, "ASVs.merged_collapsed.fasta") # write ASVs table write.table(tmerged_table_collapsed, "ASV_table.merged_collapsed.txt", sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) # print summary print(paste0("Output = ", length(colnames(merged_table_collapsed)), " ASVs, with a total of ", sum(rowSums(merged_table_collapsed)), " sequences.")) } else { saveRDS(merged_table, "merged_table.rds") ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(merged_table) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table tmerged_table = t(merged_table) # add sequences to 1st column tmerged_table = cbind(row.names(tmerged_table), tmerged_table) colnames(tmerged_table)[1] = "Sequence" #row names as sequence headers row.names(tmerged_table) = asv_headers # write ASVs.fasta to path_results asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, "ASVs.merged.fasta") # write ASVs table to path_results write.table(tmerged_table, "ASV_table.merged.txt", sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) # print summary print(paste0("Output = ", length(colnames(merged_table)), " ASVs, with a total of ", sum(rowSums(merged_table)), " sequences.")) } ____________________________________________________ .. _itsx: Extract ITS1/ITS2 region ~~~~~~~~~~~~~~~~~~~~~~~~ Extract the ITS1/ITS2 region, i.e., clip conservative primer binding sites (18S, 5.8S, 28S) from the ASVs, for making ASVs that differ only withing the ITS1/ITS2 part. .. code-block:: bash :caption: Extract ITS1/ITS2 with ITSx :linenos: #!/bin/bash # specify the input ASVs fasta file fasta="ASVs_TagJumpFiltered.fasta" # ASVs_TagJumpFiltered.fasta/ASVs.merged.fasta # specify the target ITS region region="ITS2" # ITS1/ITS2 # output dir output_dir=$"ITSx_out" mkdir $output_dir # ITSx ITSx -i $fasta \ --save_regions $region \ -o $output_dir/${fasta%.*} \ --preserve T \ --graphical F \ -t all \ --cpu 20 \ -E 0.01 \ --complement F \ --truncate T .. code-block:: bash :caption: collapse identical ASVs after ITSx :linenos: #!/bin/bash vsearch --derep_fulllength $output_dir/${fasta%.*}.$region.${fasta##*.} \ --output $output_dir/${fasta%.*}.$region.derep.${fasta##*.} \ --uc $output_dir/derep.uc \ --fasta_width 0 # export vars for R derep_fasta="$output_dir/${fasta%.*}.$region.derep.${fasta##*.}" export derep_fasta export output_dir # output_dir=$"ITSx_out" .. code-block:: R :caption: make ITS1/2 amplicon ASV table after ITSx :linenos: #!/usr/bin/Rscript ### Generate ASV table based on the dereplicated ASVs that went through ITSx ### Herein "OTUs" refer to "ASVs after ITSx". # specify input ASV table (the same one as for 'get the size of ASVs') ASV_table="ASV_table_TagJumpFiltered.fasta" # read output dir output_dir = Sys.getenv('output_dir') # read output from vsearch clustering (-uc OTU.uc) inp_UC = file.path(output_dir, "derep.uc") ################################ library(data.table) # load input data - ASV table ASV_table = fread(file = ASV_table, header = TRUE, sep = "\t") ## Load input data - UC mapping file UC = fread(file = inp_UC, header = FALSE, sep = "\t") UC = UC[ V1 != "S" ] UC[, ASV := tstrsplit(V9, ";", keep = 1) ] UC[, OTU := tstrsplit(V10, ";", keep = 1) ] UC[V1 == "C", OTU := ASV ] UC = UC[, .(ASV, OTU)] # convert ASV table to long format ASV = melt(data = ASV_table, id.vars = colnames(ASV_table)[1], variable.name = "SampleID", value.name = "Abundance") ASV = ASV[ Abundance > 0 ] # add colnames, to make sure 1st is 'ASV' colnames(ASV) = c("ASV", "SampleID", "Abundance") # add OTU IDs ASV = merge(x = ASV, y = UC, by = "ASV", all.x = TRUE) # summarize OTU = ASV[ , .(Abundance = sum(Abundance, na.rm = TRUE)), by = c("SampleID", "OTU")] # reshape OTU table to wide format OTU_table = dcast(data = ASV, formula = OTU ~ SampleID, value.var = "Abundance", fun.aggregate = sum, fill = 0) # write OTU table # OTU names correspond to most abundant ASV in an OTU fwrite(x = OTU_table, file = file.path(output_dir, "ASV_table_ITSx.txt"), sep = "\t") .. code-block:: bash :caption: make fasta file corresponding to new ASV table :linenos: #!/bin/bash # make fasta file corresponding to ASV table awk 'NR>1{print $1}' ASV_table_ITSx.txt > ASVs.list cat $fasta | seqkit grep -w 0 -f ASVs.list > ASVs.ITSx.fasta ____________________________________________________ .. _taxAssignITS: Taxonomy assignment ~~~~~~~~~~~~~~~~~~~ Taxonomy assignment with the `SINTAX `_ against `EUKARYOME database `_ . **---** `Download the EUKARYOME (v1.9.2) for SINTAX here (click) `_ **---** .. code-block:: bash :caption: assign taxonomy with SINTAX :linenos: #!/bin/bash # download the EUKARYOME reference databse wget \ "https://sisu.ut.ee/wp-content/uploads/sites/643/SINTAX_EUK_ITS_v1.9.zip" # unzip the database and edit name unzip SINTAX_EUK_ITS_v1.9.zip -d EUKARYOME1.9.2_sintax # specify reference database reference_database="EUKARYOME1.9.2_sintax/SINTAX_EUK_ITS_v1.9.fasta" reference_database=$(realpath $reference_database) # get database names with full path # specify input fasta file cd ASV_table ASV_fasta="ASVs.ITSx.fasta" # Run SINTAX classifier vsearch -sintax $ASV_fasta \ -db $reference_database \ -tabbedout taxonomy.txt \ -strand both \ -sintax_cutoff 0.8 \ --threads 8 ____________________________________________________ .. _clusteringITS: Clustering ASVs to OTUs ~~~~~~~~~~~~~~~~~~~~~~~ | This step clusters ASVs to OTUs with vsearch. .. code-block:: R :caption: get the size of ASVs :linenos: #!/usr/bin/env Rscript # specify input ASVs table and fasta ASV_table="ASV_table_ITSx.txt" # specify ASV table file ASV_fasta="ASVs.ITSx.fasta" # specify ASVs fasta file ################################ library(Biostrings) # Read the ASV table ASV_table = read.table(ASV_table, sep = "\t", check.names = F, header = T, row.names = 1) # add 'sum' column ASV_table$sum = rowSums(ASV_table) # make ASV_sums object ASV_sums = setNames(ASV_table$sum, rownames(ASV_table)) # Read the FASTA file ASV_fasta = readDNAStringSet(ASV_fasta) # add ";size=*" to ASV_fasta names(ASV_fasta) = sapply(names(ASV_fasta), function(header) { paste0(header, ";size=", ASV_sums[header]) }) # write fasta file writeXStringSet(ASV_fasta, "ASVs.size.fasta", width = max(width(ASV_fasta))) .. code-block:: bash :caption: clustering with vsearch :linenos: #!/bin/bash # specify the clustering threshold clustering_thresh="0.97" # make output dir. output_dir="OTU_table" mkdir -p $output_dir export output_dir ### cluster ASVs using vsearch. vsearch --cluster_fast ASVs.size.fasta \ --id $clustering_thresh \ --iddef 2 \ --sizein \ --xsize \ --fasta_width 0 \ --centroids $output_dir/OTUs.fasta \ --uc $output_dir/OTUs.uc .. code-block:: R :caption: generate an OTU table based on the clustered ASVs (.uc file). :linenos: #!/usr/bin/Rscript # specify input ASV table (the same one as for 'get the size of ASVs') ASV_table="ASV_table_ITSx.txt" # read output dir output_dir = Sys.getenv('output_dir') # read output from vsearch clustering (-uc OTU.uc) inp_UC = file.path(output_dir, "OTUs.uc") ################################ library(data.table) # load input data - ASV table ASV_table = fread(file = ASV_table, header = TRUE, sep = "\t") ## Load input data - UC mapping file UC = fread(file = inp_UC, header = FALSE, sep = "\t") UC = UC[ V1 != "S" ] UC[, ASV := tstrsplit(V9, ";", keep = 1) ] UC[, OTU := tstrsplit(V10, ";", keep = 1) ] UC[V1 == "C", OTU := ASV ] UC = UC[, .(ASV, OTU)] # convert ASV table to long format ASV = melt(data = ASV_table, id.vars = colnames(ASV_table)[1], variable.name = "SampleID", value.name = "Abundance") ASV = ASV[ Abundance > 0 ] # add colnames, to make sure 1st is 'ASV' colnames(ASV) = c("ASV", "SampleID", "Abundance") # add OTU IDs ASV = merge(x = ASV, y = UC, by = "ASV", all.x = TRUE) # summarize OTU = ASV[ , .(Abundance = sum(Abundance, na.rm = TRUE)), by = c("SampleID", "OTU")] # reshape OTU table to wide format OTU_table = dcast(data = ASV, formula = OTU ~ SampleID, value.var = "Abundance", fun.aggregate = sum, fill = 0) # write OTU table # OTU names correspond to most abundant ASV in an OTU fwrite(x = OTU_table, file = file.path(output_dir, "OTU_table.txt"), sep = "\t") ____________________________________________________ .. _postclusteringITS: Post-clustering ~~~~~~~~~~~~~~~ Post-cluster OTUs with LULU to merge consistently co-occurring 'daughter-OTUs'. .. code-block:: bash :caption: generate match list for post-clustering :linenos: #!/bin/bash # go to directrory that contains OTUs cd $output_dir # 'OTU_table' in this case # make blast database for post-clustering makeblastdb -in OTUs.fasta -parse_seqids -dbtype nucl # generate match list for post-clustering blastn -db OTUs.fasta \ -outfmt '6 qseqid sseqid pident' \ -out match_list.txt \ -qcov_hsp_perc 75 \ -perc_identity 90 \ -query OTUs.fasta \ -num_threads 20 .. code-block:: R :caption: run LULU post-clustering :linenos: #!/usr/bin/Rscript # specify minimum threshold of sequence similarity considering any OTU as an error of another min_match = "90" # specify OTU table OTU_table="OTU_table.txt" ################################ library(devtools) # load OTU table and match list otutable = read.table(OTU_table, header = T, row.names = 1, sep = "\t") matchlist = read.table("match_list.txt") curated_result = lulu::lulu(otutable, matchlist, minimum_match = min_match) # write post-clustered OTU table to file curated_table = curated_result$curated_table curated_table = cbind(OTU = rownames(curated_table), curated_table) write.table(curated_table, file ="OTU_table_LULU.txt", sep = "\t", row.names = F, quote = FALSE) write.table(curated_result$discarded_otus, file ="merged_units.lulu", col.names = FALSE, quote = FALSE) .. note:: Note that if the sample names start with a number, then the output OTU table may contain "X" prefix in the sample names. .. code-block:: bash :caption: match OTUs.fasta and taxonomy table with post-clustered table (OTU_table_LULU) :linenos: #!/bin/bash # specify post-clustered table OTU_table="OTU_table_LULU.txt" # specify pre post-clustered OTUs fasta file OTUs_fasta="OTUs.fasta" # get matching OTUs awk 'NR>1{print $1}' $OTU_table > OTUs_LULU.list cat $OTUs_fasta | \ seqkit grep -w 0 -f OTUs_LULU.list > OTUs_LULU.fasta # get matching SINTAX taxonomy results head -n 1 ../taxonomy.txt > taxonomy.txt cat ../taxonomy.txt | \ grep -wf OTUs_LULU.list >> taxonomy.txt # remove unnecessary files rm OTUs.fasta.n* # move OTU_table two directories down cd .. mv $output_dir ../.. .. note:: The final OTUs data is ``OTU_table_LULU.txt`` and ``OTUs_LULU.fasta`` in the ``OTU_table`` directory. The matching SINTAX taxonomy files are ``taxonomy.txt`` in the ``OTU_table`` directory. ____________________________________________________ We acknowledge `CSC - IT Center for Science `_, Finland, for computational resources while building and testing this workflows. ____________________________________________________ |logo_BGE_small| |eufund| |chfund| |ukrifund|