.. |logo_BGE_alpha| image:: _static/logo_BGE_alpha.png :width: 300 :alt: Alternative text :target: https://biodiversitygenomics.eu/ .. |eufund| image:: _static/eu_co-funded.png :width: 200 :alt: Alternative text .. |chfund| image:: _static/ch-logo-200x50.png :width: 210 :alt: Alternative text .. |ukrifund| image:: _static/ukri-logo-200x59.png :width: 150 :alt: Alternative text .. |logo_BGE_small| image:: _static/logo_BGE_alpha.png :width: 120 :alt: Alternative text :target: https://biodiversitygenomics.eu/ .. |main_interface| image:: _static/main_interface.png :width: 2000 :alt: Alternative text .. raw:: html .. role:: red .. raw:: html .. role:: yellow-background |logo_BGE_alpha| 16S/18S/12S *********** This is executable step-by-step pipeline for **16S/18S/12S** amplicon data from Illumina sequencing machine. The bioinformatic workflow results in amplicon sequence variants (ASVs) and well as operational taxonomic units (OTUs). .. note:: The **full bioinformatics workflow can be automatically run through** `PipeCraft2 `_, a software package which represents a graphical user interface wrapper for all the bioinformatic steps below. `See the example workflow for 16S `_. |main_interface| .. admonition:: Citation of the pipeline When using this pipeline, please cite as: Anslan, S, Bahram, M, Hiiesalu, I, Tedersoo, L. PipeCraft: Flexible open-source toolkit for bioinformatics analysis of custom high-throughput amplicon sequencing data. Mol Ecol Resour. 2017; 17: e234-e240. https://doi.org/10.1111/1755-0998.12692 Also, please **cite the original resources of the wrapped software**. Dependencies ~~~~~~~~~~~~ +-----------------------------------------------------+----------+----------------+ | Process | Software | Version | +=====================================================+==========+================+ | :ref:`Remove primers ` | cutadapt | 4.9 | +-----------------------------------------------------+----------+----------------+ | :ref:`Quality filtering ` | DADA2 | 1.30 | +-----------------------------------------------------+----------+----------------+ | :ref:`Denoise ` | DADA2 | 1.30 | +-----------------------------------------------------+----------+----------------+ | :ref:`Merge paired-end reads ` | DADA2 | 1.30 | +-----------------------------------------------------+----------+----------------+ | :ref:`Chimera filtering ` | DADA2 | 1.30 | +-----------------------------------------------------+----------+----------------+ | :ref:`Remove tag-jumps ` | UNCROSS2 | | +-----------------------------------------------------+----------+----------------+ | :ref:`Merge sequencing runs* ` | DADA2 | 1.30 | +-----------------------------------------------------+----------+----------------+ | :ref:`Taxonomy assignment ` | RDP | in DADA2 v1.30 | +-----------------------------------------------------+----------+----------------+ | :ref:`Clustering ASVs to OTUs ` | vsearch | 2.28.1 | +-----------------------------------------------------+----------+----------------+ | :ref:`Post-clusteringlustering ` | BLAST | 2.12.0+ | +-----------------------------------------------------+----------+----------------+ | :ref:`Post-clusteringlustering ` | LULU | 0.1.0+ | +-----------------------------------------------------+----------+----------------+ \*only applicable when there are multiple sequencing runs per study. .. note:: All the dependencies/software of the pipeline are available on a `Docker image `_. | Download `Docker for windows `_ | Download `Docker for Mac `_ | Install Docker for Linux - `follow the guidelines under appropriate Linux distribution `_ .. code-block:: bash :caption: get the Docker image docker pull pipecraft/bioscanflow:2 .. code-block:: bash :caption: example of running the pipeline via Docker image # run docker # specify the files location with -v flag ($PWD = the current working directory) docker run -i --tty -v $PWD/:/Files pipecraft/bioscanflow:2 # inside the container, the files are accessible in the /Files dir cd Files # checking if cutadapt is available cutadapt -h # ready to run the pipe as below ... ## make sure that via the shared folder (-v) path you have access also to the reference databases. Data structure ~~~~~~~~~~~~~~ .. _multiRunDir16S: Multiple sequencing runs ------------------------ .. important:: When aiming to combine samples from multiple sequencing runs, then follow the below directory structure **Directory structure:** | **/multiRunDir** *(directory names can be changed)* | ├── **/sequencing_set01** | │ ├── *sample1.R1.fastq* | │ ├── *sample1.R2.fastq* | │ ├── *sample2.R1.fastq* | │ ├── *sample2.R2.fastq* | │ ├── ... | ├── **/sequencing_set02** | │ ├── *sampleA.R1.fastq* | │ ├── *sampleA.R2.fastq* | │ ├── *sampleB.R1.fastq* | │ ├── *sampleB.R2.fastq* | │ ├── ... | └── **/sequencing_set03** | ├── *sample11.R1.fastq* | ├── *sample11.R2.fastq* | ├── *sample12.R1.fastq* | ├── *sample12.R2.fastq* | ├── ... .. note:: Fastq files with the **same name** will be considered as the same sample and will be merged in the "Merge sequencing runs" step. Single sequencing run --------------------- | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` | ____________________________________________________ .. _remove_primers16S: Remove primers ~~~~~~~~~~~~~~ | Remove primer strings from paired-end data. | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. note:: Here, assuming that all sequences are in 5'-3' orientation! *(3'-5' orient sequences will be discarded with this workflow)* .. important:: | - Paired-end files must contain "R1" and "R2" strings (not just _1 and _2)! | - Sample names must not contain "R1" and "R2" strings (i.e. not FR123_001_R1.fastq/FR123_001_R2.fastq) Below, the script is specifying the primers for the 16S data analyses. **For 18S and 12S, just replace the primer strings and run the script**. .. code-block:: bash :caption: remove primers with cutadapt :emphasize-lines: 25-30, 55-56 :linenos: #!/bin/bash ## workflow to remove primers via cutadapt # My working folder = /multiRunDir (see dir structure above) # specify the identifier string for the R1 files read_R1="_R1" # specify 16S primers fwd_primer=$"GTGYCAGCMGCCGCGGTAA" #this is primer 515F for 16S rev_primer=$"GGCCGYCAATTYMTTTRAGTTT" #this is primer 926R for 16S # specify 12S primers #fwd_primer=$"GTGCCAGCNRCCGCGGTYANAC" #this is primer V16S-U-fw #rev_primer=$"ATAGTRGGGTATCTAATCCYAGT" #this is primer V16S-U-rw # specify 18S primers #fwd_primer=$"CCAGCASCYGCGGTAATTCC" #this is primer TAReuk454FWD1 #rev_primer=$"ACTTTCGTTCTTGATYRA" #this is primer TAReukREV3 # edit primer trimming settings maximum_error_rate="2" # Maximum error rate in primer string search; # if set as 1, then allow 1 mismatch; # if set as 0.1, then allow mismatch in 10% of the bases, # i.e. if a primer is 20 bp then allowing 2 mismatches. overlap="19" # The minimum overlap length. Keep it nearly as high # as the primer length to avoid short random matches. pair_filter="any" # Option 'any' discards a read pair if primers are not found in # either of the read pairs (R1 and R2). # Option 'both' keeps the read pair if a primer is found in # at least one of the read pairs. # get directory names if working with multiple sequencing runs DIRS=$(ls -d *) # -> sequencing_set01 sequencing_set02 sequencing_set03 for sequencing_run in $DIRS; do printf "\nWorking with $sequencing_run \n" cd $sequencing_run #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# # make output dirs mkdir -p primersCut_out mkdir -p primersCut_out/untrimmed ### Clip primers with cutadapt for inputR1 in *$read_R1*; do inputR2=$(echo $inputR1 | sed -e 's/R1/R2/') cutadapt --quiet \ -e $maximum_error_rate \ --minimum-length 32 \ --overlap $overlap \ --no-indels \ --cores=0 \ --untrimmed-output primersCut_out/untrimmed/$inputR1 \ --untrimmed-paired-output primersCut_out/untrimmed/$inputR2 \ --pair-filter=$pair_filter \ -g $fwd_primer \ -G $rev_primer \ -o primersCut_out/$inputR1 \ -p primersCut_out/$inputR2 \ $inputR1 $inputR2 done #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# cd .. done ____________________________________________________ .. _quality_filtering16S: Quality filtering ~~~~~~~~~~~~~~~~~ | Quality filtering of the fastq files based on the allowed maximum error rate per sequence (as in DADA2). | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. code-block:: R :caption: quality filtering in DADA2 (in R) :emphasize-lines: 13-19, 67-71 :linenos: #!/usr/bin/Rscript ## workflow to perform quality filtering within DADA2 #load dada2 library library('dada2') # specify the identifier string for the R1 files read_R1 = "_R1" # get the identifier string for the R2 files read_R2 = gsub("R1", "R2", read_R1) # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) print(paste0("Working with ", dirs[i])) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# # output path path_results = "qualFiltered_out" # input and output file paths R1s = sort(list.files("primersCut_out", pattern = read_R1, full.names = TRUE)) R2s = sort(list.files("primersCut_out", pattern = read_R2, full.names = TRUE)) #sample names sample_names = sapply(strsplit(basename(R1s), read_R1), `[`, 1) # filtered files path filtR1 = file.path(path_results, paste0(sample_names, ".R1.", "fastq.gz")) filtR2 = file.path(path_results, paste0(sample_names, ".R2.", "fastq.gz")) names(filtR1) = sample_names names(filtR2) = sample_names #quality filtering qfilt = filterAndTrim(R1s, filtR1, R2s, filtR2, maxN = 0, # max number of allowed N bases. maxEE = c(2, 2), # max error rate per R1 and R2 read, respectively. truncQ = 2, # truncate reads at the first instance of a quality score less than or equal to specified value. truncLen = c(0, 0), # truncate reads after specified length for R1 and R2 reads, respectively. maxLen = 600, # discard reads longer than specified. minLen = 100, # discard reads shorter than specified. minQ = 2, # discard reads (after truncation) that contain a quality score below specified value. matchIDs = TRUE, # output paired-end reads with matching IDs (for merging). compress = TRUE, # gzip the output multithread = TRUE) # use multiple threads saveRDS(qfilt, file.path(path_results, "qfilt_reads.rds")) # make sequence count report seq_count = cbind(qfilt) colnames(seq_count) = c("input", "qualFiltered") seq_count = as.data.frame(seq_count) seq_count$sample = sample_names # reorder columns seq_count = seq_count[, c("sample", "input", "qualFiltered")] write.csv(seq_count, file.path(path_results, "seq_count_summary.csv"), row.names = FALSE, quote = FALSE) # save filtered R objects for denoising and merging (below) filtR1 = sort(list.files(path_results, pattern = ".R1.fastq.gz", full.names = TRUE)) filtR2 = sort(list.files(path_results, pattern = ".R2.fastq.gz", full.names = TRUE)) sample_names = sapply(strsplit(basename(filtR1), ".R1.fastq.gz"), `[`, 1) saveRDS(filtR1, file.path(path_results, "filtR1.rds")) saveRDS(filtR2, file.path(path_results, "filtR2.rds")) saveRDS(sample_names, file.path(path_results, "sample_names.rds")) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# #set working directory back to "/multiRunDir" setwd(wd) i = i + 1 } } ____________________________________________________ .. _denoise16S: Denoise and merge paired-end reads ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | Denoise and merge paired-end Illumina reads as in DADA2. | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. code-block:: R :caption: denoise and merge paired-end reads in DADA2 :emphasize-lines: 7-13, 73-77 :linenos: #!/usr/bin/Rscript ## workflow to perform DADA2 denoising and merging # load dada2 library library('dada2') # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) print(paste0("Working with ", dirs[i])) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# #load quality filtered files filtR1 = readRDS("qualFiltered_out/filtR1.rds") filtR2 = readRDS("qualFiltered_out/filtR2.rds") qfilt = readRDS("qualFiltered_out/qfilt_reads.rds") sample_names = readRDS("qualFiltered_out/sample_names.rds") cat("\n sample names = ", sample_names, "\n ") names(filtR1) = sample_names names(filtR2) = sample_names # create output dir path_results = "denoised_merged" dir.create(path_results, showWarnings = FALSE) print("# Denoising ...") # learn the error rates errF = learnErrors(filtR1, multithread = TRUE) errR = learnErrors(filtR2, multithread = TRUE) # make error rate figures pdf(file.path(path_results, "Error_rates_R1.pdf")) print( plotErrors(errF) ) dev.off() pdf(file.path(path_results, "Error_rates_R2.pdf")) print( plotErrors(errR) ) dev.off() # Sample inference and merger of paired-end reads mergers = vector("list", length(sample_names)) names(mergers) = sample_names for(sample in sample_names) { cat("\n -- Processing:", sample, "\n") derepF = derepFastq(filtR1[[sample]]) ddF = dada(derepF, err = errF, multithread = TRUE) derepR = derepFastq(filtR2[[sample]]) ddR = dada(derepR, err = errR, multithread = TRUE) merger = mergePairs(ddF, derepF, ddR, derepR) mergers[[sample]] = merger } rm(derepF); rm(derepR) gc() saveRDS(mergers, (file.path(path_results, "mergers.rds"))) # make sequence table ASV_tab = makeSequenceTable(mergers) #write RDS object saveRDS(ASV_tab, (file.path(path_results, "rawASV_table.rds"))) # make sequence count report getN = function(x) sum(getUniques(x)) #remove 0 seqs samples from qfilt statistics row_sub = apply(qfilt, 1, function(row) all(row !=0 )) qfilt = qfilt[row_sub, ] seq_count = cbind(qfilt, sapply(mergers, getN)) colnames(seq_count) = c("input", "qualFiltered", "denoised_and_merged") rownames(seq_count) = sample_names write.csv(seq_count, file.path(path_results, "seq_count_summary.csv"), row.names = TRUE, quote = FALSE) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# print("--------") setwd(wd) i = i + 1 } } ____________________________________________________ .. _remove_chimeras16S: Chimera filtering ~~~~~~~~~~~~~~~~~ | Remove putative chimeras with DADA2 'consensus' mode. | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. code-block:: R :caption: remove chimeras in DADA2 :emphasize-lines: 14-20, 97-100 :linenos: #!/usr/bin/Rscript ## workflow to perform chimera filtering within DADA2 # load libraries library('dada2') library('openssl') # chimera filtering method method = "consensus" # collapse ASVs that have no mismatshes or internal indels (identical up to shifts and/or length) collapseNoMismatch = "true" #true/false # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) print(paste0("Working with ", dirs[i])) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# # load denoised and merged ASVs rawASV_table = readRDS("denoised_merged/rawASV_table.rds") # create output dir path_results="ASV_table" dir.create(path_results, showWarnings = FALSE) # Remove chimeras print("Removing chimeric ASVs ...") chim_filt = removeBimeraDenovo( rawASV_table, method = method, multithread = TRUE, verbose = TRUE) saveRDS(chim_filt, "ASV_table/chim_filt.rds") ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(chim_filt) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table tchim_filt = t(chim_filt) # add sequences to 1st column tchim_filt = cbind(row.names(tchim_filt), tchim_filt) colnames(tchim_filt)[1] = "Sequence" # row names as sequence headers row.names(tchim_filt) = asv_headers # write ASVs.fasta to path_results asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, file.path(path_results, "ASVs.fasta")) # write ASVs table to path_results write.table(tchim_filt, file.path(path_results, "ASV_table.txt"), sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) ### collapse ASVs that have no mismatshes or internal indels # (identical up to shifts and/or length) if (collapseNoMismatch == "true") { print("Collapsing identical ASVs ...") ASV_tab_collapsed = collapseNoMismatch(chim_filt, minOverlap = 20, orderBy = "abundance", identicalOnly = FALSE, vec = TRUE, band = -1, verbose = TRUE) saveRDS(ASV_tab_collapsed, file.path(path_results, "ASV_table_collapsed.rds")) ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(ASV_tab_collapsed) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table tASV_tab_collapsed = t(ASV_tab_collapsed) # add sequences to 1st column tASV_tab_collapsed = cbind(row.names(tASV_tab_collapsed), tASV_tab_collapsed) colnames(tASV_tab_collapsed)[1] = "Sequence" #row names as sequence headers row.names(tASV_tab_collapsed) = asv_headers # write ASVs.fasta to path_results asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, file.path(path_results, "ASVs_collapsed.fasta")) # write ASVs table to path_results write.table(tASV_tab_collapsed, file.path(path_results, "ASVs_table_collapsed.txt"), sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) # print summary print(paste0("Output = ", length(colnames(ASV_tab_collapsed)), " chimera filtered (+collapsed) ASVs, with a total of ", sum(rowSums(ASV_tab_collapsed)), " sequences.")) print("--------") } else { # print summary print(paste0("Output = ", length(colnames(chim_filt)), " chimera filtered ASVs, with a total of ", sum(rowSums(chim_filt)), " sequences.")) print("--------") } #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# setwd(wd) i = i + 1 } } ____________________________________________________ .. _tagjumps16S: Remove tag-jumps ~~~~~~~~~~~~~~~~ | Remove putative tag-jumps with UNCROSS2. | | When working with a **single directory** that hosts your fastq files, then | :yellow-background:`ignore (do not execute) the script lines in yellow.` .. code-block:: R :caption: removing putative tag-jumps with UNCROSS2 method :emphasize-lines: 15-21, 115-119 :linenos: #!/usr/bin/Rscript ## Script to perform tag-jump removal; (C) Vladimir Mikryukov, # edit, Sten Anslan # load libraries library(data.table) # set parameters set_f = 0.03 # f-parameter of UNCROSS (e.g., 0.03) set_p = 1 # p-parameter (e.g., 1.0) # output dir path_results="ASV_table" # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) print(paste0("Working with ", dirs[i])) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# # load ASV table # loading ASV_table_collapsed if collapseNoMismatch was "true" (above) if (file.exists("ASV_table/ASV_table_collapsed.rds") == TRUE) { tab = readRDS("ASV_table/ASV_table_collapsed.rds") cat("input table = ASV_table/ASV_table_collapsed.rds\n") } else { # loading chimera filtered ASV table tab = readRDS("ASV_table/chim_filt.rds") cat("input table = ASV_table/chim_filt.rds\n") } # format ASV table ASVTABW = as.data.table(t(tab), keep.rownames = TRUE) colnames(ASVTABW)[1] = "ASV" # convert to long format ASVTAB = melt(data = ASVTABW, id.vars = "ASV", variable.name = "SampleID", value.name = "Abundance") # remove zero-OTUs ASVTAB = ASVTAB[ Abundance > 0 ] # estimate total abundance of sequence per plate ASVTAB[ , Total := sum(Abundance, na.rm = TRUE), by = "ASV" ] ## UNCROSS score uncross_score = function(x, N, n, f = 0.01, tmin = 0.1, p = 1){ z = f * N / n # Expected treshold sc = 2 / (1 + exp(x/z)^p) # t-score res = data.table(Score = sc, TagJump = sc >= tmin) return(res) } # esimate UNCROSS score cat(" estimating UNCROSS score\n") ASVTAB = cbind( ASVTAB, uncross_score( x = ASVTAB$Abundance, N = ASVTAB$Total, n = length(unique(ASVTAB$SampleID)), f = as.numeric(set_f), p = as.numeric(set_p) ) ) cat(" number of tag-jumps: ", sum(ASVTAB$TagJump, na.rm = TRUE), "\n") # tag-jump stats TJ = data.table( Total_reads = sum(ASVTAB$Abundance), Number_of_TagJump_Events = sum(ASVTAB$TagJump), TagJump_reads = sum(ASVTAB[ TagJump == TRUE ]$Abundance, na.rm = T)) TJ$ReadPercent_removed = with(TJ, (TagJump_reads / Total_reads * 100)) fwrite(x = TJ, file = "ASV_table/TagJump_stats.txt", sep = "\t") # prepare ASV tables, remove tag-jumps ASVTAB = ASVTAB[ TagJump == FALSE ] # convert to wide format RES = dcast(data = ASVTAB, formula = ASV ~ SampleID, value.var = "Abundance", fill = 0) # sort rows (by total abundance) clz = colnames(RES)[-1] otu_sums = rowSums(RES[, ..clz], na.rm = TRUE) RES = RES[ order(otu_sums, decreasing = TRUE) ] # output table that is compadible with dada2 output = as.matrix(RES, sep = "\t", header = TRUE, rownames = 1, check.names = FALSE, quote = FALSE) output = t(output) saveRDS(output, ("ASV_table/ASV_table_TagJumpFiltered.rds")) ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(output) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table toutput = t(output) # add sequences to 1st column toutput = cbind(row.names(toutput), toutput) colnames(toutput)[1] = "Sequence" #row names as sequence headers row.names(toutput) = asv_headers # write ASVs.fasta to path_results asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, file.path(path_results, "ASVs_TagJumpFiltered.fasta")) # write ASVs table to path_results write.table(toutput, file.path(path_results, "ASV_table_TagJumpFiltered.txt"), sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) # print summary print(paste0("Output = ", length(colnames(output)), " ASVs, with a total of ", sum(rowSums(output)), " sequences.")) #-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-#-# print("--------") setwd(wd) i = i + 1 } } ____________________________________________________ .. _mergeRuns16S: Merge sequencing runs ~~~~~~~~~~~~~~~~~~~~~ | If previous processing was applied on :ref:`multiple sequencing runs ` , then here, | merge those sequenceing runs to form a single, unified ASV table. | Assuming that tag-jump filtering was performed (inputs = ASV_table_TagJumpFiltered.rds) .. code-block:: R :caption: merge ASV tables from multiple sequencing runs :emphasize-lines: 1-88 :linenos: #!/usr/bin/Rscript ## Merge sequencing runs, if working with multiple ones # load libraries library('dada2') # after merging multiple ASV tables ... # collapse ASVs that have no mismatshes or internal indels collapseNoMismatch = "true" #true/false # capturing the directory structure when working with multiple runs wd = getwd() # -> wd is "~/multiRunDir" dirs = list.dirs(recursive = FALSE) tables = c() # load tables from multiple sequencing runs (dirs) for (i in 1:length(dirs)) { if(length(dirs) > 1) { setwd(dirs[i]) tables = append(tables, print(file.path(paste0(dirs[i], "/ASV_table"), "ASV_table_TagJumpFiltered.rds"))) setwd(wd) i = i + 1 } } # Merge multiple ASV tables print("# Merging multiple ASV tables ...") ASV_tables = lapply(tables, readRDS) merged_table = mergeSequenceTables(tables = ASV_tables, repeats = "sum", tryRC = FALSE) ### collapse ASVs that have no mismatshes or internal indels if (collapseNoMismatch == "true") { print("# Collapsing identical ASVs ...") merged_table_collapsed = collapseNoMismatch(merged_table, minOverlap = 20, orderBy = "abundance", identicalOnly = FALSE, vec = TRUE, band = -1, verbose = TRUE) saveRDS(merged_table_collapsed, "merged_table_collapsed.rds") ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(merged_table_collapsed) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table tmerged_table_collapsed = t(merged_table_collapsed) # add sequences to 1st column tmerged_table_collapsed = cbind(row.names(tmerged_table_collapsed), tmerged_table_collapsed) colnames(tmerged_table_collapsed)[1] = "Sequence" #row names as sequence headers row.names(tmerged_table_collapsed) = asv_headers # write ASVs.fasta asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, "ASVs.merged_collapsed.fasta") # write ASVs table write.table(tmerged_table_collapsed, "ASV_table.merged_collapsed.txt", sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) # print summary print(paste0("Output = ", length(colnames(merged_table_collapsed)), " ASVs, with a total of ", sum(rowSums(merged_table_collapsed)), " sequences.")) } else { saveRDS(merged_table, "merged_table.rds") ### format and save ASV table and ASVs.fasta files # sequence headers asv_seqs = colnames(merged_table) asv_headers = openssl::sha1(asv_seqs) # transpose sequence table tmerged_table = t(merged_table) # add sequences to 1st column tmerged_table = cbind(row.names(tmerged_table), tmerged_table) colnames(tmerged_table)[1] = "Sequence" #row names as sequence headers row.names(tmerged_table) = asv_headers # write ASVs.fasta to path_results asv_fasta = c(rbind(paste(">", asv_headers, sep=""), asv_seqs)) write(asv_fasta, "ASVs.merged.fasta") # write ASVs table to path_results write.table(tmerged_table, "ASV_table.merged.txt", sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) # print summary print(paste0("Output = ", length(colnames(merged_table)), " ASVs, with a total of ", sum(rowSums(merged_table)), " sequences.")) } ____________________________________________________ .. _taxAssign16S: Taxonomy assignment ~~~~~~~~~~~~~~~~~~~ | Assign taxonomy with **RDP classifier** (assignTaxonomy function in DADA2). | Herein using the `SILVA138.2 reference database `_ .. note:: Below, the script 'assignTaxonomy' is specifying the database used for the 16S data analyses. **For 18S and 12S, replace the reference database.** For 12S, if targeting **fish** one can use the `MitoFish database `_ with Sintax as in the script 'assign Taxonomy with SYNTAX'. For 18S, one can use the `Eukaryome reference database <"https://sisu.ut.ee/wp-content/uploads/sites/643/SINTAX_EUK_SSU_v2.0.zip">`_ with Sintax as in the script 'assign Taxonomy with SYNTAX'. .. code-block:: bash :caption: assignTaxonomy :linenos: #!/usr/bin/env Rscript # specify the query fasta file fasta=$"ASVs_TagJumpFiltered.fasta" # specify reference database reference_database="silva_nr99_v138.2_toSpecies_trainset.fa.gz" # load fasta file library("seqinr") fasta = read.fasta(fasta, seqtype = "DNA", as.string = TRUE, forceDNAtolower = FALSE, seqonly = FALSE) seq_names = getName(fasta) seqs = unlist(getSequence(fasta, as.string = TRUE)) # print number of ASVs paste("Number of input sequences = ", length(seq_names)) # assign taxonomy with dada2 'assignTaxonomy' library("dada2") set.seed(1) tax = assignTaxonomy(seqs, reference_database, multithread = TRUE, minBoot = 80, tryRC = TRUE, outputBootstraps = TRUE) # format and export taxonomy table tax_out = cbind(seq_names, tax$tax, tax$boot) colnames(tax_out)[1] = "Seq_name" write.table(tax_out, "taxonomy.txt", sep = "\t", col.names = NA, row.names = TRUE, quote = FALSE) ### output = taxonomy.txt .. code-block:: bash :caption: assign Taxonomy with SYNTAX :linenos: #!/bin/bash # download the EUKARYOME reference databse wget \ "https://sisu.ut.ee/wp-content/uploads/sites/643/SINTAX_EUK_SSU_v2.0.zip" # unzip the database and edit name unzip SINTAX_EUK_ITS_v2.0.zip -d EUKARYOME2.0_sintax # specify reference database reference_database="EUKARYOME2.0_sintax/SINTAX_EUK_SSU_v2.0.fasta" reference_database=$(realpath $reference_database) # get database names with full path # specify input fasta file cd ASV_table ASV_fasta="ASVs.ITSx.fasta" # Run SINTAX classifier vsearch -sintax $ASV_fasta \ -db $reference_database \ -tabbedout taxonomy.txt \ -strand both \ -sintax_cutoff 0.8 \ --threads 8 ____________________________________________________ .. note:: The final ASVs data is ``ASV_table_TagJumpFiltered.txt`` and ``ASVs_TagJumpFiltered.fasta`` in the ``ASV_table`` directory. In case of Merging multiple sequencing runs, the ASVs data is ``ASV_table.merged.txt`` / ``ASV_table.merged_collapsed.txt`` and ``ASVs.merged.fasta`` / ``ASVs.merged_collapsed.fasta`` in the ``multiRunDir``. ____________________________________________________ .. _clustering16S: Clustering ASVs to OTUs ~~~~~~~~~~~~~~~~~~~~~~~ | Clustering ASVs to OTUs with vsearch. .. code-block:: R :caption: get the size of ASVs :linenos: #!/usr/bin/env Rscript # specify input ASVs table and fasta ASV_table="ASV_table_TagJumpFiltered.txt" # specify ASV table file ASV_fasta="ASVs_TagJumpFiltered.fasta" # specify ASVs fasta file ################################ # remove "Sequence" column from the ASV table (if present) ASV_table = read.table(ASV_table, sep = "\t", check.names = F, header = T, row.names = 1) # check ASV table; if 1st col is sequence, then remove it if (colnames(ASV_table)[1] == "Sequence") { cat("## removing 'Sequence' column ... \n") ASV_table = ASV_table[, -1] } ################################ # add size annototation to ASV seqs library(Biostrings) # add 'sum' column ASV_table$sum = rowSums(ASV_table) # make ASV_sums object ASV_sums = setNames(ASV_table$sum, rownames(ASV_table)) # Read the FASTA file ASV_fasta = readDNAStringSet(ASV_fasta) # add ";size=*" to ASV_fasta names(ASV_fasta) = sapply(names(ASV_fasta), function(header) { paste0(header, ";size=", ASV_sums[header]) }) # write fasta file writeXStringSet(ASV_fasta, "ASVs.size.fasta", width = max(width(ASV_fasta))) .. code-block:: bash :caption: clustering with vsearch :linenos: #!/bin/bash # specify the clustering threshold clustering_thresh="0.97" # make output dir. output_dir="OTU_table" mkdir -p $output_dir export output_dir ### cluster ASVs using vsearch. vsearch --cluster_fast ASVs.size.fasta \ --id $clustering_thresh \ --iddef 2 \ --sizein \ --xsize \ --fasta_width 0 \ --centroids $output_dir/OTUs.fasta \ --uc $output_dir/OTUs.uc .. code-block:: R :caption: generate an OTU table based on the clustered ASVs (.uc file). :linenos: #!/usr/bin/Rscript # specify input ASV table (the same one as for 'get the size of ASVs') ASV_table="ASV_table_TagJumpFiltered.txt" # read output dir output_dir = Sys.getenv('output_dir') # read output from vsearch clustering (-uc OTU.uc) inp_UC = file.path(output_dir, "OTUs.uc") ################################ library(data.table) # load input data - ASV table ASV_table = fread(file = ASV_table, header = TRUE, sep = "\t") # check ASV table; if 2nd col is 'Sequence', then remove it if (colnames(ASV_table)[2] == "Sequence") { cat("## removing 'Sequence' column ... \n") ASV_table = ASV_table[, -2] } ## Load input data - UC mapping file UC = fread(file = inp_UC, header = FALSE, sep = "\t") UC = UC[ V1 != "S" ] UC[, ASV := tstrsplit(V9, ";", keep = 1) ] UC[, OTU := tstrsplit(V10, ";", keep = 1) ] UC[V1 == "C", OTU := ASV ] UC = UC[, .(ASV, OTU)] # convert ASV table to long format ASV = melt(data = ASV_table, id.vars = colnames(ASV_table)[1], variable.name = "SampleID", value.name = "Abundance") ASV = ASV[ Abundance > 0 ] # add colnames, to make sure 1st is 'ASV' colnames(ASV) = c("ASV", "SampleID", "Abundance") # add OTU IDs ASV = merge(x = ASV, y = UC, by = "ASV", all.x = TRUE) # summarize OTU = ASV[ , .(Abundance = sum(Abundance, na.rm = TRUE)), by = c("SampleID", "OTU")] # reshape OTU table to wide format OTU_table = dcast(data = ASV, formula = OTU ~ SampleID, value.var = "Abundance", fun.aggregate = sum, fill = 0) # write OTU table # OTU names correspond to most abundant ASV in an OTU fwrite(x = OTU_table, file = file.path(output_dir, "OTU_table.txt"), sep = "\t") ____________________________________________________ .. _postclustering16S: Post-clustering ~~~~~~~~~~~~~~~ Post-cluster OTUs with LULU to merge consistently co-occurring 'daughter-OTUs'. .. code-block:: bash :caption: generate match list for post-clustering :linenos: #!/bin/bash # go to directrory that contains OTUs cd $output_dir # 'OTU_table' in this case # make blast database for post-clustering makeblastdb -in OTUs.fasta -parse_seqids -dbtype nucl # generate match list for post-clustering blastn -db OTUs.fasta \ -outfmt '6 qseqid sseqid pident' \ -out match_list.txt \ -qcov_hsp_perc 75 \ -perc_identity 90 \ -query OTUs.fasta \ -num_threads 20 .. code-block:: R :caption: run LULU post-clustering :linenos: #!/usr/bin/Rscript # specify minimum threshold of sequence similarity considering any OTU as an error of another min_match = "90" # specify OTU table OTU_table="OTU_table.txt" ################################ library(devtools) # load OTU table and match list otutable = read.table(OTU_table, header = T, row.names = 1, sep = "\t") matchlist = read.table("match_list.txt") curated_result = lulu::lulu(otutable, matchlist, minimum_match = min_match) # write post-clustered OTU table to file curated_table = curated_result$curated_table curated_table = cbind(OTU = rownames(curated_table), curated_table) write.table(curated_table, file ="OTU_table_LULU.txt", sep = "\t", row.names = F, quote = FALSE) write.table(curated_result$discarded_otus, file ="merged_units.lulu", col.names = FALSE, quote = FALSE) .. note:: Note that if the sample names start with a number, then the output OTU table may contain "X" prefix in the sample names. .. code-block:: bash :caption: match OTUs.fasta and taxonomy table with post-clustered table (OTU_table_LULU) :linenos: #!/bin/bash # specify post-clustered table OTU_table="OTU_table_LULU.txt" # specify pre post-clustered OTUs fasta file OTUs_fasta="OTUs.fasta" # get matching OTUs awk 'NR>1{print $1}' $OTU_table > OTUs_LULU.list cat $OTUs_fasta | \ seqkit grep -w 0 -f OTUs_LULU.list > OTUs_LULU.fasta # get matching taxonomy results for OTUs head -n 1 ../taxonomy.txt > taxonomy.txt cat ../taxonomy.txt | \ grep -wf OTUs_LULU.list >> taxonomy.txt # remove unnecessary files rm OTUs.fasta.n* # move OTU_table two directories down cd .. mv $output_dir ../.. .. note:: The final OTUs data is ``OTU_table_LULU.txt`` and ``OTUs_LULU.fasta`` in the ``OTU_table`` directory. The matching taxonomy file is ``taxonomy.txt`` in the ``OTU_table`` directory. Check for the non-target OTUs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ It is often the case that universal metabarcoding primers amplify also non-target DNA regions and/or non-target taxa. Here, we are interesed only in Bacteria/Archaea, thus we should get rid of the off-target noise before proceeding with relevant statistical analyses. For that, `check this section in the PipeCraft2 documentation `_. ____________________________________________________ We acknowledge `CSC - IT Center for Science `_, Finland, for computational resources while building and testing this workflows. ____________________________________________________ |logo_BGE_small| |eufund| |chfund| |ukrifund|