Other taxonomy assignment tools

BOLDdigger3 

BOLDigger3 automates the taxonomic identification of DNA sequences against the Barcode of Life Data System (BOLD v5) databases, using BOLD’s identification service.

BOLDdigger3

 #!/bin/bash

 # specify the query fasta file
 fasta=$"ASVs.fasta"

 # run BOLDdigger3
 boldigger3 identify $fasta --db 2 --mode 3 --thresholds 97 95 90 85

 ## Or run BOLDdigger3 by specifying the python version if you have multiple ones installed
 # py -3.13 -m boldigger3 identify cox1renamedkoopia.fasta --db 2 --mode 3 --thresholds 97 95

--db 2 specifies the BOLD v5 database of “ANIMAL SPECIES-LEVEL LIBRARY (PUBLIC + PRIVATE)”,
--mode 3 specifies the mode of identification if “Exhaustive Search”.
--thresholds 97 95 90 85 specify the thresholds for Species-, Genus-, Family-, and Order-level identifications.

Check the BOLDdigger3 documentation for more details about the options.

BLAST 

Reference database is a fasta file; no special formatting is required.

BLAST

 #!/bin/bash

 # specify the query fasta file
 fasta=$"ASVs.fasta"

 # specify reference database for BLAST
 reference_database="../CO1Classifier_v5.1.0/CO1Classifier_v5.1.0.fasta"
 reference_database=$(realpath $reference_database) # get full directory path

 ## if the database is just in fasta format, then convert it to BLAST format

 ### Check and assign BLAST database
 d1=$(echo $reference_database | awk 'BEGIN{FS=OFS="."}{print $NF}')
 #make blast database if db is not formatted for BLAST
 db_dir=$(dirname $reference_database)
 check_db_presence=$(ls -1 $db_dir/*.nhr 2>/dev/null | wc -l)
 if (( $check_db_presence != 0 )); then
     if [[ $d1 == "fasta" ]] || [[ $d1 == "fa" ]] || \
         [[ $d1 == "fas" ]] || [[ $d1 == "fna" ]] || \
         [[ $d1 == "ffn" ]]; then
         database=$"-db $reference_database"
     elif [[ $d1 == "ndb" ]] || [[ $d1 == "nhr" ]] || \
          [[ $d1 == "nin" ]] || [[ $d1 == "not" ]] || \
          [[ $d1 == "nsq" ]] || [[ $d1 == "ntf" ]] || \
          [[ $d1 == "nto" ]]; then
         reference_database=$(echo $reference_database | awk 'BEGIN{FS=OFS="."}NF{NF-=1};1')
         database=$"-db $reference_database"
     fi
 elif [[ $d1 == "fasta" ]] || [[ $d1 == "fa" ]] || \
      [[ $d1 == "fas" ]] || [[ $d1 == "fna" ]] || \
      [[ $d1 == "ffn" ]]; then
         printf '%s\n' "Note: converting fasta formatted database for BLAST"
         makeblastdb -in $reference_database -input_type fasta -dbtype nucl
         database=$"-db $reference_database"
 fi

 #BLAST
 printf '%s\n' "# Running BLAST for $(grep -c "^>" $fasta) sequences"
 blastn -strand plus \
             -num_threads 20 \
             -query $fasta \
             $database \
             -out 10BestHits.txt -task blastn \
             -max_target_seqs 10 -evalue=0.001 \
             -word_size=7 -reward=1 \
             -penalty=-1 -gapopen=1 -gapextend=2 \
 -outfmt "6 qseqid stitle qlen slen qstart qend sstart send evalue length nident mismatch gapopen gaps sstrand qcovs pident"

 #qseqid = Query Seq-id
 #qlen = Query sequence length
 #sacc = Subject accession
 #slen = Subject sequence length
 #qstart = Start of alignment in query
 #qend = End of alignment in query
 #sstart = Start of alignment in subject
 #send = End of alignment in subject
 #evalue = Expect value
 #length = Alignment length
 #pident = Percentage of identical matches
 #nident = Number of identical matches
 #mismatch = Number of mismatches
 #gapopen = Number of gap openings
 #gaps = Total number of gaps
 #1st_hit = BLAST 1st hit
 #sstrand = Subject Strand
 #qcovs = Query Coverage Per Subject

 ### parse BLAST 1st hit
 awk 'BEGIN{FS="\t"}''!seen[$1]++' 10BestHits.txt > BLAST_1st_hit.txt
 #check which seqs got a hit
 gawk 'BEGIN{FS="\t"}{print $1}' < BLAST_1st_hit.txt | \
     uniq > gothits.names
 #add no_hits flag
 seqkit seq -n $fasta > $fasta.names
 grep -v -w -F -f gothits.names $fasta.names | \
     sed -e 's/$/\tNo_significant_similarity_found/' >> BLAST_1st_hit.txt
 #add header
 sed -i '1 i\
 qseqid\t1st_hit\tqlen\tslen\tqstart\tqend\tsstart\tsend\tevalue\tlength\tnident\tmismatch\tgapopen\tgaps\tsstrand\tqcovs\tpident' \
 BLAST_1st_hit.txt

 #remove unnecessary files
 rm *.names

RDP-classifier 

Taxonomy assignment with the RDP-classifier against CO1Classifier v5.1.0 database.

— Download the CO1Classifier v5.1.0 for RDP here (click) —

assign taxonomy with RDP-classifier

 #!/bin/bash

 # download the CO1Classifier reference databse
 wget \
   "https://github.com/terrimporter/CO1Classifier/releases/download/RDP-COI-v5.1.0/RDP_COIv5.1.0.zip"
 # unzip the database and edit name
 unzip RDP_COIv5.1.0.zip && mv mydata CO1Classifier_v5.1.0_RDP

 # specify reference database for RDP
 reference_database="CO1Classifier_v5.1.0_RDP/rRNAClassifier.properties"
 reference_database=$(realpath $reference_database) # get database names with full path

 # specify input fasta file
 cd ASV_table
 ASV_fasta="ASVs_TagJumpFiltered.fasta"
 ASV_fasta_tmp="ASVs_TagJumpFiltered_minmax.fasta"

 # select by size to only retain ASVs that are within the range of
 # expected variation. In this case we set it to 400 up to 430 bps
 vsearch --fastx_filter $ASV_fasta \
         --fastq_minlen 400 \
         --fastq_maxlen 430 \
         --fastaout $ASV_fasta_tmp

 mv $ASV_fasta_tmp $ASV_fasta

 # Run RDP-classifier
 time rdp_classifier \
         -Xmx12g \
         classify \
         -t $reference_database \
         -f allrank \
         -o RDP.taxonomy.txt \
         -q $ASV_fasta

Get target taxa from RDP-classifier results

This part filters the ASV dataset to include only target taxonomic group for the following analyses.

For example, if you are interested in Metazoa, then discard all ASVs that do not match to the target taxon based on the user defined threshold (default = 0.8).

get only target taxon annotations

 #!/usr/bin/env Rscript
 ### Filter dataset based on RDP classifier results to include target taxa

 # specify taxon and threshold
 taxon="Metazoa"      # target taxonomic group(s);
                      # when specifying multiple groups, they should be from the same taxonomic
                      # level. Separator is "," (e.g., "Hymenoptera, Lepidoptera")
 tax_level="kingdom"  # allowed levels: kingdom | phylum | class | order | family | genus
 threshold="0.8"      # threshold for considering an ASV as a target taxon;
                      # default is 0.8

 # species_bootstrap = 0.9 # threshold for species-level identification;
                           # default is 0.9

 # specify the ASV table and ASVs.fasta file that would be filtered to include only target taxa
 ASV_fasta = "ASVs_TagJumpFiltered.fasta"
 ASV_table = "ASV_table_TagJumpFiltered.txt"

 # specify the RDP-classifier output file (taxonomy file)
 taxtab="RDP.taxonomy.txt"

 #--------------------------------------#
 library(stringr)
 library(dplyr)

 # read ASV table
 table = read.table(ASV_table, sep = "\t", check.names = F, header = T, row.names = 1)

 # read taxonomy table
 tax = read.table(taxtab, sep = "\t", check.names = F, row.names = 1)
 cat("\n Input =", nrow(tax), "features.\n")
 # remove not needed columns from tax dataframe
 tax = tax[, -c(1, 2, 3, 4, 6, 9, 12, 15, 18, 21, 24, 27)]
 # assign colnames for tax
 colnames(tax) = c("superkingdom", "superkingdom_BootS",
                 "kingdom", "kingdom_BootS",
                 "phylum","phylum_BootS",
                 "class", "class_BootS",
                 "order", "order_BootS",
                 "family", "family_BootS",
                 "genus", "genus_BootS",
                 "species", "species_BootS")

 # taxon list
 taxon_list = strsplit(taxon, ", ")[[1]]

 ### extract only target-taxon ASVs from the 'raw' RDP results
 tax_filtered = tax %>%
     filter(.data[[tax_level]] %in% taxon_list)

 ### change all tax ranks to "unclassified_*" when
     # the bootstrap values is less than the specified threshold
 #kingdom
 tax_filtered = tax_filtered %>% mutate(kingdom = ifelse(kingdom_BootS <
     threshold, paste0("unclassified_", superkingdom), as.character(kingdom)))
 #phylum
 tax_filtered = tax_filtered %>% mutate(phylum = ifelse(phylum_BootS <
     threshold, paste0("unclassified_", kingdom), as.character(phylum)))
 #replace potential "unclassified_unclassified_" with "unclassified_"
 tax_filtered$class = stringr::str_replace(tax_filtered$class, "unclassified_unclassified_",
                                                                         "unclassified_")
 #class
 tax_filtered = tax_filtered %>% mutate(class = ifelse(class_BootS <
     threshold, paste0("unclassified_", phylum), as.character(class)))
 #replace potential "unclassified_unclassified_" with "unclassified_"
 tax_filtered$class = stringr::str_replace(tax_filtered$class, "unclassified_unclassified_",
                                                                         "unclassified_")
 #order
 tax_filtered = tax_filtered %>% mutate(order = ifelse(order_BootS <
     threshold, paste0("unclassified_", class), as.character(order)))
 #replace potential "unclassified_unclassified_" with "unclassified_"
 tax_filtered$order = stringr::str_replace(tax_filtered$order, "unclassified_unclassified_",
                                                                         "unclassified_")
 #family
 tax_filtered = tax_filtered %>% mutate(family = ifelse(family_BootS <
     threshold, paste0("unclassified_", order), as.character(family)))
 #replace potential "unclassified_unclassified_" with "unclassified_"
 tax_filtered$family = stringr::str_replace(tax_filtered$family, "unclassified_unclassified_",
                                                                         "unclassified_")
 #genus
 tax_filtered = tax_filtered %>% mutate(genus = ifelse(genus_BootS <
     threshold, paste0("unclassified_", family), as.character(genus)))
 #replace potential "unclassified_unclassified_" with "unclassified_"
 tax_filtered$genus = stringr::str_replace(tax_filtered$genus, "unclassified_unclassified_",
                                                                         "unclassified_")

 # species to genus_sp when the bootstrap values is < "species_bootstrap" threshold
 tax_filtered = tax_filtered %>% mutate(species = ifelse(species_BootS < species_bootstrap,
                                                     paste0(genus, "_sp"), species))

 ### count occurrences of each taxon in df (RDP results)
 count_taxa = function(df, taxa) {
 sapply(taxa, function(taxon) sum(apply(df, 1, function(row) any(row == taxon))))
 }
 taxon_counts = count_taxa(tax_filtered, taxon_list)

 # Check the counts
 if (all(taxon_counts == 0)) {
     print("ERROR: None of the specified taxa are present in the RDP results.")
 } else {
     if (any(taxon_counts == 0)) {
     warning("One or more of the specified taxa are not present in the RDP results.")
     }
     print(taxon_counts)
 }

 ### extract only target-taxon ASVs from the 'threshold filtered' RDP results
 tax_filtered_thresh = tax_filtered %>%
     filter(.data[[tax_level]] %in% taxon_list)
 # write filtered RDP taxonomy table
 tax_filtered_thresh = cbind(ASV = rownames(tax_filtered_thresh), tax_filtered_thresh)
 write.table(tax_filtered_thresh,
             file = "RDP.taxonomy.filt.txt",
             quote = F,
             row.names = F,
             sep = "\t")

 ### filter the ASV table to match ASVs that were kept in the tax_filtered table
 table_filt = table[rownames(table) %in% rownames(tax_filtered_thresh), ]

 ### check ASV table; if 1st col is sequence, then remove it for metaMATE
 if (colnames(table_filt)[1] == "Sequence") {
     cat(";; 2nd column was 'Sequence', removing this ... \n")
     table_filt = table_filt[, -1]
 }

 # write filtered table
 table_filt = cbind(ASV = rownames(table_filt), table_filt)
 write.table(table_filt,
             file = paste0(sub("\\.[^.]*$", "_tax_filt.txt", ASV_table)),
             quote = F,
             row.names = F,
             sep = "\t")

 # filter ASV_fasta
 library(Biostrings)
 fasta = readDNAStringSet(ASV_fasta)
 fasta.tax_filt = fasta[names(fasta) %in% rownames(table_filt)]
 # write filtered ASV_fasta
 writeXStringSet(fasta.tax_filt,
                 paste0(sub("\\.[^.]*$", "_tax_filt.fasta", ASV_fasta)),
                 width = max(width(fasta.tax_filt)))