.. |logo_BGE_alpha| image:: _static/logo_BGE_alpha.png
:width: 300
:alt: Alternative text
:target: https://biodiversitygenomics.eu/
.. |eufund| image:: _static/eu_co-funded.png
:width: 200
:alt: Alternative text
.. |chfund| image:: _static/ch-logo-200x50.png
:width: 210
:alt: Alternative text
.. |ukrifund| image:: _static/ukri-logo-200x59.png
:width: 150
:alt: Alternative text
.. |logo_BGE_small| image:: _static/logo_BGE_alpha.png
:width: 120
:alt: Alternative text
:target: https://biodiversitygenomics.eu/
.. raw:: html
.. role:: red
.. raw:: html
.. role:: yellow-background
|logo_BGE_alpha|
Other taxonomy assignment tools
*******************************
`BOLDdigger3 `_
=============================================================
BOLDigger3 automates the taxonomic
identification of DNA sequences against the Barcode of Life Data System (BOLD v5) databases,
using BOLD's identification service.
.. code-block:: bash
:caption: BOLDdigger3
:linenos:
#!/bin/bash
# specify the query fasta file
fasta=$"ASVs.fasta"
# run BOLDdigger3
boldigger3 identify $fasta --db 2 --mode 3 --thresholds 97 95 90 85
## Or run BOLDdigger3 by specifying the python version if you have multiple ones installed
# py -3.13 -m boldigger3 identify cox1renamedkoopia.fasta --db 2 --mode 3 --thresholds 97 95
| ``--db`` **2** specifies the BOLD v5 database of "**ANIMAL SPECIES-LEVEL LIBRARY (PUBLIC + PRIVATE)**",
| ``--mode`` **3** specifies the mode of identification if "**Exhaustive Search**".
| ``--thresholds`` **97 95 90 85** specify the thresholds for Species-, Genus-, Family-, and Order-level identifications.
Check the `BOLDdigger3 documentation `_ for more details
about the options.
___________________________________________________
`BLAST `_
========================================================
| Reference database is a fasta file; no special formatting is required.
.. code-block:: bash
:caption: BLAST
:linenos:
#!/bin/bash
# specify the query fasta file
fasta=$"ASVs.fasta"
# specify reference database for BLAST
reference_database="../CO1Classifier_v5.1.0/CO1Classifier_v5.1.0.fasta"
reference_database=$(realpath $reference_database) # get full directory path
## if the database is just in fasta format, then convert it to BLAST format
### Check and assign BLAST database
d1=$(echo $reference_database | awk 'BEGIN{FS=OFS="."}{print $NF}')
#make blast database if db is not formatted for BLAST
db_dir=$(dirname $reference_database)
check_db_presence=$(ls -1 $db_dir/*.nhr 2>/dev/null | wc -l)
if (( $check_db_presence != 0 )); then
if [[ $d1 == "fasta" ]] || [[ $d1 == "fa" ]] || \
[[ $d1 == "fas" ]] || [[ $d1 == "fna" ]] || \
[[ $d1 == "ffn" ]]; then
database=$"-db $reference_database"
elif [[ $d1 == "ndb" ]] || [[ $d1 == "nhr" ]] || \
[[ $d1 == "nin" ]] || [[ $d1 == "not" ]] || \
[[ $d1 == "nsq" ]] || [[ $d1 == "ntf" ]] || \
[[ $d1 == "nto" ]]; then
reference_database=$(echo $reference_database | awk 'BEGIN{FS=OFS="."}NF{NF-=1};1')
database=$"-db $reference_database"
fi
elif [[ $d1 == "fasta" ]] || [[ $d1 == "fa" ]] || \
[[ $d1 == "fas" ]] || [[ $d1 == "fna" ]] || \
[[ $d1 == "ffn" ]]; then
printf '%s\n' "Note: converting fasta formatted database for BLAST"
makeblastdb -in $reference_database -input_type fasta -dbtype nucl
database=$"-db $reference_database"
fi
#BLAST
printf '%s\n' "# Running BLAST for $(grep -c "^>" $fasta) sequences"
blastn -strand plus \
-num_threads 20 \
-query $fasta \
$database \
-out 10BestHits.txt -task blastn \
-max_target_seqs 10 -evalue=0.001 \
-word_size=7 -reward=1 \
-penalty=-1 -gapopen=1 -gapextend=2 \
-outfmt "6 qseqid stitle qlen slen qstart qend sstart send evalue length nident mismatch gapopen gaps sstrand qcovs pident"
#qseqid = Query Seq-id
#qlen = Query sequence length
#sacc = Subject accession
#slen = Subject sequence length
#qstart = Start of alignment in query
#qend = End of alignment in query
#sstart = Start of alignment in subject
#send = End of alignment in subject
#evalue = Expect value
#length = Alignment length
#pident = Percentage of identical matches
#nident = Number of identical matches
#mismatch = Number of mismatches
#gapopen = Number of gap openings
#gaps = Total number of gaps
#1st_hit = BLAST 1st hit
#sstrand = Subject Strand
#qcovs = Query Coverage Per Subject
### parse BLAST 1st hit
awk 'BEGIN{FS="\t"}''!seen[$1]++' 10BestHits.txt > BLAST_1st_hit.txt
#check which seqs got a hit
gawk 'BEGIN{FS="\t"}{print $1}' < BLAST_1st_hit.txt | \
uniq > gothits.names
#add no_hits flag
seqkit seq -n $fasta > $fasta.names
grep -v -w -F -f gothits.names $fasta.names | \
sed -e 's/$/\tNo_significant_similarity_found/' >> BLAST_1st_hit.txt
#add header
sed -i '1 i\
qseqid\t1st_hit\tqlen\tslen\tqstart\tqend\tsstart\tsend\tevalue\tlength\tnident\tmismatch\tgapopen\tgaps\tsstrand\tqcovs\tpident' \
BLAST_1st_hit.txt
#remove unnecessary files
rm *.names
`RDP-classifier `_
====================================================================
| Taxonomy assignment with the RDP-classifier against `CO1Classifier v5.1.0 database. `_
| **---** `Download the CO1Classifier v5.1.0 for RDP here (click) `_ **---**
.. code-block:: bash
:caption: assign taxonomy with RDP-classifier
:linenos:
#!/bin/bash
# download the CO1Classifier reference databse
wget \
"https://github.com/terrimporter/CO1Classifier/releases/download/RDP-COI-v5.1.0/RDP_COIv5.1.0.zip"
# unzip the database and edit name
unzip RDP_COIv5.1.0.zip && mv mydata CO1Classifier_v5.1.0_RDP
# specify reference database for RDP
reference_database="CO1Classifier_v5.1.0_RDP/rRNAClassifier.properties"
reference_database=$(realpath $reference_database) # get database names with full path
# specify input fasta file
cd ASV_table
ASV_fasta="ASVs_TagJumpFiltered.fasta"
ASV_fasta_tmp="ASVs_TagJumpFiltered_minmax.fasta"
# select by size to only retain ASVs that are within the range of
# expected variation. In this case we set it to 400 up to 430 bps
vsearch --fastx_filter $ASV_fasta \
--fastq_minlen 400 \
--fastq_maxlen 430 \
--fastaout $ASV_fasta_tmp
mv $ASV_fasta_tmp $ASV_fasta
# Run RDP-classifier
time rdp_classifier \
-Xmx12g \
classify \
-t $reference_database \
-f allrank \
-o RDP.taxonomy.txt \
-q $ASV_fasta
____________________________________________________
Get target taxa from RDP-classifier results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| This part filters the ASV dataset to include only target taxonomic group for the following analyses.
| For example, if you are interested in Metazoa, then discard all ASVs that do not match to the target taxon based on the user defined threshold (default = 0.8).
.. code-block:: R
:caption: get only target taxon annotations
:linenos:
#!/usr/bin/env Rscript
### Filter dataset based on RDP classifier results to include target taxa
# specify taxon and threshold
taxon="Metazoa" # target taxonomic group(s);
# when specifying multiple groups, they should be from the same taxonomic
# level. Separator is "," (e.g., "Hymenoptera, Lepidoptera")
tax_level="kingdom" # allowed levels: kingdom | phylum | class | order | family | genus
threshold="0.8" # threshold for considering an ASV as a target taxon;
# default is 0.8
# species_bootstrap = 0.9 # threshold for species-level identification;
# default is 0.9
# specify the ASV table and ASVs.fasta file that would be filtered to include only target taxa
ASV_fasta = "ASVs_TagJumpFiltered.fasta"
ASV_table = "ASV_table_TagJumpFiltered.txt"
# specify the RDP-classifier output file (taxonomy file)
taxtab="RDP.taxonomy.txt"
#--------------------------------------#
library(stringr)
library(dplyr)
# read ASV table
table = read.table(ASV_table, sep = "\t", check.names = F, header = T, row.names = 1)
# read taxonomy table
tax = read.table(taxtab, sep = "\t", check.names = F, row.names = 1)
cat("\n Input =", nrow(tax), "features.\n")
# remove not needed columns from tax dataframe
tax = tax[, -c(1, 2, 3, 4, 6, 9, 12, 15, 18, 21, 24, 27)]
# assign colnames for tax
colnames(tax) = c("superkingdom", "superkingdom_BootS",
"kingdom", "kingdom_BootS",
"phylum","phylum_BootS",
"class", "class_BootS",
"order", "order_BootS",
"family", "family_BootS",
"genus", "genus_BootS",
"species", "species_BootS")
# taxon list
taxon_list = strsplit(taxon, ", ")[[1]]
### extract only target-taxon ASVs from the 'raw' RDP results
tax_filtered = tax %>%
filter(.data[[tax_level]] %in% taxon_list)
### change all tax ranks to "unclassified_*" when
# the bootstrap values is less than the specified threshold
#kingdom
tax_filtered = tax_filtered %>% mutate(kingdom = ifelse(kingdom_BootS <
threshold, paste0("unclassified_", superkingdom), as.character(kingdom)))
#phylum
tax_filtered = tax_filtered %>% mutate(phylum = ifelse(phylum_BootS <
threshold, paste0("unclassified_", kingdom), as.character(phylum)))
#replace potential "unclassified_unclassified_" with "unclassified_"
tax_filtered$class = stringr::str_replace(tax_filtered$class, "unclassified_unclassified_",
"unclassified_")
#class
tax_filtered = tax_filtered %>% mutate(class = ifelse(class_BootS <
threshold, paste0("unclassified_", phylum), as.character(class)))
#replace potential "unclassified_unclassified_" with "unclassified_"
tax_filtered$class = stringr::str_replace(tax_filtered$class, "unclassified_unclassified_",
"unclassified_")
#order
tax_filtered = tax_filtered %>% mutate(order = ifelse(order_BootS <
threshold, paste0("unclassified_", class), as.character(order)))
#replace potential "unclassified_unclassified_" with "unclassified_"
tax_filtered$order = stringr::str_replace(tax_filtered$order, "unclassified_unclassified_",
"unclassified_")
#family
tax_filtered = tax_filtered %>% mutate(family = ifelse(family_BootS <
threshold, paste0("unclassified_", order), as.character(family)))
#replace potential "unclassified_unclassified_" with "unclassified_"
tax_filtered$family = stringr::str_replace(tax_filtered$family, "unclassified_unclassified_",
"unclassified_")
#genus
tax_filtered = tax_filtered %>% mutate(genus = ifelse(genus_BootS <
threshold, paste0("unclassified_", family), as.character(genus)))
#replace potential "unclassified_unclassified_" with "unclassified_"
tax_filtered$genus = stringr::str_replace(tax_filtered$genus, "unclassified_unclassified_",
"unclassified_")
# species to genus_sp when the bootstrap values is < "species_bootstrap" threshold
tax_filtered = tax_filtered %>% mutate(species = ifelse(species_BootS < species_bootstrap,
paste0(genus, "_sp"), species))
### count occurrences of each taxon in df (RDP results)
count_taxa = function(df, taxa) {
sapply(taxa, function(taxon) sum(apply(df, 1, function(row) any(row == taxon))))
}
taxon_counts = count_taxa(tax_filtered, taxon_list)
# Check the counts
if (all(taxon_counts == 0)) {
print("ERROR: None of the specified taxa are present in the RDP results.")
} else {
if (any(taxon_counts == 0)) {
warning("One or more of the specified taxa are not present in the RDP results.")
}
print(taxon_counts)
}
### extract only target-taxon ASVs from the 'threshold filtered' RDP results
tax_filtered_thresh = tax_filtered %>%
filter(.data[[tax_level]] %in% taxon_list)
# write filtered RDP taxonomy table
tax_filtered_thresh = cbind(ASV = rownames(tax_filtered_thresh), tax_filtered_thresh)
write.table(tax_filtered_thresh,
file = "RDP.taxonomy.filt.txt",
quote = F,
row.names = F,
sep = "\t")
### filter the ASV table to match ASVs that were kept in the tax_filtered table
table_filt = table[rownames(table) %in% rownames(tax_filtered_thresh), ]
### check ASV table; if 1st col is sequence, then remove it for metaMATE
if (colnames(table_filt)[1] == "Sequence") {
cat(";; 2nd column was 'Sequence', removing this ... \n")
table_filt = table_filt[, -1]
}
# write filtered table
table_filt = cbind(ASV = rownames(table_filt), table_filt)
write.table(table_filt,
file = paste0(sub("\\.[^.]*$", "_tax_filt.txt", ASV_table)),
quote = F,
row.names = F,
sep = "\t")
# filter ASV_fasta
library(Biostrings)
fasta = readDNAStringSet(ASV_fasta)
fasta.tax_filt = fasta[names(fasta) %in% rownames(table_filt)]
# write filtered ASV_fasta
writeXStringSet(fasta.tax_filt,
paste0(sub("\\.[^.]*$", "_tax_filt.fasta", ASV_fasta)),
width = max(width(fasta.tax_filt)))
____________________________________________________
|logo_BGE_small| |eufund| |chfund| |ukrifund|