.. |logo_BGE_alpha| image:: _static/logo_BGE_alpha.png
   :width: 300
   :alt: Alternative text
   :target: https://biodiversitygenomics.eu/

.. |eufund| image:: _static/eu_co-funded.png
   :width: 200
   :alt: Alternative text

.. |chfund| image:: _static/ch-logo-200x50.png
   :width: 210
   :alt: Alternative text

.. |ukrifund| image:: _static/ukri-logo-200x59.png
   :width: 150
   :alt: Alternative text

.. |logo_BGE_small| image:: _static/logo_BGE_alpha.png
   :width: 120
   :alt: Alternative text
   :target: https://biodiversitygenomics.eu/

.. raw:: html

.. role:: red

|logo_BGE_alpha|

.. _data_sharing:

FAIR publication of eDNA results
********************************

A sequence (and/or raw sequence data) with coordinates and a timestamp is a **valuable biodiversity occurrence** that can be useful beyond its original purpose. To realize this potential, DNA-derived data must be discoverable through biodiversity data platforms (see `Abarenkov et al., 2025 `_).

**FAIR (Findable, Accessible, Interoperable, and Reusable)** publication of eDNA results can be achieved by depositing samples (with metadata) and sequencing data in public repositories such as the `European Nucleotide Archive (ENA) `_, `PlutoF `_ and `GBIF `_. This section describes the steps for publishing DNA-derived data in these repositories.

__________________________________________________

.. _uploading_sample_records_to_ena:

Uploading sample records to ENA
===============================

`ENA (European Nucleotide Archive) `_ is an internationally recognized public repository for nucleotide sequence data and associated sample metadata, ensuring your data is **FAIR**. If the data is already registered in `PlutoF `_, it can be uploaded to ENA using the **Publishing Lab** in PlutoF (see section :ref:`registering_samples_in_plutof`).

.. note::

   If the data is not registered in PlutoF, it can be uploaded to ENA manually (see `ENA documentation `_).
   When submitting metabarcoding data from environmental samples, set ``tax_id`` to **256318** and ``scientific_name`` to **metagenome**.

Using the **Publishing Lab** in PlutoF, users can submit sample records to the ENA database. The PlutoF platform acts as a broker for ENA, utilising its programmatic Webin submission service for sample data submission. The resulting BioSample identifiers (ENA IDs) are stored in PlutoF alongside the original sample records. All samples within a selected study (BioProject) can be submitted together, and the dataset can be updated later by re-publishing.

.. figure:: _static/plutof/to_ENA.png
   :width: 650
   :align: center

*To publish your dataset in ENA, go to Main menu -> Laboratories -> Publishing Lab -> ENA Datasets -> New*

**Steps to publish:**

| **1.** Select the project name from the autocomplete list (project **moderator rights** are required).
| **2.** Fill in any missing mandatory field values as required by ENA.
| **3.** Save the dataset.
| **4.** After administrator approval, publish the dataset to ENA.

.. note::

   Samples uploaded to ENA are treated as independent samples (BioSamples); that is, they are not linked to a BioProject. **Samples will be linked to a BioProject when the raw sequence data is associated with the samples.** See 'Uploading raw sequence data to ENA' below.

___________________________________________________

.. _upload_raw_sequence_data_to_ena:

Uploading raw sequence data to ENA
==================================

Once the samples have been uploaded to ENA through the PlutoF platform, the raw sequence data can be linked to them. To do so, first create a **BioProject** (:ref:`Register a study `), then proceed with the :ref:`sequencing data submission `.

.. _register_study:

Registering a study
-------------------

**To register a study**, go to `ENA website `_ -> log in -> **Register Study** (a study is also referred to as a project/BioProject).

.. figure:: _static/register_study.png
   :width: 650
   :align: center

A study can also be registered programmatically, `see here `_.

Specify the **"Study Name"**, **"Short descriptive study title"**, **"Detailed study abstract"** and **"Release date"** of the study.

.. figure:: _static/register_study2.png
   :width: 650
   :align: center

.. admonition:: Example

   - **Study Name**: "BGE High Mountain Systems"
   - **Short descriptive study title**: "High Mountain Systems case study within Biodiversity Genomics Europe project: arthropod community monitoring along altitudinal gradients using Malaise Traps"
   - **Detailed study abstract**: "This study is part of the Biodiversity Genomics Europe (BGE) initiative. The project evaluates spatial and temporal variation in arthropod diversity along altitudinal gradients. Sampling was conducted in mountain ranges across seven European countries. Within each country, five elevation sites are selected along an altitudinal gradient. At each elevation, Townes-style Malaise traps were deployed and continuously sampled for 20 consecutive weeks in 2023. Samples were preserved in 96% ethanol. COI amplicons were generated with BF3 (CCHGAYATRGCHTTYCCHCG) and BR2 (CDGGRTGNCCRAARAAYCA) primers. Laboratory protocols are available at https://bioscanflow.readthedocs.io"
   - **Release date**: "2026-01-01"

`Click here to open ENA user guide for registering a study `_.

___________________________________________________

.. _submit_sequencing_data:

Submitting sequencing data
--------------------------

After the study (BioProject) and samples (BioSamples) are registered, the sequencing data can be submitted.

.. note::

   The samples were registered in ENA through the PlutoF platform, so each sample already has a BioSample code that will now be associated with the sequencing data.
   See :ref:`how to export the sample IDs (Material Sample IDs) and the BioSample IDs ` from PlutoF.

**Steps to submit sequencing data:**

1. :ref:`Upload the sequencing data to ENA via FTP `
2. :ref:`Download and fill in the spreadsheet template `
3. :ref:`Upload the filled spreadsheet template to ENA `

____________________________________________________

.. _upload_sequencing_data_to_ena:

**1. Upload the sequencing data to ENA via FTP**

There are several ways to submit the sequencing data to ENA (see `ENA documentation `_). Here, we will use the **FTP upload** method.

.. note::

   On Windows, **Windows Subsystem for Linux (WSL)** is recommended for uploading the sequencing data via FTP, as the built-in Windows FTP client may be problematic. See `here `_ for installation instructions. Once installed, you can use the following commands to upload the sequencing data to ENA via FTP.

.. code-block:: bash
   :caption: Upload the sequencing data to ENA via FTP

   # 1. Navigate to the DIRECTORY (folder) where the fastq files are located
   cd /path/to/fastq/files

   # 2. Upload the fastq files to ENA via FTP
   ftp webin.ebi.ac.uk
   # then enter your ENA username (Webin-#####) and password

   # 3. upload all fastq files in the current directory
   # Note that fastq files must be gzipped.
   # The next two commands are typed at the ftp> prompt:
   # "prompt" switches off the confirmation for each file,
   # "mput" uploads all fastq files in the current directory
   prompt
   mput *.fastq.gz

   # 4. exit FTP client when uploads are complete
   bye

__________________________________________________

.. _download_and_fill_in_spreadsheet_template:

**2. Download and fill in the spreadsheet template**

The sequence files must be uploaded to ENA (previous step) before the spreadsheet template is submitted.

**To now link the uploaded sequencing data to the samples (BioSamples)**, go to `ENA website `_ -> log in -> **Submit Reads**

.. figure:: _static/submit_sequencing_data.png
   :width: 650
   :align: center

|

Clicking on **Submit Reads** will open the following window:

.. figure:: _static/submit_sequencing_data2.png
   :width: 650
   :align: center

|

**Download spreadsheet template for Read submission**.
Here, we select ``Submit paired reads using two Fastq files`` since we have **Illumina paired-end data** *(select an appropriate option based on the type of data you are submitting)*.

Fill in the **tsv** (tab-separated) spreadsheet with the required data (**use only valid ASCII characters**). In the example below, the values in the **sample** column are the BioSample codes that were obtained when the samples were submitted to ENA via the PlutoF platform. The **study** column contains the BioProject code of the study registered in the previous step.

+-------------------+-------------------------------------------------------------------+
| Field             | Description                                                       |
+===================+===================================================================+
| sample            | BioSample codes                                                   |
+-------------------+-------------------------------------------------------------------+
| study             | BioProject (study) code                                           |
+-------------------+-------------------------------------------------------------------+
| instrument_model  | sequencing instrument model (here: "Illumina NovaSeq 6000")       |
+-------------------+-------------------------------------------------------------------+
| library_name      | library name (here, the in-house sample names,                    |
|                   | e.g. "BGE.HMS0001")                                               |
+-------------------+-------------------------------------------------------------------+
| library_source    | library source (here: "METAGENOMIC")                              |
+-------------------+-------------------------------------------------------------------+
| library_selection | library selection (here: "PCR")                                   |
+-------------------+-------------------------------------------------------------------+
| library_strategy  | library strategy (here: "AMPLICON")                               |
+-------------------+-------------------------------------------------------------------+
| library_layout    | library layout (here: "PAIRED")                                   |
+-------------------+-------------------------------------------------------------------+
| forward_file_name | forward fastq file name                                           |
+-------------------+-------------------------------------------------------------------+
| forward_file_md5  | 32-digit hexadecimal MD5 checksum used to verify the upload of    |
|                   | the forward fastq file. **See the code below to calculate it.**   |
+-------------------+-------------------------------------------------------------------+
| reverse_file_name | reverse fastq file name                                           |
+-------------------+-------------------------------------------------------------------+
| reverse_file_md5  | 32-digit hexadecimal MD5 checksum used to verify the upload of    |
|                   | the reverse fastq file. **See the code below to calculate it.**   |
+-------------------+-------------------------------------------------------------------+

.. code-block:: bash
   :caption: Calculate the MD5 checksum for the fastq files

   # navigate to the directory where the fastq files are located
   cd /path/to/fastq/files

   # calculate the MD5 checksums for the forward fastq files;
   # sub() strips the "*" that md5sum may prepend to the file name,
   # the output is "file name <TAB> checksum"
   for f in *.R1.fastq.gz; do
       md5sum "$f" | awk '{ sub(/^\*/, "", $2); print $2 "\t" $1 }'
   done > R1_md5sums.txt

   # calculate the MD5 checksums for the reverse fastq files
   for f in *.R2.fastq.gz; do
       md5sum "$f" | awk '{ sub(/^\*/, "", $2); print $2 "\t" $1 }'
   done > R2_md5sums.txt

   # open the MD5 checksum files and transfer the values to the spreadsheet template

|

Click on the image below to enlarge the example of a filled spreadsheet template.

.. figure:: _static/ENA_fastq_template.png
   :width: 690
   :align: center

.. important::

   **Always use the original spreadsheet template provided by ENA.** Do not delete any rows or change the order of columns in the spreadsheet template. Doing so may result in submission errors.

.. admonition:: Multiple R1 and R2 fastq files per sample

   It is common for **samples to have multiple sequencing runs** (i.e., one sample has more than one pair of R1 and R2 fastq files). In this case, a **sample that is associated with multiple sequencing runs** is represented by **multiple rows** in the spreadsheet template.

   *Click on the image below to enlarge the example of a filled spreadsheet template*

   .. figure:: _static/ENA_fastq_template2.png
      :width: 690
      :align: center

__________________________________________________

.. _upload_filled_spreadsheet_template_to_ena:

**3. Upload the filled spreadsheet template to ENA**

Once the spreadsheet is filled, it can be uploaded to ENA.

**Submit Reads** --> **Upload filled spreadsheet template for Read submission** --> **Submit Completed Spreadsheet**

.. figure:: _static/upload_fastq_tsv.png
   :width: 690
   :align: center

|

If everything is correct, the message "The submission was successful" will be displayed.

.. figure:: _static/submission_result.png
   :width: 470
   :align: center

Even when the Study is public, it **may take a few days** for the sequences to become available in ENA.

`Click here to open ENA user guide for submitting sequencing data `_.

__________________________________________________

.. _upload_sequences_to_plutof:

Uploading representative sequences of metabarcoding features to PlutoF
======================================================================

Representative sequences of metabarcoding features (OTUs/ASVs) and their taxonomy (and other information, such as read counts) per sample can be uploaded to `PlutoF `_. Each feature is then tied to a specific sample (Material Sample ID), which connects the sequence data to where and when it was collected and thereby enhances data reuse and downstream data sharing.

Steps to upload representative sequences:

1. :ref:`Download the template file for the import `.
2. :ref:`Fill in the template file `.
3. :ref:`Upload the template file `.

__________________________________________________

.. _download_sequence_template_file:

**1. Download the template file for the import**

A template file for the **representative sequence import** can be created and downloaded via **Import panel** --> Generate template by selecting the **Module "Sequence"** and **Form name "Sequence: HTS representative"**. Select the required fields (at minimum **Project**, **Type**, and **ID**).

.. figure:: _static/plutof/sequence_template.png
   :width: 690
   :align: center

|

.. _fill_in_sequence_template:

**2. Fill in the template file.**

Fill in the template file with the representative sequences of ASVs/OTUs per sample.
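As a rough illustration of filling the template programmatically, the sketch below converts a FASTA file of representative sequences into import rows for a single sample. It is a minimal example under stated assumptions and not part of the PlutoF tooling: the function name ``fasta_to_plutof_rows``, the file names, the project name and the sample ID are all hypothetical, and the header row only reuses a subset of the template's column names, so check it against the template you actually downloaded.

.. code-block:: bash
   :caption: Sketch: build import rows from a FASTA file (hypothetical names)

   #!/usr/bin/env bash
   set -eu

   # Hypothetical helper: print CSV rows for one sample from a FASTA file.
   # Usage: fasta_to_plutof_rows PROJECT MATERIAL_SAMPLE_ID FASTA_FILE
   fasta_to_plutof_rows() {
       awk -v project="$1" -v sample="$2" '
           function emit() {
               # print one CSV row per completed FASTA record
               if (id != "") print "\"" project "\"", "materialsample", sample, id, seq
           }
           BEGIN {
               OFS = ","
               print "Linked to.Project", "Linked to.Type", "Linked to.ID", "Sequence ID", "Sequence"
           }
           /^>/ { emit(); id = substr($1, 2); seq = ""; next }   # header line: start a new record
           { seq = seq $0 }                                       # a sequence may span several lines
           END { emit() }
       ' "$3"
   }

   # Demo with a tiny made-up FASTA file (two features, the first one wrapped)
   printf '>ASV_1\nACGT\nACGT\n>ASV_2\nTTGA\n' > demo_features.fasta
   fasta_to_plutof_rows "Demo project" "BGE.HMS0001" demo_features.fasta > demo_import.csv
   cat demo_import.csv

The remaining columns (for example ``Determination.Taxon name`` and ``Read count.Value``) are not covered by this sketch and would still need to be added, for instance by joining the output with your own taxonomy and read-count tables.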
The columns of the template are described in the table below.

+--------------------------------------+--------------------------------------------+
| Field                                | Description                                |
+======================================+============================================+
| Linked to.Project                    | project name in PlutoF                     |
+--------------------------------------+--------------------------------------------+
| Linked to.Type                       | here the type is **materialsample**        |
+--------------------------------------+--------------------------------------------+
| Linked to.ID                         | Material Sample ID                         |
+--------------------------------------+--------------------------------------------+
| Sequence ID                          | Feature (ASV/OTU) Sequence ID              |
+--------------------------------------+--------------------------------------------+
| Sequence                             | Feature (ASV/OTU) Sequence                 |
+--------------------------------------+--------------------------------------------+
| Sampling event.Sampling area.Country | Country where the sample was collected     |
+--------------------------------------+--------------------------------------------+
| Determination.Taxon name             | Assigned taxonomy of the feature (ASV/OTU) |
+--------------------------------------+--------------------------------------------+
| Read count.Value                     | Read count of the feature (ASV/OTU)        |
+--------------------------------------+--------------------------------------------+

*Click on the image below to enlarge the example of a filled sequence template (for one sample).*

.. figure:: _static/plutof/filled_sequence_template.png
   :width: 690
   :align: center

.. note::

   Leave the column ``Sampling event.Sampling area.Country`` **empty** to automatically associate the sequence with the sample coordinates in PlutoF.

__________________________________________________

.. _upload_sequence_template_file:

**3. Upload the template file**

Once the template file is filled, it can be uploaded to PlutoF via the **Import panel**.

**Import panel** --> **New** (new import process)

.. figure:: _static/plutof/import_new.png
   :width: 690
   :align: center

|

**Upload** the filled template file, then:

- specify **Module** ``Sequence``
- **Form name** ``Sequence: HTS representative``
- **Project** ``Your project name``
- **Check the box** for "Use source record's area and event if event columns are empty"
- **Match project sampling areas**
- --> **Save and start**

.. figure:: _static/plutof/save_and_start.png
   :width: 690
   :align: center

.. note::

   .. figure:: _static/plutof/error_parsing_source.png
      :width: 400
      :align: center

   When this **ERROR** message is displayed, save your CSV file as **CSV UTF-8**. In Excel, select **File** --> **Save As** --> **CSV (UTF-8)**. Then try the file upload again.

|

After clicking the ``Save and start`` button, the interactive import process will begin and guide you through the procedure. You will be notified if any **errors** occur during the import process. Most likely, you will need to fix the **synonyms**. This can be done interactively (see the image below). Once edited, press **Save and continue** to proceed with the import process.

.. figure:: _static/plutof/fix_synonyms.png
   :width: 500
   :align: center

|

When the import is complete, you may press the ``Back`` button.

.. figure:: _static/plutof/import_complete.png
   :width: 700
   :align: center

|

.. admonition:: DONE

   The sequences are associated with the sample(s).

   .. figure:: _static/plutof/related_records.png
      :width: 700
      :align: center

|

__________________________________________________

.. _publish_sequences_in_gbif:

Publishing metabarcoding features in GBIF
=========================================

Submitting metabarcoding biodiversity data to `GBIF `_ makes it **publicly discoverable and reusable** (the data becomes citable with a DOI). GBIF does not store sequence reads (raw data belong in repositories like ENA, and ASV/OTU representative sequences in PlutoF).
Instead, GBIF hosts **taxonomy assignments** (taxonomic name for an ASV/OTU) and **sampling event metadata** (location, time, etc.).

Here, the **prerequisite** for publishing metabarcoding features (ASVs/OTUs) in GBIF is to have the representative sequences uploaded to PlutoF (*see section* :ref:`upload_sequences_to_plutof`).

Steps to publish metabarcoding biodiversity data in GBIF through PlutoF:

**1. Create a new GBIF dataset in PlutoF.**

.. figure:: _static/plutof/new_gbif_dataset.png
   :width: 690
   :align: center

|

.. admonition:: Choosing the license

   When creating a new GBIF dataset, you need to choose a license from the following options:

   1. **CC0** - choose when you want to share the data **without any restrictions**.
   2. **CC-BY** - choose when you want to share the data with **attribution to the original authors**.
   3. **CC-BY-NC** - choose when you want to share the data with **attribution to the original authors**, but **not for commercial use**.

**2. Fill in and save the new GBIF dataset form.**

.. figure:: _static/plutof/fill_gbif_form.png
   :width: 690
   :align: center

|

**3. Search for your Sequences (metabarcoding features) in PlutoF, and send them to the clipboard.**

.. figure:: _static/plutof/send_to_clipboard2.png
   :width: 690
   :align: center

|

**4. Go to the Clipboard & Export panel, and click on "Sequences"**

.. figure:: _static/plutof/clipboard_seqs.png
   :width: 690
   :align: center

|

**5. Select all (or some) sequences --> GBIF Publishing --> Append/Overwrite --> Specify GBIF dataset**

.. figure:: _static/plutof/gbif_publishing.png
   :width: 690
   :align: center

|

**6. Go back to Publishing panel --> GBIF Datasets.**

Before publishing, **check the dataset** that is about to be published in GBIF.

- Press the ``Generate DwCA`` button to generate the Darwin Core Archive (DwCA) file.
- Download the DwCA file *(History panel within current GBIF dataset)*.
- Check that everything looks correct.

.. figure:: _static/plutof/DcWA.png
   :width: 690
   :align: center

|

If everything looks good, press the ``Publish to GBIF`` button.

.. figure:: _static/plutof/publish_to_gbif.png
   :width: 200
   :align: center

|

In `GBIF `_, your dataset can be accessed via the **DATASETS** panel.

.. figure:: _static/plutof/gbif_datasets.png
   :width: 690
   :align: center

|

____________________________________________________

|logo_BGE_small| |eufund| |chfund| |ukrifund|