Clinical Metaproteomics 1: Database-Generation

Overview
Creative Commons License: CC-BY
Questions:
  • Why do we need to generate a customized database for metaproteomics research?

  • How do we reduce the size of the database?

Objectives:
  • Download databases related to 16S rRNA data.

  • Combine host and microbial proteins for better identification results.

  • Generate a reduced database, which provides better FDR statistics.

Time estimation: 3 hours
Published: Jan 15, 2025
Last modification: Jan 15, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT
PURL: https://gxy.io/GTN:T00460
Revision: 0

Metaproteomics is the large-scale characterization of the entire complement of proteins expressed by microbiota. However, metaproteomic analysis of clinical samples is challenged by the presence of abundant human (host) proteins, which hampers the confident detection of lower-abundance microbial proteins (Batut et al. 2018; Jagtap et al. 2015).

To address this, we used tandem mass spectrometry (MS/MS) and bioinformatics tools on the Galaxy platform to develop a metaproteomics workflow that characterizes the metaproteomes of clinical samples. This clinical metaproteomics workflow holds potential for general clinical applications, such as studying potential secondary infections during COVID-19 infection and microbiome changes during cystic fibrosis, as well as for broad research questions regarding host-microbe interactions.

Clinical Metaproteomics workflow.

The first workflow for clinical metaproteomics data analysis is the Database Generation workflow. The Galaxy-P team has developed a workflow in which a large comprehensive database is generated by downloading protein sequences of known disease-causing microorganisms, and a compact database is then generated from it using the MetaNovo tool.

Database Generation Workflow.

Agenda

In this tutorial, we will cover:

  1. Data Upload
    1. Get data
  2. Import Workflow
  3. Step-by-step analysis
    1. Download Protein Sequences using taxon names
    2. Download Species Protein Sequences using UniProt XML downloader with UniProt
    3. Merging databases to obtain a large comprehensive database for MetaNovo
  4. Reducing Database size
    1. Metanovo tool generates a compact database from your comprehensive database with MetaNovo
    2. Merging databases to obtain reduced MetaNovo database for peptide discovery with FASTA Merge Files and Filter Unique Sequences
  5. Conclusion

Data Upload

Get data

Hands-on: Data Upload
  1. Create a new history for this tutorial
  2. Import the files from Zenodo or from the shared data library (GTN - Material -> microbiome -> Clinical Metaproteomics 1: Database-Generation):

    https://zenodo.org/records/10105821/files/HUMAN_SwissProt_Protein_Database.fasta
    https://zenodo.org/records/10105821/files/Species_UniProt_FASTA.fasta
    https://zenodo.org/records/10105821/files/Contaminants_(cRAP)_Protein_Database.fasta
    https://zenodo.org/records/10105821/files/PTRC_Skubitz_Plex2_F10_9Aug19_Rage_Rep-19-06-08.mgf
    https://zenodo.org/records/10105821/files/PTRC_Skubitz_Plex2_F11_9Aug19_Rage_Rep-19-06-08.mgf
    https://zenodo.org/records/10105821/files/PTRC_Skubitz_Plex2_F13_9Aug19_Rage_Rep-19-06-08.mgf
    https://zenodo.org/records/10105821/files/PTRC_Skubitz_Plex2_F15_9Aug19_Rage_Rep-19-06-08.mgf
    
    • Copy the link location
    • Click Upload Data at the top of the tool panel

    • Select Paste/Fetch Data
    • Paste the link(s) into the text field

    • Press Start

    • Close the window

    As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:

    1. Go into Data (top panel) then Data libraries
    2. Navigate to the correct folder as indicated by your instructor.
      • On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
    3. Select the desired files
    4. Click on Add to History near the top and select as Datasets from the dropdown menu
    5. In the pop-up window, choose

      • “Select history”: the history you want to import the data to (or create a new one)
    6. Click on Import

  3. Rename the datasets
  4. Check that the datatype of each dataset is correct (fasta for the protein databases, mgf for the spectra)

    • Click on the pencil icon for the dataset to edit its attributes
    • In the central panel, click the Datatypes tab on the top
    • Under Assign Datatype, select the datatype from the “New type” dropdown
      • Tip: you can start typing the datatype into the field to filter the dropdown menu
    • Click the Save button

  5. Optional: add a tag to each database corresponding to its file name.
  6. Create a dataset collection of the 4 MGF datasets.

    Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.

    To tag a dataset:

    1. Click on the dataset to expand it
    2. Click on Add Tags
    3. Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
    4. Press Enter
    5. Check that the tag appears below the dataset name

    Tags beginning with # are special!

    They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parenthesis correspond to dataset numbers in the figure below):

    1. a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
    2. dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
    3. datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
    4. datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.

    A history without name tags versus history with name tags

    Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.

    The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.

    More information is in a dedicated #nametag tutorial.
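
Returning to the upload steps above, the datatype check and the grouping of the four MGF files into a collection can be sketched outside Galaxy as follows. This is only an illustration: the `EXTENSION_TO_DATATYPE` mapping and the `describe` helper are assumptions made for this sketch, not part of Galaxy, and nothing is downloaded.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

# The seven Zenodo links listed in the Data Upload step.
BASE = "https://zenodo.org/records/10105821/files/"
ZENODO_URLS = [
    BASE + "HUMAN_SwissProt_Protein_Database.fasta",
    BASE + "Species_UniProt_FASTA.fasta",
    BASE + "Contaminants_(cRAP)_Protein_Database.fasta",
    BASE + "PTRC_Skubitz_Plex2_F10_9Aug19_Rage_Rep-19-06-08.mgf",
    BASE + "PTRC_Skubitz_Plex2_F11_9Aug19_Rage_Rep-19-06-08.mgf",
    BASE + "PTRC_Skubitz_Plex2_F13_9Aug19_Rage_Rep-19-06-08.mgf",
    BASE + "PTRC_Skubitz_Plex2_F15_9Aug19_Rage_Rep-19-06-08.mgf",
]

# Assumed mapping from file extension to the datatype expected in Galaxy.
EXTENSION_TO_DATATYPE = {".fasta": "fasta", ".mgf": "mgf"}


def describe(url):
    """Return (file name, expected datatype) for one download link."""
    name = PurePosixPath(unquote(urlparse(url).path)).name
    suffix = PurePosixPath(name).suffix.lower()
    return name, EXTENSION_TO_DATATYPE.get(suffix, "unknown")


datasets = [describe(url) for url in ZENODO_URLS]

# The MGF files are the members of the dataset collection; the FASTA
# files are the three protein databases.
mgf_collection = [name for name, dtype in datasets if dtype == "mgf"]
fasta_databases = [name for name, dtype in datasets if dtype == "fasta"]

for name, dtype in datasets:
    print(f"{dtype}\t{name}")
```

Any file reported as `unknown` would be one whose datatype needs to be set by hand in the dataset attributes.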

Import Workflow

Hands-on: Running the Workflow
  1. Import the workflow into Galaxy:

    Hands-on: Importing and launching a GTN workflow
    Launch the Database Generation workflow.
    • Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
    • Click on Import at the top-right of the screen
    • Paste the following URL into the box labelled “Archived Workflow URL”: https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/clinical-mp-1-database-generation/workflows/WF1_Database_Generation_Workflow.ga
    • Click the Import workflow button

  2. Run the workflow using the following parameters:

    • “Send results to a new history”: No
    • “Input Dataset collection”: MGF dataset collection
    • “Species_tabular”: Species_tabular.tabular
    • Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
    • Click on the Run workflow button next to your workflow
    • Configure the workflow as needed
    • Click the Run Workflow button at the top-right of the screen
    • You may have to refresh your history to see the queued jobs

Step-by-step analysis

Download Protein Sequences using taxon names

First, we want to generate a large comprehensive protein sequence database using the UniProt XML Downloader to extract sequences for species of interest. To do so, you will need a tabular file that contains a list of species.

For this tutorial, a literature survey was conducted to obtain 118 taxonomic species of organisms that are commonly associated with the female reproductive tract (Afiuni-Zadeh et al. 2018). This species list was used to generate a protein sequence FASTA database with the UniProt XML Downloader tool within the Galaxy framework. In this tutorial, the Species FASTA database (~3.38 million sequences) has already been provided as input. However, if you have your own list of species of interest as a tabular file (Your_Species_tabular.tabular), the steps to generate a FASTA file from it are given below:

Download Species Protein Sequences using UniProt XML downloader with UniProt

Hands-on: UniProt XML downloader
  1. UniProt (Galaxy version 2.3.0) with the following parameters:
    • “Select”: Your_Species_tabular.tabular
      • “Dataset (tab separated) with Taxon ID/Name column”: output (Input dataset)
      • “Column with Taxon ID/name”: c1
    • “UniProt output format”: fasta
  2. Rename the output as Species_UniProt_FASTA.fasta

    Comment: UniProt description

    This tool downloads protein sequences in FASTA format from UniProt for the taxon names (or IDs) listed in the input file.
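
The tool reads the taxa from one column of a tab-separated file (c1 in the parameters above). A minimal sketch of that column extraction is shown below, assuming a file with one taxon per row and optional extra columns. The `read_taxa` helper is hypothetical, and the two species names are illustrative placeholders rather than entries taken from the tutorial's 118-species list.

```python
import csv
import io


def read_taxa(tabular_text, column=0):
    """Return the unique, non-empty values of one column, in first-seen order."""
    seen = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) <= column:
            continue  # blank line or row that is too short
        value = row[column].strip()
        if value and not value.startswith("#"):
            seen.setdefault(value, None)
    return list(seen)


# Illustrative input only: a duplicate, a blank line, and an extra column.
example = (
    "Lactobacillus crispatus\tnote\n"
    "Gardnerella vaginalis\n"
    "\n"
    "Lactobacillus crispatus\n"
)

taxa = read_taxa(example)  # column=0 corresponds to c1
print(taxa)
```

Removing duplicates before the download avoids requesting the same taxon twice; whether the Galaxy tool does this itself is not stated in this tutorial.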

Question
  1. Can we use a higher taxonomy clade than species for the UniProt XML downloader?
  2. Why are we using the tools separately? Can we run it all together?
  3. Can we select multiple files together?
  4. How many FASTA files can be merged at once, i.e. is there a limit on the number/size of files?
Solution
  1. Yes, the UniProt XML downloader can also be used for generating a database from Genus, Family, Order, or any other higher taxonomy clade.
  2. The tools are run separately to reduce the load on the server and tool. If you have a limited number of taxon names, then you can run it all together.
  3. Yes, that certainly can be done. We used one input file at a time to maintain the order of sequences in the database.
  4. There is no limit.

Merging databases to obtain a large comprehensive database for MetaNovo

Once generated, the Species UniProt database (~3.38 million sequences) will be merged with the Human SwissProt database (reviewed only; ~20.4K sequences) and the contaminant (cRAP) sequence database (116 sequences), then filtered for unique sequences to generate the large comprehensive database (~2.59 million sequences). MetaNovo will then use this large comprehensive database to generate a compact, much more manageable database.

Hands-on: Download contaminants with Protein Database Downloader
  1. Protein Database Downloader (Galaxy version 0.3.4) with the following parameters:
    • “Download from?”: cRAP (contaminants)
  2. Rename as “Protein Database Contaminants (cRAP)”
Hands-on: Human SwissProt (reviewed) database
  1. Protein Database Downloader (Galaxy version 0.3.4) with the following parameters:
    • “Download from?”: UniProtKB(reviewed only)
      • In “Taxonomy”: Homo sapiens (Human)
      • In “reviewed”: UniProtKB/Swiss-Prot (reviewed only)
      • In “Proteome Set”: Reference Proteome Set
      • In “Include isoform data”: False
  2. Rename as “Protein Database Human SwissProt”.
Question
  1. How often is the Protein Database Downloader updated?
Solution
  1. It is updated every 3 months.
Hands-on: FASTA Merge Files and Filter Unique Sequences
  1. FASTA Merge Files and Filter Unique Sequences (Galaxy version 1.2.0) with the following parameters:
    • “Run in batch mode?”: Merge individual FASTAs (output collection if input is collection)
      • In “Input FASTA File(s)”:
        • “Insert Input FASTA File(s)”
          • “FASTA File”: Species_UniProt_FASTA (output of UniProt XML downloader tool)
          • “FASTA File”: Protein Database Human SwissProt (output of Protein Database Downloader tool)
          • “FASTA File”: Protein Database Contaminants (cRAP) (output of Protein Database Downloader tool)
  2. Rename the output as “Human UniProt Microbial Proteins cRAP for MetaNovo”.
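
To illustrate why the merged database is smaller than the sum of its inputs, here is a minimal sketch of merging FASTA files and keeping only unique entries. It assumes uniqueness is judged on the sequence alone and that the first header seen is kept; the Galaxy tool's actual criterion may differ. The `parse_fasta` and `merge_unique` helpers and the toy records are invented for this example.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def merge_unique(*fasta_texts):
    """Merge inputs in order, keeping the first entry for each sequence."""
    merged = {}
    for text in fasta_texts:
        for header, sequence in parse_fasta(text):
            merged.setdefault(sequence, header)
    return [(header, sequence) for sequence, header in merged.items()]


# Toy records: the contaminant repeats a sequence from the species file.
species = ">sp|A|microbe_1\nMKTAYIAK\n>sp|B|microbe_2\nMSLLTEVE\n"
human = ">sp|C|human_1\nMEEPQSDP\n"
crap = ">sp|D|contaminant_1\nMKTAYIAK\n"

merged = merge_unique(species, human, crap)
for header, sequence in merged:
    print(f">{header}\n{sequence}")
```

Because inputs are processed in order, the order in which the FASTA files are listed decides which header survives when two entries share a sequence.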

Reducing Database size

Metanovo tool generates a compact database from your comprehensive database with MetaNovo

Next, the large comprehensive database of ~2.59 million sequences can be reduced using the MetaNovo tool to generate a more manageable database that contains identified proteins.

The compact MetaNovo-generated database (~1.9K sequences) will be merged with the Human SwissProt (reviewed only) and contaminants (cRAP) databases to generate the reduced database (~21.2K protein sequences) that will be used for peptide identification (see Discovery Module tutorial).

Hands-on: MetaNovo
  1. MetaNovo (Galaxy version 1.9.4+galaxy4) with the following parameters:
    • “MGF Input Type”: Collection
      • “MGF Collection”: output (Input dataset collection)
    • “FASTA File”: output (output of FASTA Merge Files and Filter Unique Sequences tool)
    • In “Spectrum Matching Parameters”:
      • “Fragment ion mass tolerance”: 0.01
      • “Enzyme”: Trypsin (no P rule)
      • “Fixed modifications as comma separated list”: Carbamidomethylation of C, TMT 10-plex of K, TMT 10-plex of peptide N-term
      • “Variable modifications as comma separated list”: Oxidation of M
      • “Maximal charge to search for”: 5
    • In “Import Filters”:
      • “The maximal peptide length to consider when importing identification files”: 50
  2. Rename as “MetaNovo Compact Database”.
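
Two of the settings above, “Trypsin (no P rule)” and the maximal peptide length of 50, can be illustrated with a small in-silico digestion. This is a conceptual sketch only, not MetaNovo's implementation. It assumes cleavage after every K or R (without the exception for a following proline) and no missed cleavages, and the `tryptic_peptides` helper is hypothetical.

```python
def tryptic_peptides(protein, max_length=50, min_length=1):
    """Cleave after every K or R and keep peptides within the length bounds."""
    peptides, start = [], 0
    for index, residue in enumerate(protein):
        if residue in "KR":
            peptides.append(protein[start:index + 1])
            start = index + 1
    if start < len(protein):
        # C-terminal peptide that does not end in K or R
        peptides.append(protein[start:])
    return [p for p in peptides if min_length <= len(p) <= max_length]


# Under the classic proline rule the K-P bond would stay intact, giving
# "MKPLAR"; without the rule the digest cleaves between K and P.
peptides = tryptic_peptides("MKPLARGSK")
print(peptides)
```

A sequence with no K or R that is longer than 50 residues yields no peptides at all under these settings, which is the effect of the length filter.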
Question
  1. Why are we reducing the size of the database?
  2. Why is the TMT 10-plex modification used when the data is 11-plex?
  3. Regarding the MetaNovo Spectrum Matching parameters, which are the most “important” ones? That is, if a user wants to reduce or increase the sensitivity or the number of output sequences, what should they change?
Solution
  1. Reducing the size of the database improves search speed, FDR, and sensitivity.
  2. There is no option for 11-plex modifications in MetaNovo, hence we use TMT 10-plex.
  3. The most important parameters are the tolerances (MS1 and MS2) and any modifications introduced during the processing of the data.

Merging databases to obtain reduced MetaNovo database for peptide discovery with FASTA Merge Files and Filter Unique Sequences

Hands-on: FASTA Merge Files and Filter Unique Sequences
  1. FASTA Merge Files and Filter Unique Sequences (Galaxy version 1.2.0) with the following parameters:
    • “Run in batch mode?”: Merge individual FASTAs (output collection if input is collection)
      • In “Input FASTA File(s)”:
        • “Insert Input FASTA File(s)”
          • “FASTA File”: MetaNovo Compact Database (output of MetaNovo tool)
          • “FASTA File”: Protein Database Human SwissProt (output of Protein Database Downloader tool)
          • “FASTA File”: Protein Database Contaminants (cRAP) (output of Protein Database Downloader tool)
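
The approximate sequence counts quoted in this tutorial can be put side by side to see how much each merge-and-filter step removes and how large the overall reduction is. The figures below are the rounded values from the text, so the results are rough estimates only, and the variable names are chosen for this sketch.

```python
# Rounded sequence counts quoted in the tutorial text.
species_uniprot = 3_380_000
human_swissprot = 20_400
crap = 116
comprehensive = 2_590_000   # after merging and filtering
metanovo_compact = 1_900
reduced = 21_200            # after merging and filtering

# Entries removed by the unique-sequence filter in each merge step.
comprehensive_inputs = species_uniprot + human_swissprot + crap
removed_from_comprehensive = comprehensive_inputs - comprehensive

reduced_inputs = metanovo_compact + human_swissprot + crap
removed_from_reduced = reduced_inputs - reduced

# How many times smaller the reduced database is than the comprehensive one.
fold_reduction = comprehensive / reduced

print(f"comprehensive: {comprehensive_inputs:,} in, "
      f"{removed_from_comprehensive:,} removed")
print(f"reduced: {reduced_inputs:,} in, {removed_from_reduced:,} removed")
print(f"fold reduction: {fold_reduction:.0f}x")
```

With these rounded inputs the reduced database is roughly two orders of magnitude smaller than the comprehensive one.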

Conclusion

The first step of a clinical metaproteomics study is database generation. As we did not have a reference database or information from 16S rRNA sequencing data, we generated a FASTA database from a literature survey. If 16S rRNA data is available, the taxa identified from it can be used to generate a customized database instead. Because the comprehensive database is generally too large, we used the MetaNovo tool to reduce its size. This reduced database will then be used in the clinical metaproteomics discovery workflow.