Neoantigen 1: Fusion-Database-Generation
Author(s): Subina Mehta, Katherine Do, James Johnson
Editor(s): Pratik Jagtap, Timothy J. Griffin
Overview
Questions:
- Why do we need to generate a customized fusion database for proteogenomics research?
Objectives:
- Downloading databases related to 16SrRNA data
- For better neoantigen identification results.
Time estimation: 2 hours
Published: Jan 14, 2025
Last modification: Jan 14, 2025
License: Tutorial Content is licensed under Creative Commons Attribution 4.0 International License. The GTN Framework is licensed under MIT.
Revision: 1
A neoantigen is a novel peptide (protein fragment) that is produced by cancer cells due to mutations, including gene fusions, that alter the DNA sequence in a way that generates unique proteins not found in normal cells. Because these mutated proteins are unique to the tumor, they are recognized as “foreign” by the immune system. Neoantigens are valuable in immunotherapy because they can serve as specific targets for the immune system, allowing treatments to selectively attack cancer cells while sparing normal tissue. By stimulating an immune response specifically against these neoantigens, therapies like cancer vaccines or T-cell-based treatments can be developed to enhance the body’s natural defense mechanisms, making neoantigens a promising avenue for personalized cancer treatment.
Creating a fusion database is essential in cancer genomics and personalized medicine, as it enables the identification of crucial biomarkers, enhances diagnostic accuracy, and supports therapeutic development. Gene fusions, where parts of two previously separate genes merge, can produce abnormal proteins that drive cancer. Cataloging these fusion events in a database helps researchers identify specific biomarkers linked to cancer types and design more targeted treatments. Additionally, fusion events may lead to unique peptide sequences, known as neoantigens, which are found only in cancer cells. These neoantigens can be targeted by the immune system, making fusion databases valuable in designing personalized immunotherapies like cancer vaccines or T-cell therapies. Some gene fusions also create oncogenic proteins that promote tumor growth, such as the BCR-ABL fusion in chronic myeloid leukemia. Including such information in a database aids in identifying potential therapeutic targets and predicting treatment efficacy. On the diagnostic side, known gene fusions serve as reliable markers, helping clinicians better classify cancer types and choose the most effective treatments. Finally, fusion databases provide a critical reference for researchers studying fusion mechanisms, their impact on disease progression, and their prevalence across cancers, ultimately fueling the discovery of novel treatments and therapies.
This workflow uses two tools to generate the fusion database: RNA STAR for alignment and Arriba for fusion detection.
Overview of Fusion Neoantigen Database Workflow
The workflow in this tutorial guides users through the generation of a fusion neoantigen database, covering key steps in bioinformatics to identify, filter, and prepare fusion-specific peptides for further immunological study. Below is an overview of each major stage:
- Get Data: The process begins with the upload and quality assessment of raw sequencing data, which is then uncompressed. This stage sets the groundwork for all subsequent analyses.
- Fusion Detection and Alignment: RNA sequencing data is aligned to a reference genome with RNA STAR, and Arriba then detects fusion events in the alignments. These tools identify gene fusions and help characterize the gene segments that combine to form new fusion genes.
- Filtering and Refinement: After identifying fusions, various filters are applied to remove non-specific or common fusion events using blacklist data and other criteria. This step ensures that only relevant, unique fusion events are retained for neoantigen prediction.
- Peptide Sequence Extraction and Formatting: Potential neoantigen peptides are extracted from the fusion gene sequences. Using tools such as Text Reformatting and Tabular-to-FASTA, the data is transformed into formats suitable for further immunological analysis.
- Final Database Formatting: The workflow concludes by applying regex adjustments and formatting functions to standardize the output, creating a database of potential fusion neoantigens.
In summary, this workflow provides a structured approach to preparing fusion neoantigen data for downstream applications, such as immunotherapy research, by making fusion-derived peptides accessible in a database for experimental or clinical exploration.
Get data
Hands-on: Data Upload
- Create a new history for this tutorial
- Import the files from Zenodo or from the shared data library (GTN - Material -> proteomics -> Neoantigen 1: Fusion-Database-Generation):
https://zenodo.org/records/14365542/files/human_reference_genome.fasta
https://zenodo.org/records/14365542/files/human_reference_genome_annotation.gtf
https://zenodo.org/records/14365542/files/RNA-Seq_Reads_1.fastqsanger.gz
https://zenodo.org/records/14365542/files/RNA-Seq_Reads_2.fastqsanger.gz
- Copy the link location
- Click Upload Data at the top of the tool panel
- Select Paste/Fetch Data
- Paste the link(s) into the text field
- Press Start
- Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
- Go into Data (top panel) then Data libraries
- Navigate to the correct folder as indicated by your instructor.
- On most Galaxies, tutorial data will be provided in a folder named GTN - Material -> Topic Name -> Tutorial Name.
- Select the desired files
- Click on Add to History near the top and select as Datasets from the dropdown menu
- In the pop-up window, choose
  - “Select history”: the history you want to import the data to (or create a new one)
- Click on Import
- Rename the datasets
- Check that the datatype of each dataset is correct
  - Click on the pencil icon for the dataset to edit its attributes
  - In the central panel, click the Datatypes tab on the top
  - In Assign Datatype, select the appropriate datatype from the “New type” dropdown
  - Tip: you can start typing the datatype into the field to filter the dropdown menu
  - Click the Save button
Add to each database a tag corresponding to …
Datasets can be tagged. This simplifies the tracking of datasets across the Galaxy interface. Tags can contain any combination of letters or numbers but cannot contain spaces.
To tag a dataset:
- Click on the dataset to expand it
- Click on Add Tags
- Add tag text. Tags starting with # will be automatically propagated to the outputs of tools using this dataset (see below).
- Press Enter
- Check that the tag appears below the dataset name
Tags beginning with # are special!
They are called Name tags. The unique feature of these tags is that they propagate: if a dataset is labelled with a name tag, all derivatives (children) of this dataset will automatically inherit this tag (see below). The figure below explains why this is so useful. Consider the following analysis (numbers in parentheses correspond to dataset numbers in the figure below):
- a set of forward and reverse reads (datasets 1 and 2) is mapped against a reference using Bowtie2 generating dataset 3;
- dataset 3 is used to calculate read coverage using BedTools Genome Coverage separately for + and - strands. This generates two datasets (4 and 5 for plus and minus, respectively);
- datasets 4 and 5 are used as inputs to Macs2 broadCall datasets generating datasets 6 and 8;
- datasets 6 and 8 are intersected with coordinates of genes (dataset 9) using BedTools Intersect generating datasets 10 and 11.
Now consider that this analysis is done without name tags. This is shown on the left side of the figure. It is hard to trace which datasets contain “plus” data versus “minus” data. For example, does dataset 10 contain “plus” data or “minus” data? Probably “minus” but are you sure? In the case of a small history like the one shown here, it is possible to trace this manually but as the size of a history grows it will become very challenging.
The right side of the figure shows exactly the same analysis, but using name tags. When the analysis was conducted datasets 4 and 5 were tagged with #plus and #minus, respectively. When they were used as inputs to Macs2 resulting datasets 6 and 8 automatically inherited them and so on… As a result it is straightforward to trace both branches (plus and minus) of this analysis.
More information is in a dedicated #nametag tutorial.
Data preparation
Convert compressed file to uncompressed
Uncompressing data is a crucial first step in many bioinformatics workflows because raw sequencing data files, especially from high-throughput sequencing, are often stored in compressed formats (such as .gz or .zip) to save storage space and facilitate faster data transfer. Compressed files need to be uncompressed to make the data readable and accessible for analysis tools, which generally require the data to be in plain text or other compatible formats. By uncompressing these files, we ensure that downstream applications can efficiently process and analyze the raw sequencing data without compatibility issues related to compression. In this workflow, we do that for both forward and reverse files.
Hands-on: Converting compressed to uncompressed (forward reads)
- Convert compressed file to uncompressed. with the following parameters:
  - “Choose compressed file”: RNA-Seq_Reads_1.fastqsanger.gz (Input dataset)
Hands-on: Converting compressed to uncompressed (reverse reads)
- Convert compressed file to uncompressed. with the following parameters:
  - “Choose compressed file”: RNA-Seq_Reads_2.fastqsanger.gz (Input dataset)
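Outside Galaxy, the same decompression step can be sketched with Python's standard library. This is only an illustration of what the conversion does, not the Galaxy tool itself; the helper name `gunzip_file` and the file paths are assumptions made for the example.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src, dest=None):
    """Decompress a .gz file and return the path of the uncompressed copy.

    Hypothetical helper for illustration; not part of Galaxy.
    """
    src = Path(src)
    # Default output name: drop the trailing ".gz" suffix.
    dest = Path(dest) if dest is not None else src.with_suffix("")
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        # Stream in chunks so large FASTQ files are never held fully in memory.
        shutil.copyfileobj(fin, fout)
    return dest
```

Calling `gunzip_file("RNA-Seq_Reads_1.fastqsanger.gz")` would write `RNA-Seq_Reads_1.fastqsanger` next to the input; the same call is repeated for the reverse reads.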
Alignment with RNA STAR
RNA STAR (Spliced Transcripts Alignment to a Reference) is a high-performance tool used to align RNA sequencing (RNA-seq) reads to a reference genome. It identifies the best matches between RNA reads and genome sequences by detecting exon-exon junctions, which are critical for accurately mapping reads from spliced transcripts. RNA STAR uses a “two-pass” mapping approach that first identifies splice junctions across all reads and then uses these junctions to guide a more accurate alignment on the second pass. This capability is especially valuable for studying gene expression, discovering novel splice variants, and identifying fusion genes in cancer and other disease research. The output includes aligned sequences that can be used in subsequent steps of bioinformatics pipelines, such as fusion detection and differential expression analysis.
Hands-on: Spliced transcripts Alignment to a human reference
- RNA STAR (Galaxy version 2.7.10b+galaxy4) with the following parameters:
  - “Single-end or paired-end reads”: Paired-end (as individual datasets)
  - “RNA-Seq FASTQ/FASTA file, forward reads”: RNA-Seq_Reads_1.fastqsanger (output of Convert compressed file to uncompressed. tool)
  - “RNA-Seq FASTQ/FASTA file, reverse reads”: RNA-Seq_Reads_2.fastqsanger (output of Convert compressed file to uncompressed. tool)
  - “Custom or built-in reference genome”: Use a built-in index
  - “Reference genome with or without an annotation”: use genome reference without builtin gene-model but provide a gtf
  - “Select reference genome”: Human Dec. 2013 (GRCh38/hg38) (hg38)
  - “Gene model (gff3,gtf) file for splice junctions”: human_reference_genome_annotation.gtf (Input dataset)
  - “Per gene/transcript output”: No per gene or transcript output
  - “Use 2-pass mapping for more sensitive novel splice junction discovery”: Yes, perform single-sample 2-pass mapping of all reads
  - “Report chimeric alignments?”: Within the BAM output (together with regular alignments; WithinBAM SoftClip) soft-clipping in the CIGAR for supplemental chimeric alignments
  - In “Output filter criteria”:
    - “Would you like to set additional output filters?”: No
  - In “Algorithmic settings”:
    - “Configure seed, alignment and limits options”: Use parameters suggested for STAR-Fusion
  - “Compute coverage”: No coverage
Question
1. What is RNA STAR, and what does it do?
2. How do I interpret the alignment statistics in STAR’s output?
Solution
1. STAR is a tool for aligning RNA-Seq reads to a reference genome, helping researchers understand gene expression and identify splice junctions. STAR requires RNA-Seq reads, usually in FASTQ format. It also needs a reference genome file in FASTA format and annotation files in GTF/GFF format to build an index. STAR outputs alignments in BAM/SAM format, as well as splice junction files. It can also provide additional alignment stats in log files.
2. STAR provides logs with mapping statistics, such as the percentage of uniquely mapped reads, which can be useful for quality control. Aligned BAM files from STAR can be visualized in genome browsers like IGV (Integrative Genomics Viewer) to examine coverage and splicing.
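As a rough illustration of reading those mapping statistics programmatically, the sketch below parses the `name | value` lines that STAR writes to its final log. The function names and the numbers in `EXAMPLE_LOG` are made up for the example; a real log contains many more metrics.

```python
def parse_star_log(log_text):
    """Return {metric name: value string} for each 'name | value' line."""
    stats = {}
    for line in log_text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" carry no value
        name, value = line.split("|", 1)
        stats[name.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a value such as '89.50%' to the float 89.5."""
    return float(stats[key].rstrip("%"))


# Toy excerpt in the layout of STAR's final log (values are invented).
EXAMPLE_LOG = (
    "                          Number of input reads |\t1000\n"
    "                                    UNIQUE READS:\n"
    "                   Uniquely mapped reads number |\t895\n"
    "                        Uniquely mapped reads % |\t89.50%\n"
)

stats = parse_star_log(EXAMPLE_LOG)
unique_pct = percent(stats, "Uniquely mapped reads %")
```

A low percentage of uniquely mapped reads is usually a prompt to check the reference, the read quality, or possible contamination before moving on to fusion detection.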
Fusion detection with Arriba
Arriba is a specialized tool used for detecting gene fusions from RNA sequencing (RNA-seq) data. It is particularly focused on identifying fusion events in cancer, where gene fusions can drive oncogenic processes. Arriba uses the output from RNA STAR alignments, specifically looking at chimeric alignments that result from fusion transcripts, and applies a series of filtering steps to reduce false positives.
Arriba’s pipeline includes features for:
- Filtering out common artifacts and false-positive fusions based on blacklisted regions.
- Annotating fusion breakpoints.
- Generating a visualization of detected fusion events.
The output includes a list of fusion candidates with key information like fusion partners, breakpoint locations, reading frames, and peptide sequences. Arriba’s results can provide insight into potential neoantigens, helping guide research into therapeutic targets or immune-based therapies for cancer.
Hands-on: Fusion detection
- Arriba (Galaxy version 2.4.0+galaxy1) with the following parameters:
  - “STAR Aligned.out.sam”: mapped_reads (output of RNA STAR tool)
  - “Genome assembly fasta (that was used for STAR alignment)”: From your history
  - “Genome assembly fasta”: human_reference_genome.fasta (Input dataset)
  - “Genome GTF annotation source”: From your history
  - “Gene annotation in GTF format”: human_reference_genome_annotation.gtf (Input dataset)
  - “File containing blacklisted ranges.”: blacklist (output of Arriba Get Filters tool)
  - “File containing protein domains”: protein_domains (output of Arriba Get Filters tool)
  - “File containing known fusions”: known_fusions (output of Arriba Get Filters tool)
  - “Use whole-genome sequencing data”: no
  - “Generate visualization”: Yes
  - “Cytobands”: cytobands (output of Arriba Get Filters tool)
Question
1. What is Arriba, and what does it do?
2. How can I ensure Arriba finds specific known fusions?
Solution
1. Arriba is a tool for detecting gene fusions in RNA-Seq data, especially helpful for identifying cancer-associated fusions and other structural variations. Arriba needs: a BAM file with RNA-Seq reads aligned by STAR; STAR’s chimeric alignments to identify candidate fusion junctions (in this tutorial they are reported within the BAM output); and reference annotation files, such as a gene annotation GTF file and a blacklist file to filter false positives.
2. Ensure that the STAR alignment and Arriba parameters are optimized for sensitivity. Adjusting settings for segment length and alignment quality in STAR can improve detection of specific known fusions.
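One simple way to check whether a fusion of interest was reported is to look it up in Arriba's tab-separated fusion table. The sketch below assumes the header begins with `#gene1` and contains `gene2` and `confidence` columns; the function name `find_fusion` and the toy rows are hypothetical, and exact name matching will miss entries where Arriba lists several candidate genes in one field.

```python
import csv
import io


def find_fusion(tsv_text, gene_a, gene_b):
    """Return rows whose gene pair matches, in either orientation."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    wanted = {gene_a, gene_b}
    hits = []
    for row in reader:
        # The first header cell of the fusion table is "#gene1".
        if {row["#gene1"], row["gene2"]} == wanted:
            hits.append(row)
    return hits


# Toy table with a subset of columns (gene names and values are invented).
TOY_FUSIONS = (
    "#gene1\tgene2\tconfidence\treading_frame\n"
    "GENEA\tGENEB\thigh\tin-frame\n"
    "GENEC\tGENED\tlow\tout-of-frame\n"
)

hits = find_fusion(TOY_FUSIONS, "GENEB", "GENEA")
```

If an expected fusion is absent, that is the point at which revisiting the STAR chimeric settings or Arriba filters becomes worthwhile.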
Postprocessing
Clean up data using Text reformatting
Text Reformatting is a step used in bioinformatics workflows to manipulate and clean up data for easier downstream processing. In fusion detection workflows, text reformatting is often used to parse and restructure output files, making the data consistent and accessible for subsequent analysis steps.
In this workflow, text reformatting involves:
- Extracting specific columns or fields from tabular outputs, such as gene names, breakpoint coordinates, or fusion peptide sequences.
- Formatting peptide sequences and related information into specific columns or concatenating fields for unique identifiers.
- Converting the data into a consistent format that downstream tools can interpret, such as converting tab-separated values into a structured layout for database input or analysis.
The reformatting step ensures that the processed data adheres to the requirements of other tools, enabling seamless integration across the workflow and supporting reliable, interpretable final results.
Hands-on: Formatting Arriba output
- Text reformatting (Galaxy version 1.1.2) with the following parameters:
  - “File to process”: fusions_tsv (output of Arriba tool)
  - “AWK Program”:
    # Header line: remember the index of each column we need
    (NR==1){
      for (i=1;i<=NF;i++) {
        if ($i ~ "gene1") { gene1 = i; }
        if ($i == "gene2") { gene2 = i; }
        if ($i == "breakpoint1") { breakpoint1 = i; }
        if ($i == "breakpoint2") { breakpoint2 = i; }
        if ($i == "reading_frame") { reading_frame = i; }
        if ($i == "peptide_sequence") { pscol = i; }
      }
    }
    # Data lines: keep fusions that have a predicted peptide sequence
    (NR>1){
      pseq = $pscol
      if (pseq != ".") {
        bp = index(pseq,"|");            # position of the fusion junction
        pos = bp - 8;                    # start 8 residues before the junction
        n=split(pseq,array,"|");
        pep = toupper(array[1] array[2])
        sub("[*]","",pep)                # drop the stop-codon marker
        g1 = $gene1;
        g2 = $gene2;
        sub("[(,].*","",g1);             # keep only the first gene name
        sub("[(,].*","",g2);
        id = g1 "_" g2
        brkpnts = $breakpoint1 "_" $breakpoint2
        neopep = substr(pep,pos)
        if ($reading_frame == "in-frame") {
          neopep = substr(pep,pos,16)    # 16-residue window around the junction
        }
        print(id "\t" (NR-1) "\t" brkpnts "\t" neopep);
      }
    }
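For readers less familiar with AWK, the logic of the program above can be mirrored in Python. This is a sketch under stated assumptions rather than the tutorial's method: it assumes tab-separated input whose header contains the column names used by the AWK program, and the function name and toy rows (gene names, breakpoints, peptides) are invented for illustration.

```python
import re


def fusion_peptide_rows(tsv_text):
    """Mirror the AWK step: one (id, record number, breakpoints, peptide) per fusion."""
    lines = tsv_text.splitlines()
    # Locate columns by name; the first header cell is "#gene1".
    col = {name.lstrip("#"): i for i, name in enumerate(lines[0].split("\t"))}
    rows = []
    for record_number, line in enumerate(lines[1:], start=1):
        fields = line.split("\t")
        pseq = fields[col["peptide_sequence"]]
        if pseq == ".":
            continue  # no peptide predicted for this fusion
        # 1-based position of the junction marker, minus 8 residues.
        pos = (pseq.find("|") + 1) - 8
        # Join the two sides of the junction, upper-case, drop the first "*".
        pep = "".join(pseq.split("|")[:2]).upper().replace("*", "", 1)
        # Keep only the text before the first "(" or ",".
        g1 = re.split(r"[(,]", fields[col["gene1"]], maxsplit=1)[0]
        g2 = re.split(r"[(,]", fields[col["gene2"]], maxsplit=1)[0]
        start = max(pos - 1, 0)  # convert to 0-based and clamp at the start
        if fields[col["reading_frame"]] == "in-frame":
            neopep = pep[start:max(pos + 15, 0)]  # window of up to 16 residues
        else:
            neopep = pep[start:]  # everything from 8 residues before the junction
        breakpoints = fields[col["breakpoint1"]] + "_" + fields[col["breakpoint2"]]
        rows.append((g1 + "_" + g2, record_number, breakpoints, neopep))
    return rows


# Toy input (all values invented for illustration).
TOY_TSV = (
    "#gene1\tgene2\tbreakpoint1\tbreakpoint2\treading_frame\tpeptide_sequence\n"
    "GENEA\tGENEB\t1:100\t2:200\tin-frame\tMKTAYIAKQR|qisfvkshfsrq\n"
    "GENEX\tGENEY\t5:500\t6:600\t.\t.\n"
    "GENEC(1500),GENED(2300)\tGENEE\t3:300\t4:400\tout-of-frame\tMKT|lsp*\n"
)

rows = fusion_peptide_rows(TOY_TSV)
```

For in-frame fusions the window covers up to eight residues on each side of the junction, which is the region where a fusion-specific peptide can differ from both parent proteins.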
Data refinement with Query Tabular
Query Tabular is a bioinformatics tool used to extract and manipulate specific data from tabular datasets in workflows. This tool allows users to perform SQL-like queries on tabular data, enabling them to filter, aggregate, and transform datasets based on user-defined criteria.
In this workflow, the Query Tabular tool is employed for several purposes:
- Data Filtering: Users can select specific rows based on certain conditions (e.g., filtering fusions that meet particular criteria).
- Column Manipulation: Users can specify which columns to retain or create new columns by combining or transforming existing data.
- Aggregation: The tool allows for summarizing data, such as counting occurrences of specific fusion events or summarizing results based on particular categories.
- Output Customization: Users can format the output to suit downstream processing needs, making it easier to pass data to subsequent analysis tools.
By leveraging Query Tabular, researchers can efficiently refine and structure their data, ensuring that only relevant information is carried forward in the workflow, ultimately aiding in the identification and analysis of significant biological insights.
Hands-on: Manipulating the data to extract fusions
- Query Tabular (Galaxy version 3.3.1) with the following parameters:
  - In “Database Table”:
    - “Insert Database Table”
      - “Tabular Dataset for Table”: outfile (output of Text reformatting tool)
      - In “Table Options”:
        - “Specify Column Names (comma-separated list)”: c1,c2,c3,c4
  - “SQL Query to generate tabular output”: SELECT t1.c1 || '__' || t1.c2 || '__' || t1.c3, t1.c4 FROM t1
  - “include query result column headers”: No
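To see what the SQL query produces, it can be run against a small in-memory SQLite table. The sketch below uses Python's `sqlite3` module; the helper name, the column types, and the sample row are assumptions for illustration, while the SELECT statement is the one given in the hands-on.

```python
import sqlite3


def build_fasta_table(rows):
    """Run the tutorial's SELECT on rows of (c1, c2, c3, c4)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t1 (c1 TEXT, c2 INTEGER, c3 TEXT, c4 TEXT)")
    con.executemany("INSERT INTO t1 VALUES (?, ?, ?, ?)", rows)
    # "||" is SQL string concatenation: id, record number and breakpoints
    # are joined with "__" into a single identifier column.
    result = con.execute(
        "SELECT t1.c1 || '__' || t1.c2 || '__' || t1.c3, t1.c4 FROM t1"
    ).fetchall()
    con.close()
    return result


# Toy row in the layout produced by the Text reformatting step (values invented).
query_output = build_fasta_table(
    [("GENEA_GENEB", 1, "1:100_2:200", "TAYIAKQRQISFVKSH")]
)
```

The result has two columns, a combined identifier and the peptide sequence, which is the shape the Tabular-to-FASTA step expects.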
Transform data using Tabular-to-FASTA
Tabular to FASTA conversion is a common task in bioinformatics that transforms data structured in a tabular format (such as CSV or TSV) into FASTA format, widely used for representing nucleotide or protein sequences. This conversion is essential when sequence data needs to be input into various bioinformatics tools or databases that require FASTA-formatted files.
Hands-on: Converting tabular to fasta
- Tabular-to-FASTA (Galaxy version 1.1.1) with the following parameters:
  - “Tab-delimited file”: output (output of Query Tabular tool)
  - “Title column(s)”: c1
  - “Sequence column”: c2
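The conversion itself is simple enough to sketch in a few lines of Python: each row becomes a header line starting with `>` followed by a sequence line. The function name and default column positions are assumptions for the example, and the sample row is invented.

```python
def tabular_to_fasta(tab_text, title_col=0, seq_col=1):
    """Turn tab-separated (title, sequence) rows into FASTA text."""
    records = []
    for line in tab_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        records.append(">" + fields[title_col] + "\n" + fields[seq_col] + "\n")
    return "".join(records)


fasta_text = tabular_to_fasta("GENEA_GENEB__1__1:100_2:200\tTAYIAKQRQISFVKSH\n")
```

Column 1 supplies the FASTA title and column 2 the peptide, matching the “Title column(s)” and “Sequence column” settings above.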
Using Regex Find And Replace
Regular expressions (regex) are a powerful technique for text manipulation: they let you search for patterns and replace them with the desired text. In this step, we use a regex find and replace to add a “fusion” tag to every header in the FASTA database.
Hands-on: Adding fusion tag in the fasta header
- Regex Find And Replace (Galaxy version 1.0.3) with the following parameters:
  - “Select lines from”: output (output of Tabular-to-FASTA tool)
  - In “Check”:
    - “Insert Check”
      - “Find Regex”: >(\b\w+\S+)(.*$)
      - “Replacement”: >generic|fusion_\1|\2
Regular expressions are a standardized way of describing patterns in textual data. They can be extremely useful for tasks such as finding and replacing data. They can be a bit tricky to master, but learning even just a few of the basics can help you get the most out of Galaxy.
Finding
Below are just a few examples of basic expressions:
Regular expression    Matches
abc                   an occurrence of abc within your data
(abc|def)             abc or def
[abc]                 a single character which is either a, b, or c
[^abc]                a character that is NOT a, b, nor c
[a-z]                 any lowercase letter
[a-zA-Z]              any letter (upper or lower case)
[0-9]                 numbers 0-9
\d                    any digit (same as [0-9])
\D                    any non-digit character
\w                    any alphanumeric character
\W                    any non-alphanumeric character
\s                    any whitespace
\S                    any non-whitespace character
.                     any character
\.                    a literal . character
{x,y}                 between x and y repetitions
^                     the beginning of the line
$                     the end of the line

Note: you see that characters such as *, ?, ., + etc have a special meaning in a regular expression. If you want to match on those characters, you can escape them with a backslash. So \? matches the question mark character exactly.

Examples
Regular expression    Matches
\d{4}                 4 digits (e.g. a year)
chr\d{1,2}            chr followed by 1 or 2 digits
.*abc$                anything with abc at the end of the line
^$                    empty line
^>.*                  Line starting with > (e.g. Fasta header)
^[^>].*               Line not starting with > (e.g. Fasta sequence)

Replacing
Sometimes you need to capture the exact value you matched on, in order to use it in your replacement. We do this using capture groups (...), which we can refer to using \1, \2 etc for the first and second captured values. If you want to refer to the whole match, use &.

Regular expression      Input           Captures
chr(\d{1,2})            chr14           \1 = 14
(\d{2}) July (\d{4})    24 July 1984    \1 = 24, \2 = 1984
An expression like s/find/replacement/g indicates a replacement expression, this will search (s) for any occurrence of find, and replace it with replacement. It will do this globally (g) which means it doesn’t stop after the first match.
Example: s/chr(\d{1,2})/CHR\1/g will replace chr14 with CHR14 etc.
You can also use replacement modifier such as convert to lower case \L or upper case \U. Example: s/.*/\U&/g will convert the whole text to upper case.
Note: In Galaxy, you are often asked to provide the find and replacement expressions separately, so you don’t have to use the s/../../g structure.
There is a lot more you can do with regular expressions, and there are a few different flavours in different tools/programming languages, but these are the most important basics that will already allow you to do many of the tasks you might need in your analysis.
Tip: RegexOne is a nice interactive tutorial to learn the basics of regular expressions.
Tip: Regex101.com is a great resource for interactively testing and constructing your regular expressions, it even provides an explanation of a regular expression if you provide one.
Tip: Cyrilex is a visual regular expression tester.
- Rename the output FASTA as Arriba-Fusion-Database.fasta
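To preview what the find and replace does to a header, the same pattern and replacement can be tried with Python's `re` module. This is a sketch assuming Python's regex flavour behaves like the Galaxy tool's for this pattern; the function name and the sample record are invented for illustration.

```python
import re

# Pattern and replacement copied from the hands-on step above.
FIND = r">(\b\w+\S+)(.*$)"
REPLACEMENT = r">generic|fusion_\1|\2"


def tag_fusion_headers(fasta_text):
    """Apply the find/replace line by line; sequence lines contain no '>'."""
    tagged = [re.sub(FIND, REPLACEMENT, line) for line in fasta_text.splitlines()]
    return "\n".join(tagged) + "\n"


tagged_fasta = tag_fusion_headers(">GENEA_GENEB__1__1:100_2:200\nTAYIAKQRQISFVKSH\n")
# Header becomes: >generic|fusion_GENEA_GENEB__1__1:100_2:200|
```

The first capture group holds the identifier up to the first whitespace and the second holds anything after it, so headers without a description end in a trailing `|`.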
Conclusion
The workflow outlined above demonstrates a systematic approach to building a fusion neoantigen database, with each step contributing to accurate and reliable results. By integrating RNA STAR for alignment and Arriba for gene fusion detection, researchers can effectively analyze complex transcriptomic information. The conversion from tabular data to FASTA format and the regex find-and-replace step standardize the output so that it can be used as a searchable database. Ultimately, this workflow not only facilitates the identification of neoantigens but also contributes to the broader goals of personalized medicine and targeted therapies. By leveraging these methodologies, researchers can gain deeper insights into the genetic underpinnings of diseases and advance the development of innovative treatments.
Rerunning on your own data
To rerun this entire analysis at once, you can use our workflow. Below we show how to do this:
Hands-on: Running the Workflow
Import the workflow into Galaxy:
Hands-on: Importing and launching a GTN workflow
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on Import at the top-right of the screen
- Paste the following URL into the box labelled “Archived Workflow URL”:
https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/neoantigen-1-fusion-database-generation/workflows/main_workflow.ga
- Click the Import workflow button
Run the workflow using the following parameters:
- “Send results to a new history”: No
- “RNA-Seq_Reads_1 (forward strand)”: RNA-Seq_Reads_1.fastqsanger.gz
- “RNA-Seq_Reads_2 (reverse strand)”: RNA-Seq_Reads_2.fastqsanger.gz
- “Human Reference Genome Annotation”: human_reference_genome_annotation.gtf
- “Human Reference Genome”: human_reference_genome.fasta
- Click on Workflow on the top menu bar of Galaxy. You will see a list of all your workflows.
- Click on the Run workflow button next to your workflow
- Configure the workflow as needed
- Click the Run Workflow button at the top-right of the screen
- You may have to refresh your history to see the queued jobs
Disclaimer
Please note that all the software tools used in this workflow are subject to version updates and changes. As a result, the parameters, functionalities, and outcomes may differ with each new version. Additionally, if the protein sequences are downloaded at different times, the number of sequences may also vary due to updates in the reference databases or tool modifications. We recommend that users verify the specific versions of the software tools used, to ensure the reproducibility and accuracy of results.