FusionGDB3: Integrating Breakpoints, Generating Personalized Sequences, and Inferring Tumorigenic Scenarios for Human Fusion Genes

Home

Download

Statistics

Examples

About

Contact

Navigation
0. Global Impact & Trusted Users.
1. Resource of Human Fusion Gene Breakpoints.
2. Clinical Assessment of each Gene in Pan-Disease Fusion Genes.
3. DNA sequence-based Fusion Gene Prediction and Fusion Genomic Features.
4. Open Reading Frame (ORF) Analysis and Generation of Fusion Sequences
5. Protein Functional Annotation.
6. Tumorigenic Mechanism of Action Scenarios of Fusion Genes
7. Personalized Fusion Protein Sequence Generation (FusionSeqDB)
8. Three Categories of Fusion Translation
9. Fusion Gene Targeting Studies

0. Global Impact & Trusted Users.
FusionGDB resources are widely integrated, benchmarked, and trusted by premier clinical networks, world-leading cancer institutes, and global bioinformatics consortiums.
Clinical Genomics Platforms: Used for variant tiering and functional filtering by national initiatives and interpretation engines, including Genomics England and VarSome (Saphetor).
Top-Tier Cancer Research Institutes: Referenced in oncogenic mechanism discovery by researchers at MD Anderson Cancer Center, MSKCC, and Dana-Farber Cancer Institute.
Data Hubs & AI Modelers: Highly cited as a gold standard database for training fusion-predictive machine learning models and biological cross-referencing (e.g., UniProt ecosystem and Broad Institute associated studies).

1. Resource of Human Fusion Gene Breakpoints.
We obtained RNA-seq based fusion gene breakpoint information from TCGA, CCLE, cBioPortal, ChimerDB4, ChimerKB4, ChildHoodFusions, and GTEx. From GenBank, we downloaded the nucleotide sequences and identified the breakpoints by running the BLAT. For the WGS-based fusions, we downloaded all the structural variant breakpoint information from dbVar. All genome coordinates of breakpoints were lifted over to hg19 since majority were identified from hg19.

2. Clinical Assessment of each Gene in Pan-Disease Fusion Genes.
Individual genes can involve in the formation of fusion genes from multiple breakpoints with multiple partner genes in multiple cancer types. Based on these numerical features, we introduced a concept of the degree of frequency (DoF) score, which is the product of three values above each gene in our previous study. For example, in the application of this DoF to ~500 human kinomes, we identified multiple features of driver kinase fusion genes (Kim et al., 2016). We also introduced another metric, called the Major Active Isofusion Index (MAII) (Kim et al., 2017), which is calculated by dividing the number of observed samples with a particular fusion gene by its DoF score. Here, the isofusion refers to one particular gene fusion combination, with one particular partner gene and one particular breakpoint, in one particular cancer type. This new score (MAII) can give us the average frequency of each gene for each possible isofusion. By applying the MAII scoring to the transcription factor fusion genes, we found PLAU, which encodes plasminogen activator urokinase and serves as a biomarker for tumor invasion, was found to be consistently activated in the samples with the highest MAII scores. However, these trials are not utilizing all numerical features of individual genes in pan-cancer fusion genes. To better rank the importance of individual genes in pan-cancer fusion genes, we can utilize all existing features.
To help choose the clinically significant kinase so that the cancer patients that have fusion genes can be better diagnosed, we need a metric to infer the assessment of kinases in pan-cancer fusion genes rather than relying on the sample frequency expressed fusion genes. Most of all, multiple studies assessed human kinases as the drug targets using multiple types of genomic and clinical information, but none used the kinase fusion genes in their study. The assessment studies of kinase without kinase fusion gene events can miss the effect of one of the mechanisms that enhance the kinase function in cancer. To fill this gap, we suggested a novel way of assessing genes using a network propagation approach to infer how likely individual kinases influence the kinase fusion gene network composed of ~5K kinase fusion gene pairs (Cheng et al., 2024). To select a better seed of propagation, we chose the top genes via dimensionality reduction like a principal component or latent layer information of six features of individual genes in pan-cancer fusion genes. We tried to infer the clinical assessment of genes in fusion gene network not only relying on the expressed sample size. In FusionGDB3, the users can access the # expressed samples, # expressed disease types, # partner genes, # breakpoints, DoF score, MAII score. These numerical information can provide insight to infer the involvement of individual genes across pan-disease fusion genes, which is helpful for fusion gene diagnosis kit design and development.
** # partners: number of partner genes of one gene in pan-disease fusion genes.
** # breakpoints: number of inside breakpoints of one gene in pan-disease fusion genes.
** # disease types: number of expressed fusion gene samples of one gene.
** DoF score (Degree of Frequency) of each gene = # partners X # break points X # disease types.
** MAII score (Major Active Isofusion Index) of each gene = log2(# samples/DoF score*10).

3. DNA sequence-based Fusion Gene Prediction and Fusion Genomic Features.
Previously, we developed a deep learning-based classifier between fusion gene and no fusion gene breakpoint sequences (FusionAI (Kim et al., iScience, 2021)). For all fusion genes whose breakpoints are located at the exon junction boundaries, we can run FusionAI. For all exon junction breakpoint fusion genes in three categories of ORFs (In-frame fusion genes, 5UTR-3CDS fusions (N-truncated proteins), and 5CDS-3UTR fusions (C-truncated proteins),) We ran the FusionAI and got the probability of fusion gene breakpoint.

We explored the Epigenomic Profiling of human fusion gene breakpoint area using AlphaGenome. To explore the epigenetic baseline surrounding fusion junctions, we utilized AlphaGenome to profile the breakpoint area flanking +/- 5kb across multi-omics tracks: DNase-seq / ATAC-seq: Highlight chromatin openness and active regulatory regions. RNA-seq / CAGE: Capture transcript abundance, accumulation, and precise transcription start sites. ChIP-Histone / ChIP-TF: Map histone modifications and transcription factor binding architectures.
From this analysis of the epigenomic landscapes, we may be able to infer the Putative Functional Mechanisms. These baseline profiles do not definitively prove causality, but offer a data-driven framework to infer the putative mechanisms driving fusion oncogenesis.
alphagenome landscape

- Promoter/Enhancer Hijacking Driven by Tissue-Specific Elements (TMPRSS2-ERG, IGH-BCL2, IGH-MYC): This category features a striking asymmetry across the fusion junction. The 5' partner genes (such as TMPRSS2 or IGH) exhibit dense open chromatin peaks (DNase, ATAC-seq) and strong transcriptionally active signatures (ChIP-Histone, ChIP-TF) strictly localized near the breakpoint, contrasting with a relatively quiescent baseline at the 3' partner locus. The genomic profiles indicate that a downstream proto-oncogene (ERG, BCL2, or MYC) is translocated into a tissue-specific, hyperactive regulatory domain. The structural rearrangement likely hijacks the highly robust, lineage-specific promoter or super-enhancer elements of the 5' partner, leading to the pathologically driven overexpression of the newly juxtaposed oncogenic factor.
- Uncontrolled Kinase Signaling with Persistent Transcriptional Readiness (BCR-ABL1, EML4-ALK, TPM3-NTRK1, ETV6-NTRK3): Characterized by highly synchronized, continuous, and stable RNA-seq signal elongation that seamlessly crosses the breakpoint interface. The chromatin layout shows well-conserved, broad histone accessibility spanning both the 5' and 3' partner loci, without localized transcriptional crashing. This structural signature indicates a robust baseline transcriptional tolerance that yields a fully stable chimeric mRNA. The resulting fusion proteins evade normal auto-regulatory checks through the loss of intrinsic auto-inhibitory domains, facilitating ligand-independent oligomerization and driving constitutive downstream tyrosine kinase signaling cascades.
- Hyperactive Chromatin Hotspots Primed for Aberrant Transcription Factor Binding (EWSR1-FLI1 and NUP214-ABL1): These loci display sharp, heavily overlapping DNase-seq and ATAC-seq peaks that perfectly align with massive, high-amplitude CAGE and ChIP-TF complex enrichment right flanking the fusion breakpoint. The presence of these acute multi-omics hotspots points to an open chromatin framework that serves as a highly active transcriptional seat. This landscape is uniquely optimized for aberrant or unhindered transcription factor assembly, allowing the chimeric proteins to aggressively recruit local chromatin-remodeling machineries or expand their oncogenic footprint across global downstream target genes.
- Post-Transcriptional Disruption and Epigenetic Remodeling Dynamics (FGFR3-TACC3 and KMT2A_AFF1): Distinguished by severe, localized perturbations in RNA-seq processing—such as a massive accumulation of transcripts immediately upstream of the boundary—or sharp, asymmetric drops in specific regulatory histone/TF occupancy marks directly at the breakpoint interface. For FGFR3-TACC3, this structural block suggests an evasion of post-transcriptional silencing mechanisms, where the loss of the native 3'-UTR truncates endogenous miRNA-mediated degradation pathways to secure aberrant mRNA stability. In parallel cases like KMT2A-AFF1, the physical break point severs critical protein-protein interactors within core epigenetic complexes (such as the MLL histone methyltransferase setup), shifting the assembly's functional kinetics toward systemic transcriptional dysregulation.

4. Open Reading Frame (ORF) Analysis and Generation of Fusion Sequences
To analyze the ORF fusion fusion genes, we provide three types of annotations. First, using in-house ORF annotation codes, we annotate the ORF of fusion transcript at given breakpoints and transcript isoforms. Between the 5’-partner gene and the 3’-partner gene, we checked the open reading frame of the full-length fusion transcript sequence. When both breakpoints of 5’- and 3’-genes are located inside of coding region (CDS) and the number of fusion transcript sequences from the transcription start site of 5’-gene to transcription end site of 3’-gene is a multiple of three, then we reported this fusion gene as ‘in-frame’. If there is one or two nucleotide insertion, then we reported as the ‘frame-shift’. Except for these two ORFs, there are 15 more ORFs such as ‘3UTR-CDS’, ‘3UTR-3UTR’, ‘3UTR-5UTR’, ‘3UTR-intron’, ‘CDS-3UTR’, ‘CDS-5UTR’, ‘CDS-intron’, ‘5UTR-CDS’, ‘5UTR-3UTR’, ‘5UTR-5UTR’, ‘5UTR-intron’, ‘intron-CDS’, ‘intron-3UTR’, ‘intron-5UTR’, and ‘intron-intron’. Second, we used ORFfiner. To make the input transcript sequences, for the in-frame fusion genes, 5UTR-3CDS fusion genes (N-truncated cases), 5CDS-3UTR fusions (C-truncated cases), we made full-length transcript sequences considering multiple gene isoforms and multiple breakpoints in the individual partner genes. Then, we input these fusion transcript sequences to the ORFfinder and chose the longest ORF for each fusion transcript as the fusion peptide sequence. Last, we used a deep learning classifier to distinguish coding vs non-coding transcripts, named as deepORF. For the deep learning model design, we adopted it from RNAsamba, a neural network-based assessment of the protein-coding potential of RNA sequences. By retraining the model using high-quality training data set as explained in the pipeline FugionGDB 2.0 paper, we could have dramatically increased the performance of the prediction of coding potentials. The prediction scores’ distributions using RNAsamba and deepORF are shown in Figure below.
deepORFcomp

5. Protein Functional Annotation.
We searched the retention of 39 protein features of UniProt (six molecule processing features, 13 region features, four site features, six amino acid modification features, two natural variation features, five experimental info features, and 3 secondary structure features) at the fusion amino acid sequence level. Through this process, we also checked the retention of protein-protein interaction (PPI) at the fusion protein. Detailed information about all of the protein features is on the UniProt page.

FG functions

FGviewer (Kim et al., 2020) provides functional feature annotations at four different levels: DNA-, RNA-, protein-, and pathogenic levels. The same breakpoint line across four tiers will classify between FG involving or non-involving zone with multiple types of functional features. For our fusion gene annotation page, we provide the protein-level annotation.

FG functions

There are diverse categories of mechanisms of action of fusion genes in human diseases as above.

6. Tumorigenic Mechanism of Action Scenarios of Fusion Genes
Given that less than 1% of all known fusion genes have been functionally characterized or reported in literature, traditional experimental approaches and manual curation face an insurmountable scalability bottleneck. In this context, leveraging Large Language Models (LLMs) offers a uniquely feasible paradigm to map the uncharted functional landscape of the remaining 99% of fusion genes. To mitigate the critical risk of model hallucination and ensure rigorous biological validity, we developed an expert-guided knowledge synthesis framework that strictly confines the LLM’s (Gemini) inferential boundary within the three levels of the central dogma of biology: capturing DNA-level transcription factor swapping, RNA-level miRNA regulation avoidance or swapping, and protein-level functional feature retention.
To systematically operationalize this central dogma-guided reasoning flow, we implemented a rigorous five-step prompt architecture enriched with multi-omics reference data to infer potential mechanisms of action (MoA) and predict downstream hallmarks of cancer. Furthermore, to enforce empirical accountability and prevent fabricated references, we developed a standalone Python validation pipeline that assigns real, PubMed-supported literature evidence to the Gemini-generated FusionGDB3 functional scenarios across four curated fields: mechanism, tumorigenic scenario, targeting point, and targeting background.
The validation pipeline operates under a strict data-governance framework. The input scenario annotations are first parsed into granular, claim-level units (claim-cells). For each claim-cell, the pipeline executes a tiered PubMed retrieval plan via NCBI E-utilities, prioritizing fusion-specific evidence before falling back to gene-specific or template-generic evidence tiers. Crucially, to isolate the language model from generating fabricated PMIDs de novo, the LLM is restricted to selecting references strictly from the pre-retrieved candidate list, followed by an automated cross-validation step. GPT-5.4-mini (and GPT-5.4 for complex cases) acts as a specialized evidence judge to evaluate and categorize the support scope (fusion-specific, gene-specific, pathway-specific, or template-generic). When direct fusion-specific evidence is unavailable, partner-gene-level evidence on binding, expression, regulatory function, pathway involvement, or drug relevance can still provide biologically grounded support for the inferred fusion scenario.
To ensure system scalability and throughput while filtering out weak or overgeneralized literature assignments, the workflow incorporates disk-based caching, bounded concurrent query evaluation, template-library overrides, blocked-PMID rules for curated exceptions, and high-reuse PMID auditing. Manual review and quality control are further applied to identify weak, indirect, withdrawn, or non-specific literature assignments and to improve the reliability of the final annotations. Unresolved or manually flagged claim-cells are automatically processed through follow-up query rewriting and re-querying modules.
While the extreme scarcity of existing literature on the vast majority of fusion genes presents an inherent challenge to precision oncology, our framework—combining a structured central dogma reasoning flow with a highly governed, tiered PubMed verification pipeline—represents the most rigorous, comprehensive, and scalable computational solution currently achievable to map and infer the functional scenarios of human fusion genes.

7. Personalized Fusion Protein Sequence Generation (FusionSeqDB)
So far, our fusion protein sequences have been made based on the reference genome sequences and these sequences are imported to UniProt. In this version, after analyzing the ORF, generation of the full-length fusion trancript seqeunces, and translation of 155K human fusion genes, we could generate more than 77K human fusion protein sequences. This time, to integrate the personalized variant context into the fusion protein sequences, we developed a computational pipeline and all combined fusion protein sequences into FusionSeqDB. For ~ 51,000 fusion genes in ~ 8,100 TCGA individual samples, we reflected the SNVs of individual genomes on the wild-type proteins separately, which are involved in fusion genes. We run the blastp for the reference-genome-based fusion protein sequences against both WT protein sequences and SNV/indel-reflected WT protein sequences. Then, we replace the fusion protein sequences with the matched SNVs/indels-reflected WT protein sequences. We reflected 500K (504,696) SNVs in 10K (10,018) TCGA patients, which were mapped to 6,446 unique fusion genes. For the CCLE data, we reflected 60K (60,296) SNVs in 1,920 cancer cell lines into 2,288 unique fusion genes (6,841 fusion transcripts with different breakpoints). From TCGA and CCLE, we generated 1.13 million (1,129,239) personalized fusion protein sequences.
** In FusionGDB3, for each fusion gene page, we provide diverse sequences of individual fusion genes based on all expressed fusion gene breakpoints and GENCODE v19 gene isoform structures. We provide full-length fusion gene transcript sequences, fusion peptide sequences, personalized fusion peptide sequences, and kinase/DNA binding domain mutated fusion peptide sequences.

8. Three Categories of Fusion Translation
Gene fusion events can make diverse types of proteins. Among available ORF combinations, in this version, we focused on the in-frame fusion genes, and two other ORFs that can make truncated proteins (5UTR-3CDS (N-truncated proteins) and 5CDS-3UTR (C-truncated proteins)). FusionGDB3 provides the information about three types of ORFs in fusion genes following the central dogma. We provide the full-length transcript sequences and protein sequences. The 3D structures of all these categories are available on FusionPDB2

9. Fusion Gene Targeting Studies
To advance the development of fusion protein targeting therapeutics, we need to know how fusion oncoprotein patients were/are treated in the clinic and what small molecules or therapies have been studied, which is critical for developing fusion protein targeting and personalized therapeutics. To fill this crucial gap, we developed FusionPub (Kumar et al., 2025). We performed the PubMed abstract search for 107K human fusion genes with all combinations of 14K drugs/small molecules (1.5 billion times). We also performed the PubMed abstract search for 18K human fusion proteins with all combinations of 8.5K and 8.3K gene synonyms of 5’- and 3’-partners with 14K drugs/small molecules (1.8 quintillion times). In FusionGDB3, the users can access to all fusion gene drug studies and more details can be accessed through FusionPub.