FusionAI: a deep learning classifier to predict human FGBPs.

FusionAI: Predicting Fusion Breakpoint from DNA Sequence with Deep Learning
To understand the mechanisms of the formation and action of human fusion gene breakpoints (FGBPs) in diseases, we classified human FGBP-positive and FGBP-negative sequence contexts using a convolutional neural network. With FusionAI, users can study the features of fusion gene breakpoints and infer the potential breakage of genomic regions of interest.

FusionAI architecture and training using deep learning

To train the FusionAI DNN, we downloaded the breakpoint information of ~48K fusion genes from FusionGDB. Because most fusion genes are predicted from split reads in RNA-seq data, while the real genomic breakpoints would be located in introns, we trained the FusionAI model on sequences of known fusion genes whose breakpoints fall at exon junction-junction positions. Of the ~48K known fusion events, ~33K fusion genes came from the TCGA cohort, and ~26K fusions had breakpoints at the exon junction-junction position (j-j BP combination). To build the fusion-negative breakpoint data, we excluded the 17,110 genes involved in the 48K known human fusion genes from the 43K GENCODE genes. From the remaining 27,116 genes, which are not known to be involved in any fusion gene, we randomly chose pairs of genes as fusion partners. We then filtered out potential false cases, namely multiply mapped cases and breakpoints belonging to repeat regions, paralogs, or pseudogenes, using RepeatMasker, the Duplicated Genes Database, and the HUGO database's pseudogenes. This is the typical pre-processing that fusion prediction tools apply to filter out false positives. We also excluded gene pairs that are genomic neighbors to avoid potential read-through cases. For intra-chromosomal fusion genes, we required a minimum distance of 100 Kbp between the two randomly chosen breakpoints across gene bodies. From the two breakpoints that passed these filters, a 20 Kbp DNA sequence was made by joining the +/- 5 Kbp sequence around each partner gene's BP. This procedure produced ~26K fusion-negative breakpoint sequences. We trained a multi-layer deep neural network on these DNA sequences. The input of the model is a 20 kb sequence of one-hot encoded nucleotides. The output is two probabilities, for the fusion-positive and fusion-negative classes, that sum to one.
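
The sequence construction and encoding described above can be sketched as follows. This is a minimal illustration and not the FusionAI pre-processing code: the function names `join_flanks` and `one_hot` are hypothetical, the A/C/G/T channel order is an assumption, and encoding any other character (such as N) as an all-zero row is also an assumption. Strand-dependent orientation of the flanks is not covered here.

```python
FLANK = 5_000  # +/- 5 kb around each breakpoint, as described in the text

# Assumed channel order; the trained model may use a different one.
CHANNELS = "ACGT"


def join_flanks(seq_5p, seq_3p):
    """Join the 10 kb window of the 5' partner with that of the 3' partner."""
    expected = 2 * FLANK
    if len(seq_5p) != expected or len(seq_3p) != expected:
        raise ValueError(f"each partner window must be {expected} bp long")
    return seq_5p + seq_3p


def one_hot(sequence):
    """Return one 4-element row per nucleotide; unknown bases become all zeros."""
    rows = []
    for base in sequence.upper():
        rows.append([1 if base == channel else 0 for channel in CHANNELS])
    return rows


# Toy input: two artificial 10 kb windows give a 20,000 x 4 encoding.
fused = join_flanks("A" * 10_000, "C" * 10_000)
encoded = one_hot(fused)
```

The result has one row per position and one column per nucleotide, which matches the 20 kb one-hot input described for the model.
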
Our deep neural network consists of two convolutional layers with filter sizes (20, 4) and (200, 1), one max pooling layer, one flatten layer, and two dense layers preceding the output layer. The model has 2,672,002 parameters, counting both the weight matrices and the biases of the relevant layers. Of a combined total of 52K BPs (26K j-j combination BPs and 26K non-FGBPs), 36.4K BPs (~70%) were used in the training step (further divided into 80% for training and 20% for validation), and the remaining 15.6K BPs (~30%) were used for an independent test. We then tested the trained model on both the 26K original training samples and the 15.6K test samples. The accuracies for the training and test data sets were 97.4% (AUCROC=0.9962) and 90.8% (AUCROC=0.9706), with 0.12 and 0.42 error rates, respectively. This performance is much better than that of a traditional machine learning method, SVM, which yielded accuracies of 79% and 72% for the training and test data, respectively (Figure 1D). You can download the training and test data from here.
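
The two-stage split above can be checked with a few lines of arithmetic. The sketch below is illustrative only: `split_counts` is a hypothetical helper, and it uses the rounded total of 52K BPs from the text, so the exact counts in the released data may differ.

```python
def split_counts(total, train_fraction=0.7, fit_fraction=0.8):
    """Return (fit, validation, test) sample counts for a two-stage split."""
    # Stage 1: hold out the independent test set.
    training_pool = round(total * train_fraction)
    test = total - training_pool
    # Stage 2: divide the training pool into fitting and validation parts.
    fit = round(training_pool * fit_fraction)
    validation = training_pool - fit
    return fit, validation, test


fit, validation, test = split_counts(52_000)
print(fit, validation, test)  # 29120 7280 15600
```

With these rounded numbers, about 29.1K BPs are used to fit the weights, about 7.3K for validation, and 15.6K for the independent test.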

Figure 1. Overview of FusionAI. (a) The investigation of fusion gene breakpoints of 48K FGs from FusionGDB identified the BP location across the human genome. (b) Making training and test datasets of fusion-positive and -negative breakpoints. (c) Diagram of fusion gene breakpoints classification by FusionAI. (d) Effect of the size of the input sequence context on the accuracy.

Tutorial for the FusionAI

Requirements
Software and algorithms to run FusionAI
- python (>=3)
- python modules of tensorflow and keras
:conda install -c bioconda tensorflow
:conda install -c conda-forge keras

Software and algorithms to draw the feature landscape image (if you do not need to draw this image, you do not need to install this software)
nibFrag: http://hgdownload.soe.ucsc.edu/admin/jksrc.zip or here
R (>=3.5): https://www.r-project.org/
bedtools (>=2.26.0): https://bedtools.readthedocs.io/en/latest/content/installation.html
bedtoolsr (2.30.0.1): http://phanstiel-lab.med.unc.edu/bedtoolsr-install.html
optparse (>=1.6.0): https://cran.r-project.org/web/packages/optparse/index.html
doParallel (1.0.16): https://cran.r-project.org/web/packages/doParallel/index.html
iterators (1.0.13): https://cran.r-project.org/web/packages/iterators/index.html
magrittr (2.0.1): https://cran.r-project.org/web/packages/magrittr/index.html
foreach (1.5.1): https://cran.r-project.org/web/packages/foreach/index.html
ggplot2 (3.3.5): https://cran.r-project.org/web/packages/ggplot2/index.html
gridExtra (2.3): https://cran.r-project.org/web/packages/gridExtra/index.html
scales (1.1.1): https://cran.r-project.org/web/packages/scales/index.html
cowplot (1.1.1): https://cran.r-project.org/web/packages/cowplot/index.html
ggpubr (>=0.1.7): https://cran.r-project.org/web/packages/ggpubr/index.html

Needed files (Please create a new directory named FusionAI and download all files and scripts into it.)
- gencode_hg19v19_.txt: GENCODE version 19 gene structure information
- nib_files_hg19.tar.gz: hg19 nib files
- chromosome_size.txt: chromosome size information
- features_info.txt: feature information
- features.tar.gz: 44 genomic features of human genome

FusionAI codes
- newdat_newmod_jj.h5: The FusionAI CNN model
- pre_processing_for_FusionAI_from_tab_delim.py: A python script to make input data for FusionAI from fusion gene information
- FusionAI_pred.py: A python script to predict the output scores of the FusionAI model
- FusionAI_FIS.py: A python script to calculate the feature importance scores (FIS) across the 20Kb fusion DNA sequence
- FusionAI_genomic_features.R: An R script to draw a figure of the distribution of 44 human genomic features across the 20Kb fusion DNA sequence
- FusionAI_genomic_features2.R: An R script to draw a figure of the distribution of 44 human genomic features across the 20Kb fusion DNA sequence, focusing on the top 10% FIS regions only

Example fusion gene information files
- k562_starfusion.txt: fusion genes predicted by STAR-Fusion for the K562 cell line.

FusionAI protocol
1. Data preparation
: Read the fusion gene information and make a 20Kb DNA sequence for each fusion by combining the +/- 5kb flanking sequences of the two breakpoints. This sequence is the input of FusionAI.
: $ python pre_processing_for_FusionAI_from_tab_delim.py [INPUT_FILE]
: $ python pre_processing_for_FusionAI_from_tab_delim.py k562_starfusion.txt

2. Run FusionAI
: Predict the fusion breakpoint tendency using the FusionAI model. Here, $INPUT_FILE is the output file of step 1, which contains the 20Kb DNA sequences. $COLA and $COLB are the DNA sequences of the 5' and 3' fusion partner genes created in step 1. To run FusionAI for one specific fusion gene, set $INDEX_OF_FUSION to the row number of that fusion in the fusion gene file from the previous step.
: $ python FusionAI_FIS.py [-h] -f FILENAME [-m MODEL, default: newdat_newmod_jj.h5] [-o OUTPUT] [-A COLA] [-B COLB] [-I ROWI] [-N NGPUS]
: $ python FusionAI_pred.py -f k562_starfusion.FusionAI.input -o k562_starfusion.FusionAI.output -m newdat_newmod_jj.h5

3. Select fusions with high FusionAI prediction scores, or other fusions of interest (for example, using Excel).
: Users can set the cutoff at 0.5, 0.9, or 0.95, depending on the number of fusions retained and their overlap with the fusion predictions of other tools.
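
As an alternative to a spreadsheet, the cutoff can be applied with a short script. The sketch below rests on assumptions: the prediction output is taken to be tab-delimited with the fusion-positive probability in the last column (as in the IO format section), and the helper name `select_fusions` and the example rows are hypothetical. Check the actual column layout of your output file before using it.

```python
import csv


def select_fusions(lines, cutoff=0.5):
    """Keep rows whose last column (assumed fusion-positive probability) is >= cutoff."""
    kept = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        try:
            score = float(row[-1])
        except ValueError:
            continue  # skip a header or malformed line
        if score >= cutoff:
            kept.append(row)
    return kept


# Made-up rows for illustration: gene names, negative and positive probability.
example = ["GENEA\tGENEB\t0.08\t0.92", "GENEC\tGENED\t0.70\t0.30"]
high_confidence = select_fusions(example, cutoff=0.9)
```

To filter a real file, pass an open file handle instead of the list, for example `select_fusions(open("k562_starfusion.FusionAI.output"), cutoff=0.9)`.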

4. Calculate the feature importance scores across 20Kb fusion DNA sequence.
: Check the feature importance across the 20Kb fusion DNA sequence of the selected fusion genes. To run this step for one specific fusion gene, set $INDEX_OF_FUSION to the row number of that fusion in the fusion gene file from the previous step.
: $ python FusionAI_FIS.py [-h] -f [INPUT_FILE] -m [MODEL] -o [OUTPUT] -A [COLA] -B [COLB] -I [INDEX_OF_FUSION]
: $ python FusionAI_FIS.py -f k562_starfusion.FusionAI.output -o k562_starfusion.FusionAI.output.FIS

5. Visualize 44 human genomic features across 20Kb DNA sequence.
5-1. Draw the distribution of 44 human genomic features across the 20Kb fusion DNA sequence.
: $ Rscript FusionAI_genomic_features.R -g [FUSION_GENE_FILE] -f [FEATURE_PATH] -s [CHROMOSOME_SIZE_FILE] -i [FEATURE_INFO_FILE] -o [OUTPUT_FILE_PATH]
: $ Rscript FusionAI_genomic_features.R -g k562_starfusion.FusionAI.output.FIS -f ./features/ -s chromosome_size.txt -i features_info.txt -o ./K562/whole_features/

5-2. Draw the distribution of 44 human genomic features across the 20Kb fusion DNA sequence, focusing on the top 10% FIS regions only.
: $ Rscript FusionAI_genomic_features2.R -g [FUSION_GENE_FILE] -f [FEATURE_PATH] -s [CHROMOSOME_SIZE_FILE] -i [FEATURE_INFO_FILE] -o [OUTPUT_FILE_PATH]
: $ Rscript FusionAI_genomic_features2.R -g k562_starfusion.FusionAI.output.FIS -f ./features/ -s chromosome_size.txt -i features_info.txt -o ./K562/top1pct_features/
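
To illustrate what "top 10% FIS regions" means, the sketch below picks the highest-scoring positions from a list of per-position scores. It is an assumption-laden illustration: the helper `top_percent_positions` is hypothetical, the scores are assumed to be one number per position, and the R script may select or group regions differently (for example, by absolute value or by contiguous windows).

```python
def top_percent_positions(scores, percent=10):
    """Return the sorted indices of the highest-scoring `percent` of positions."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Integer ceiling of len(scores) * percent / 100, without float rounding.
    n_keep = -(-len(scores) * percent // 100)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy scores for ten positions; the top 20% are positions 1 and 3.
toy_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.0, 0.5, 0.4, 0.7, 0.6]
print(top_percent_positions(toy_scores, percent=20))  # [1, 3]
```

For a full 20Kb sequence with one score per base, the default 10% would keep 2,000 positions.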

Training/testing/validation data
- Fusion data
: We used exon junction-junction breakpoint fusions from TCGA fusion data (26K=18K training + 8K test)

- Training data set
: ~ 18K fusion-positive breakpoint sequences + ~ 18K fusion-negative breakpoint sequences
: Hgene\tHchr\tHbp\tHstrand\tTgene\tTchr\tTbp\tTstrand\tSequence
- Training data information
: Hgene\tHchr\tHbp\tHstrand\tTgene\tTchr\tTbp\tTstrand\tT or F(1 or 0)
- Simulation RNA-seq split reads of training data
: Simulation RNA-seq split reads aligned at the exon junction-junction breakpoints of fusion events in training data set

- Test data set
: ~ 8K fusion-positive breakpoint sequences + ~ 8K fusion-negative breakpoint sequences
: Hgene\tHchr\tHbp\tHstrand\tTgene\tTchr\tTbp\tTstrand\tSequence
- Test data information
: Hgene\tHchr\tHbp\tHstrand\tTgene\tTchr\tTbp\tTstrand\tT or F(1 or 0)
- Simulation RNA-seq split reads of test data
: Simulation RNA-seq split reads aligned at the exon junction-junction breakpoints of fusion events in test data set

- Sanger sequence based fusion data for validation
: Sanger sequence based exon junction-junction breakpoints of fusion events from ChiTaRs-3.1
: Input for FusionAI (20Kbp sequence)
: Simulation RNA-seq split reads

- 2200 fusion events from ~600 cancer cell-lines for validation
: Sanger sequence based exon junction-junction breakpoints of fusion events from 600 cancer cell-lines
: Input for FusionAI (20Kbp sequence)
: Simulation RNA-seq split reads

IO format
-Input by the user (Input in step 1)
:Hgene\tHchr\tHbp\tHstrand\tTgene\tTchr\tTbp\tTstrand
-Input of FusionAI after pre-processing (Output in step 1-2/Input in step 2)
:Hgene\tHchr\tHbp\tHstrand\tTgene\tTchr\tTbp\tTstrand\t5'-gene sequence (10Kb)\t3'-gene sequence (10Kb)
-output
:Fusion-negative probability\tFusion-positive probability (Output in step 3)
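
The user input format above can be read with a small parser. This is a minimal sketch, not part of the FusionAI scripts: the `FusionRecord` type and the `parse_fusion_line` function are hypothetical names, the example line is made up, and whether the real input file carries a header row is not stated here, so header handling is left to the caller.

```python
from typing import NamedTuple


class FusionRecord(NamedTuple):
    """One fusion event in the user input format (eight tab-separated fields)."""
    hgene: str
    hchr: str
    hbp: int
    hstrand: str
    tgene: str
    tchr: str
    tbp: int
    tstrand: str


def parse_fusion_line(line):
    """Parse 'Hgene, Hchr, Hbp, Hstrand, Tgene, Tchr, Tbp, Tstrand' from one line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields, got {len(fields)}")
    hgene, hchr, hbp, hstrand, tgene, tchr, tbp, tstrand = fields[:8]
    # Breakpoint positions are converted to integers; other fields stay as text.
    return FusionRecord(hgene, hchr, int(hbp), hstrand, tgene, tchr, int(tbp), tstrand)


# Made-up example line for illustration only.
record = parse_fusion_line("GENEA\tchr1\t1000000\t+\tGENEB\tchr2\t2000000\t-\n")
```

Extra columns after the eighth, such as the two 10Kb sequences added by pre-processing, are ignored by this parser.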

FusionGDB2: fusion gene annotation update aided by deep learning

Gene fusion, which arises from chromosomal rearrangement initiated by DNA double-strand breakage, is one of the hallmarks of the cancer genome. A knowledgebase with systematic functional annotation of fusion genes is critical for understanding the genomic breakage context and developing therapeutic strategies. FusionGDB is a unique functional annotation database of human fusion genes and is widely used in studies with diverse aims. In this study, we report FusionGDB 2.0, which has substantial content updates: up-to-date human fusion genes; a fusion gene breakage tendency score from the FusionAI deep learning model, based on the 20kb genomic sequence around the BP area; an investigation of the overlap between fusion breakpoints and 44 human genomic features across five categories of cellular roles; transcribed chimeric sequences and subsequent open reading frame analysis, with coding potential based on a deep learning approach using Ribo-seq read features; and a rigorous investigation of the retention of protein features of individual fusion partner genes at the protein level. Among ~ 126k fusion genes, about 30k kept their ORF in-frame, which is triple the number in the previous version, FusionGDB. FusionGDB 2.0 will be used as the reference knowledgebase of fusion gene annotations. FusionGDB 2.0 provides seven categories of annotations: Fusion Gene Summary, Fusion Gene ORF, Fusion Gene Genomic Feature, Fusion Protein Feature, Fusion Sequence, Related Drug, and Related Disease.

About us

Pora Kim, MS, PhD, Hua Tan, PhD, and Xiaobo Zhou, PhD

Mailing address:

  Center for Computational Systems Medicine
  School of Biomedical Informatics
  The University of Texas Health Science Center at Houston
  7000 Fannin Street, Houston, TX 77030

Citation
    To cite the FusionGDB website in a publication, please quote the following:
    - Kim P*, Tan H, Liu J, Lee H, Jung H, Kumar H, and Zhou X. FusionGDB 2.0: fusion gene annotation update aided by deep learning. Nucleic Acids Res. 2021 Nov 10; doi: 10.1093/nar/gkab1056
    - Kim P*, Tan H, Liu J, Yang M, and Zhou X*, FusionAI: Predicting fusion breakpoint from DNA sequence with deep learning. iScience. 2021 Sep 25; 24(10):103164. doi: 10.1016/j.isci.2021.103164.
    - Kim P*, Yiya K*, and Zhou X*. FGviewer: an online visualization tool for functional features of human fusion genes. Nucleic Acids Res. 2020 Jul 2;48(W1):W313-W320.
    - Kim P and Zhou X*. FusionGDB: fusion gene annotation DataBase. Nucleic Acids Res. 2019 Jan 8;47(D1):D994-D1004