========= Demo ========= .. contents:: :local: demo1_motif: ============== Predict protein-DNA binding affinity matrices for target transcription factors (TFs) and identify associated DNA motifs. In demo1, we assume the **independence** of each nucleotide. Demo1 contains two types of transcription factors (TF): ACE2 and SWI4, corresponding to the two sub-demos demo1_independent_ace2_0.1K.sh and demo1_independent_swi4_2K.sh respectively. .. code-block:: bash sbatch job_demo1 # Alternatively, you can run the individual demos: ./demo1_independent_ace2_0.1K.sh ./demo1_independent_swi4_2K.sh Taking ACE2 as an example (demo1_independent_ace2_0.1K.sh), here is a basic usage example. Input --------------- 1. demo1_motif/in/ace2_0.1K/demo_ace2_0.1Kseq_gctggt.txt 2. demo1_motif/in/ace2_0.1K/demo_ace2_0.1Kseq_gctggt.fa Code ----------- In the demo1_independent_ace2_0.1K.sh script, we want to compare the time taken with and without the parallel option. The following code activates the **bpf bayesPI** functionality, utilizing the **parallel** flag and specifying the number of **user_cores**. .. code-block:: bash echo "Run bpf using 5 cores for parallelization" echo "Export at: demos/demo1_independent/out/demo1_out_bpf_t1/ace2_0.1K" (time bpf bayesPI -seed=1 -initial_beta=500,50,5 -max_evidence=3 -dependence=0 -strand=0 -max_iteration=500 -normalize=2 \ -max_loop=3 \ -out=../out/demo1_out_bpf_t1/ace2_0.1K \ -exp=../in/ace2_0.1K/demo_ace2_0.1Kseq_gctggt.txt \ -seq=../in/ace2_0.1K/demo_ace2_0.1Kseq_gctggt.fa \ -parallel -user_cores=5 -min_L=6 -max_L=10 -p_value=0.001) 2 > ../out/demo1_out_bpf_t1/ace2_0.1K/error.demo1_ace2_0.1K_bpf Output --------------- In the above example, the motif length is set to a minimum of 6 bases and a maximum of 10 bases. Additionally, the parallel options are set with -user_cores=5. As a result, the final output includes a PSAM matrix for each motif length, along with the corresponding log files, as shown below: .. code-block:: bash out/demo1_out_bpf_t1 |- ace2_0.1k | |- demo_ace2_0.1Kseq_gctggt.txt_mlpout_L6_1.mlp | |- demo_ace2_0.1Kseq_gctggt.txt_mlpout_L7_1.mlp | |- demo_ace2_0.1Kseq_gctggt.txt_mlpout_L8_1.mlp | |- demo_ace2_0.1Kseq_gctggt.txt_mlpout_L9_1.mlp | |- demo_ace2_0.1Kseq_gctggt.txt_mlpout_L10_1.mlp | |- log_S1_L6 | |- log_S2_L7 | |- log_S3_L8 | |- log_S4_L9 | |- log_S5_L10 | |- log_S5_L10 | |- error.demo1_ace2_0.1K_bpf Among them, when motif_length=6, the results of the binding affinity matric in demo_ace2_0.1Kseq_gctggt.txt_mlpout_L6_1.mlp are: .. csv-table:: Binding affinity matrix of motif :header: "nucleotide", "a", "c", "g", "t" :widths: 15, 20, 25, 30, 35 "c", "0.74724306686324027", "1.2759837425504077", "0.56698383886360115", "0.95816838046764075" "g", "0.40376773149443163", "0.77463518426667954", "3.2019016654587409", "0.51722884120337764" "c", "0.52962660830922337", "4.1522862696325049", "0.31629466850952381", "0.74468155290330562" "t", "0.67986460124544101", "0.57764506718348219", "0.31870531423072723", "4.1385360918265537" "g", "0.34612747709397218", "0.56201213675680073", "3.5345202847981922", "0.7533682281783346" "g", "0.69663332358645791", "0.43659844547252891", "3.0646109714127938", "0.55572222792180148" The sequence_logo of demo_ace2_0.1Kseq_gctggt.txt_mlpout_L6_1.mlp is as follows: .. image:: /example_figs/demo1_ace2_0.1Kseq_no_methy.jpg :alt: Alternative text describing the image :width: 600px :align: center demo2_dinuc: ============== In demo2, we consider DNA structural features as a specific form of **position-specific dinucleotide dependency**, using the **bpf shape_heatmap** functionality to generate the heatmap. .. code-block:: bash # run the demo2 ./demo2_dinucleotide_dependence.sh Input --------------- 1. demo2_dinuc/in/Gcm1_3732.1_v2_deBruijn.aligned.train.fasta.mini 2. demo2_dinuc/in/Gcm1_3732.1_v2_deBruijn.signal Code ----------- .. code-block:: bash echo "Run the parallel bayesPI using 6 cores" echo "Export at: demos/demo2_dinucleotide_dependence/out/demo2_out_bpf_t1" echo (time bpf bayesPI -out=../out/demo2_out_bpf_t1 \ -seq=../in/Gcm1_3732.1_v2_deBruijn.aligned.train.fasta.mini \ -exp=../in/Gcm1_3732.1_v2_deBruijn.signal \ -normalize=4 -min_L=10 -max_L=15 -max_loop=1 -flank_sequence=5 \ -dependence=1 -seed=1 \ -parallel -user_cores=6 \ -information_content_threshold=0 \ -max_evidence=2 -max_iteration=200) 2> ../out/demo2_out_bpf_t1/error.demo2_bpf echo "Generating dinucleotide heatmaps" input_dir="../out/demo2_out_bpf_t1" file_pattern="*.interct" for file in "$input_dir"/$file_pattern; do if [[ -f $file ]]; then echo "Processing file: $file" bpf shape_heatmap -sf="$file" else echo "No matching .interct files found in $input_dir" fi done Output --------------- In the above example, the motif length is set to a minimum of 10 bases and a maximum of 15 bases, and parallel processing is enabled using the -user_cores=6 option. Therefore, the output includes the PSAM matrix and interaction matrix for each motif length, as well as the corresponding log files. In addition, a heat map is generated based on the interaction matrix for each motif length to describe the interactions between dinucleotides in a visual way. Take a motif_length = 10 as an example to show the output results. .. code-block:: bash out/demo2_out_bpf_t1 |- Gcm1_3732.1_v2_deBruijn.signal_mlpout_L10_1.mlp |- Gcm1_3732.1_v2_deBruijn.signal_Wij_L10_1.interct |- Gcm1_3732.1_v2_deBruijn.signal_Wij_L10_1.interct.png |- log_S1_L10 We can see that in addition to the .mlp file output includes the .interct file and the .png file, which showing the interaction of dinucleotides The .interct file is as follows: .. csv-table:: Gcm1_3732.1_v2_deBruijn.signal_Wij_L10_1.interct :header: "Position", "AA", "CA", "GA", "TA", "AC", "CC", "GC", "TC", "AG", "CG", "GG", "TG", "AT", "CT", "GT", "TT" :widths: 15, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20 "1, 2", -0.1839, 0.5667, -0.0791, -0.3174, 0.4036, -1.4229, 0.5429, 0.5358, -0.0944, 0.3169, -0.5147, -0.0546, -0.1259, 0.3461, -0.0248, -0.1497 "2, 3", -0.1990, 0.0050, 0.1606, 0.0722, 0.2437, -0.7533, 0.1562, 0.1383, 0.0228, 0.7213, -0.3275, -0.1176, -0.0363, 0.0791, 0.0281, 0.0086 "3, 4", 0.1015, -0.0917, 0.0412, 0.0380, -0.0357, -0.0407, 0.1022, -0.0335, -0.0458, -0.1985, 0.2065, -0.0027, -0.0336, 0.3146, -0.3516, 0.0417 "4, 5", 0.2113, 0.2792, 0.4463, -0.8162, -0.2361, 0.0155, -0.3162, 0.4467, 0.1113, 0.0777, -0.0489, 0.0201, 0.0962, -0.2653, -0.1225, 0.3306 "5, 6", 1.1588, -0.5396, 0.0662, -0.4105, -1.3276, 0.4479, 0.2765, 0.7198, 0.3222, 0.2200, -0.0162, -0.3504, 0.0253, -0.3279, 0.0932, 0.2499 "6, 7", 0.1606, 0.4440, 0.1442, -0.3421, -0.0085, -0.7673, 0.3789, 0.2534, -0.0880, 0.0473, 0.0530, -0.0157, -0.6421, 0.5196, -0.0345, -0.0828 "7, 8", 0.3251, -0.1055, -0.0734, -0.0989, -0.2770, 0.0713, 0.3103, -0.1284, -0.0192, 0.0301, -0.0796, 0.0200, 0.1731, 0.2080, -0.2732, 0.0076 "8, 9", -0.2821, 0.2749, -0.1789, 0.1559, -0.1264, -0.1123, 0.1400, 0.0637, -0.0672, 0.0722, -0.0547, -0.0094, 0.1615, -0.1401, 0.0775, -0.1552 "9, 10", -0.1739, -0.0223, 0.1040, -0.0507, -0.0535, -0.1507, -0.0158, 0.0602, 0.1584, -0.0267, -0.2948, 0.0419, -0.1878, 0.0043, 0.1842, -0.1372 The dinucleotide interaction heatmap of Gcm1_3732.1_v2_deBruijn.signal_Wij_L10_1.interct is as follows: .. image:: /example_figs/demo2_Gcm1_3732.1_v2_deBruijn.signal_Wij_L10_1.interct.png :alt: Alternative text describing the image :width: 600px :align: center demo3_shape: ============= Considering that local DNA structural properties, such as the geometry of the DNA molecule, can influence protein-DNA interactions, we incorporated these **DNA shape features** into our predictions of protein-DNA binding affinities. And the dinucleotide combination of DNA structure properties is derived from the DiProDB database. .. code-block:: bash sbatch job_demo3 # Alternatively, you can run the individual demos: ./demo3_shapemodel_ap2gamm.sh ./demo3_shapemodel_hnf4.sh ./demo3_shapemodel_ap2gamm_parallel.sh Demo3 includes two types of transcription factors (TFs), ap2gamm and hnf4, which correspond to two sub-demos: demo3_shapemodel_ap2gamm.sh and demo3_shapemodel_hnf4.sh. Additionally, there is a sub-demo, demo3_shapemodel_ap2gamm_parallel.sh, which explores the parallelization options for ap2gamm and is used to test the time efficiency of parallel processing. Taking ap2gamm as an example (demo3_shapemodel_ap2gamm_parallel), here is a basic usage example. Input --------------- 1. demo3_shape/in/ap2gamma.signal 2. demo3_shape/in/ap2gamma.fasta 3. demo3_shape/shape_feature_files/DiProDB_PCs.dat Code ----------- In the demo3_shapemodel_ap2gamm.sh script, we added the **dinuc_transform** parameter to the shape model. .. code-block:: bash echo "Export at: demos/demo3_shape/out/demo3_ap2gamm_out_bpf_t3" echo "Fitting new shape model with parallel and user_cores parameter" (time bpf bayesPI -seq=../in/ap2gamma.fasta \ -exp=../in/ap2gamma.signal \ -out=../out/demo3_ap2gamm_out_bpf_t3/ \ -strand=2 -dependence=1 -normalize=4 -b1_fixed=0 \ -max_iteration=500 -max_evidence=25 \ -max_L=10 -min_L=7 \ -parallel -user_cores=4 \ -dinuc_transform=../shape_feature_files/DiProDB_PCs.dat -max_loop=1 -seed=1 ) 2> ../out/demo3_ap2gamm_out_bpf_t3/error.demo3_bayesPI_shape The following code indicates that we rename the output of shape model, convert predicted di-nucleotide interaction matric to shape features based on an input table of di-nucleotide shape features with **bpf interct2shape** functionality and plot the heatmap of shape features with **bpf shape_heatmap** functionality. .. code-block:: bash mv ../out/demo3_ap2gamm_out_bpf_t3/ap2gamma.signal_mlpout_L7_1.mlp ../out/demo3_ap2gamm_out_bpf_t3/shape_new_L7_1.mlp mv ../out/demo3_ap2gamm_out_bpf_t3/ap2gamma.signal_Wij_L7_1.interct ../out/demo3_ap2gamm_out_bpf_t3/shape_new_L7_1.interct echo "Option1.1: Converting dinucliotide models to shape space" echo "Output: shape_new_L7_1.shape" bpf interct2shape -sff=../shape_feature_files/DiProDB_PCs.dat -imf=../out/demo3_ap2gamm_out_bpf_t3/shape_new_L7_1.interct echo echo "Option1.2: direct shape heatmaps ... " bpf shape_heatmap -sf=../out/demo3_ap2gamm_out_bpf_t3/shape_new_L7_1.shape Output --------------- In the above example, the motif length is set to a minimum of 7 bases and a maximum of 10 bases. Additionally, the parallel options are set with -user_cores=4. Take a motif_length = 7 as an example to show the output results. .. code-block:: bash out/demo3_ap2gamm_out_bpf_t3 |- shape_new_L7_1.mlp |- shape_new_L7_1.interct |- shape_new_L7_1.shape |- shape_new_L7_1.shape.png |- log_S1_L7 We can see that in addition to the .mlp file and .interct file output includes .shape file and the .png file. The .shape file is as follows: .. csv-table:: shape_new_L7_1.shape :header: "Position", "PC1_Major", "PC2_FreeEnergy", "PC3_TwistRoll", "PC4_Minor", "PC5_TiltRoll" :widths: 15, 20, 20, 20, 20, 20 "1, 2", -0.3451, 0.6526, -0.4285, 0.0953, 0.7080 "2, 3", -0.0821, -0.0027, 0.4511, -1.1831, -0.2364 "3, 4", 0.0510, 0.0104, 0.0370, -0.0117, 0.1000 "4, 5", 0.0647, -0.0284, 0.2287, 0.1050, 0.0953 "5, 6", 0.0089, 0.0274, -0.0710, 0.0467, -0.1738 "6, 7", -1.1065, -0.3864, 0.6386, 0.4610, -0.3984 The heatmap of shape features of shape_new_L7_1.shape is as follows: .. image:: /example_figs/demo3_shape_new_L7_1.shape.png :alt: Alternative text describing the image :width: 600px :align: center demo4_methy: ============== Considering that DNA methylation can alter TF binding potential by either enhancing or reducing affinity at specific positions, we incorporate methylation levels to quantify the methylation intensity. This can be applied at each position or across the entire motif, depending on the **methylation_hires** option. A positive number indicates that the position is highly methylated; a negative number indicates that the position is less methylated; close to zero means no effect The job_demo4 contains several sub-demos. The file names include hires, which means calculating the methylation factor of each position, parallel, which means using the parallel option to test the time spent, and seed, which means using different seeds to synthesize the input data. .. code-block:: bash sbatch job_demo4_parallel_hires ./demo4_methylation_hires_parallel.sh sbatch job_demo4_parallel_no_hires ./demo4_methylation_parallel.sh sbatch job_demo4_seed1 ./demo4_methylation_hires_seed1.sh ./demo4_methylation_seed1.sh sbatch job_demo4_seed2 ./demo4_methylation_hires_seed2.sh ./demo4_methylation_seed2.sh sbatch job_demo4_seed3 ./demo4_methylation_hires_seed3.sh ./demo4_methylation_seed3.sh Here, take demo4_methylation_hires_parallel.sh script and demo4_methylation_parallel.sh script as examples. Input --------------- 1. demo4_methy/in/smooth/seed_1_methylation_positive_motif_ctaat.signal 2. demo4_methy/in/smooth/seed_1_methylation_positive_motif_ctaat.fasta 3. demo4_methy/in/smooth/seed_1_methylation_positive_motif_ctaat.methylation Code ----------- In the demo4_methylation_hires_parallel.sh script, we've added the **methylation_data** and **methylation_hires** parameters. and incorporated the **bpf sequence_logo** functionality to visualize sequence markers with **with_methy** flag. .. code-block:: bash echo "1.Fitting the methyalation model with parallel and user_cores parameter" (time bpf bayesPI -seq=../in/smooth/seed_1_methylation_positive_motif_ctaat.fasta \ -exp=../in/smooth/seed_1_methylation_positive_motif_ctaat.signal \ -out=../out_hires/seed_1_methylation_positive_motif_ctaat.fasta/parallel \ -methylation_data=../in/smooth/seed_1_methylation_positive_motif_ctaat.methylation \ -methylation_hires \ -max_L=$Max_LENGTH -min_L=$Min_LENGTH -max_loop=1 \ -seed=$SEED -strand=$STRAND -normalize=$NORMALIZE \ -max_iteration=$ITERATION -max_evidence=$EVIDENCE \ -information_content_threshold=0.1 \ -parallel -user_cores=5) 2> ../out_hires/seed_1_methylation_positive_motif_ctaat.fasta/parallel/error.demo4_parallel ## using the methylation value echo "Enter the entire folder when drawing sequence markers with methylation files" bpf sequence_logo -ff=../out_hires/seed_1_methylation_positive_motif_ctaat.fasta/parallel \ -sn='seed_1_methylation_positive_motif_ctaat.signal' \ -max_L=9 -min_L=5 \ -max_N=1 -min_N=1 \ -with_methy In the demo4_methylation_parallel.sh script, we added the **methylation_data** parameter without the **methylation_hires** flag. .. code-block:: bash echo "1.Fitting the methyalation model with parallel and user_cores parameter" (time bpf bayesPI -seq=../in/smooth/seed_1_methylation_positive_motif_ctaat.fasta \ -exp=../in/smooth/seed_1_methylation_positive_motif_ctaat.signal \ -out=../out/seed_1_methylation_positive_motif_ctaat.fasta/parallel \ -methylation_data=../in/smooth/seed_1_methylation_positive_motif_ctaat.methylation \ -max_L=$Max_LENGTH -min_L=$Min_LENGTH -max_loop=1 \ -seed=$SEED -strand=$STRAND -normalize=$NORMALIZE \ -max_iteration=$ITERATION -max_evidence=$EVIDENCE \ -information_content_threshold=0.1 \ -parallel -user_cores=5) 2> ../out/seed_1_methylation_positive_motif_ctaat.fasta/parallel/error.demo4_parallel Output --------------- Take a motif_length = 7 in the demo4_methylation_hires_parallel.sh script as an example to show the output results. .. code-block:: bash out_hires/seed_1_methylation_positive_motif_ctaat.fasta |- parallel | |- seed_1_methylation_positive_motif_ctaat.signal_mlpout_L7_1.mlp | |- seed_1_methylation_positive_motif_ctaat.signal_mlpout_L7_1.methylation_factors | |- seed_1_methylation_positive_motif_ctaat.signal_mlpout_L7_1_with_methy.jpg | |- seed_1_methylation_positive_motif_ctaat.signal_mlpout_L7_1.eps | |- log_S3_L7 .. csv-table:: seed_1_methylation_positive_motif_ctaat.signal_mlpout_L7_1.methylation_factors :header: "methylation_factors" :widths: 10 0.0207718 0.0246429 0.0885084 0.0452167 0.0656626 0.0318169 0.077055 The sequence logo of seed_1_methylation_positive_motif_ctaat.signal_mlpout_L7_1.mlp with methylation_factors is as follows: .. image:: /example_figs/demo4_seed_1_methylation_positive_motif_ctaat_with_methy.jpg :alt: Alternative text describing the image :width: 600px :align: center demo5_affnity: ================== In this demo, we utilize **bayesPI_affinity** tool to estimate the TF binding affinity. There are several steps using different data and parameters to run. Here’s a brief explanation of the steps 5 in this script, using it as an example: Input --------------- 1. demo5_affnity/in/Hnf4a_2640.2_v1_deBruijn.txt_fa 2. demo5_affnity/in/Hnf4a_2640.2_v2_deBruijn.txt_Int_mlpout_L12_1.mlp 3. demo5_affnity/in/Hnf4a_2640.2_v2_deBruijn.txt_Int_Wij_L12_1.intect Code ----------- .. code-block:: bash echo "5.Scan a PSM with its predicted di-nucleotide depence matrix on two strands of DNA sequence where 3bp flank sequences are added in both sides. #to compute the affinity" rm ../out/Hnf4a_2640.2_v2_deBruijn.txt_Int_mlpout_L12_1.mlp_bindingP_0_randomBackground_100 &> /dev/null ../../bin/$OS/bayesPI_affinity -out=../out/ \ -seq=../in/Hnf4a_2640.2_v1_deBruijn.txt_fa \ -psam=../in/Hnf4a_2640.2_v2_deBruijn.txt_Int_mlpout_L12_1.mlp \ -di_nucfile=../in/Hnf4a_2640.2_v2_deBruijn.txt_Int_Wij_L12_1.intect \ -shuffle=100 -flank_sequence=3 -strand=2 &> /dev/null Output --------------- .. code-block:: bash out |- Hnf4a_2640.2_v2_deBruijn.txt_Int_mlpout_L12_1.mlp_bindingP_0 |- Hnf4a_2640.2_v2_deBruijn.txt_Int_mlpout_L12_1.mlp_bindingP_0_randomBackground_100 demo6_chemicalPotential: =========================== In this demo, we utilize **bayesPI_chemicalPotential** tool to estimate TF chemical potential. Input --------------- 1. demo6_chemicalPotential/in/Hnf4a_2640.2_v1_deBruijn.txt_fa 2. demo6_chemicalPotential/in/Hnf4a_2640.2_v2_deBruijn.txt_Int_mlpout_L12_1.mlp 3. demo6_chemicalPotential/in/Hnf4a_2640.2_v2_deBruijn.txt_Int_Wij_L12_1.intect 4. demo6_chemicalPotential/in/Hnf4a_2640.2_v1_deBruijn.txt_Int Code ----------- .. code-block:: bash echo "Estimate chemical potential for a PSM with di-nulceodtid dependence matrix and add 6bp flank seuquence on both sides of DNA sequences" rm ../out/Hnf4a_2640.2_v1_deBruijn.txt_Int_Wij_L12_4.interct &> /dev/null rm ../out/Hnf4a_2640.2_v1_deBruijn.txt_Int_mlpout_L12_4.mlp &> /dev/null ../../bin/$OS/bayesPI_chemicalPotential -out=../out/ \ -seq=../in/Hnf4a_2640.2_v1_deBruijn.txt_fa \ -psam=../in/Hnf4a_2640.2_v2_deBruijn.txt_Int_mlpout_L12_1.mlp \ -di_nucfile=../in/Hnf4a_2640.2_v2_deBruijn.txt_Int_Wij_L12_1.intect \ -exp=../in/Hnf4a_2640.2_v1_deBruijn.txt_Int \ -max_evidence=4 -flank_sequence=5 &> /dev/null Output --------------- .. code-block:: bash out |- Hnf4a_2640.2_v1_deBruijn.txt_Int_mlpout_L12_4.mlp |- Hnf4a_2640.2_v1_deBruijn.txt_Int_Wij_L12_4.interct