bayesPI

Arguments

Required arguments:

  • -seq: Input sequence file in FASTA format.

  • -exp: Input expression data file in tab-delimited format.

Optional arguments:

  • -out: Output directory (default: current working directory).

  • -normalize: Specifies the normalization method for input expression data, with the following possible values. 0: Z-score transformation of log expression data (default). 1: Log transformation of expression data. 2: No normalization applied to the input data. 3: Linear transform + log transform + Z-score normalization. 4: Z-score normalization + linear transform + log transform + Z-score normalization again.

    • Notes: Use mode 0 when the -exp file contains values that represent ratios (e.g., relative enrichment from a SELEX experiment). Use mode 2 when the data has already been normalized by other means. Use mode 4 for data representing counts (e.g., ChIP-seq read counts).
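
The three simplest -normalize modes (0, 1 and 2) can be sketched as follows. This is an illustration only, not bayesPI's own code: the function name normalize_expression is hypothetical, the natural logarithm and the population standard deviation are assumptions, and modes 3 and 4 are left out because the linear transform they use is not specified here.

```python
import math
from statistics import mean, pstdev


def normalize_expression(values, mode=0):
    """Hypothetical sketch of -normalize modes 0, 1 and 2."""
    if mode == 2:
        # Mode 2: the data is used as-is.
        return list(values)

    # Modes 0 and 1 both start from the log of the data.
    # Assumption: natural log, and every value is strictly positive.
    logged = [math.log(v) for v in values]
    if mode == 1:
        return logged

    if mode == 0:
        # Z-score of the log data: subtract the mean, divide by the
        # standard deviation (population std is an assumption).
        mu = mean(logged)
        sigma = pstdev(logged)
        return [(x - mu) / sigma for x in logged]

    raise ValueError("modes 3 and 4 are not covered by this sketch")
```

After mode 0 the values have zero mean and unit variance, which is why it suits ratio-like inputs whose absolute scale is arbitrary.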

  • -initial_beta: A list of initial values for the beta hyperparameter used in each motif fitting iteration (up to -max_loop). If the list contains fewer values than the number of iterations, the last value will be reused for the remaining iterations. Lowering this value increases regularization strength. Default: 50.
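
The rule for reusing the last -initial_beta value can be written out as a short sketch. The helper name betas_per_iteration is hypothetical and only illustrates the padding behaviour described above.

```python
def betas_per_iteration(initial_beta, max_loop):
    """Return one beta per fitting iteration (hypothetical helper).

    If fewer values than iterations are given, the last value is
    reused; an empty list falls back to the documented default of 50.
    """
    if not initial_beta:
        initial_beta = [50.0]
    last = len(initial_beta) - 1
    return [initial_beta[min(i, last)] for i in range(max_loop)]
```

For example, a list of two values with -max_loop set to 4 yields the first value once and the second value for the remaining three iterations.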

  • -constrain: Determines the type of regression algorithm used, with 0 for the default Bayesian regression and 1 for a constraint-based regression algorithm.

  • -dependence: Selects the dependence model for the nucleotide weight matrix. 0: No dependence between positions in the nucleotide weight matrix (default). 1: Dependence between adjacent nucleotides. 2: Fit the independent model first to obtain an initial seed motif, then fit the dependent model of dinucleotide interactions. 3: Load the independent matrix from the .mlp file given with the -psam option and fit only the dependent part of the model.

    • Notes: Mode 3 is switched to mode 2 after the first model is fit if -max_loop > 1. The dependency model is stored in a file with the “.interct” extension.

  • -psam: When specified, this .mlp file (representing a Position-Specific Affinity Matrix) will be used to initialize the independent matrix instead of random values. This parameter is required for -dependence=3.

  • -rsat_transform: When specified, the PWM (Position Weight Matrix) loaded with -psam will be transformed using the method from RSAT tools.

  • -strand: Used to specify which strand of the sequence to scan. 0: Scan the forward strand sequence (default, input sequence). 1: Scan the reverse complement of the input sequence. 2: Scan both forward and reverse strand sequences.
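
The three -strand modes amount to choosing which of the input sequence and its reverse complement are scanned. A minimal sketch, assuming uppercase ACGT-only input; the function names are hypothetical:

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sequences_to_scan(seq, strand=0):
    """Return the sequences scanned for a given -strand value."""
    if strand == 0:
        return [seq]                            # forward strand only
    if strand == 1:
        return [reverse_complement(seq)]        # reverse complement only
    if strand == 2:
        return [seq, reverse_complement(seq)]   # both strands
    raise ValueError("strand must be 0, 1 or 2")
```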

  • -p_value: Threshold for stopping the search for new PSAMs (default: 0.001).

  • -max_L: Maximum length of PSAM (default: the length of loaded PWM, or 8 if -psam is not given)

  • -min_L: Minimum length of PSAM (default: the length of loaded PWM, or 8 if -psam is not given)

  • -max_loop: Maximum number of iterations allowed for each position in the PSAM (default: 6).

  • -max_iteration: Maximum number of iterations in the SCG algorithm (default: 500).

  • -max_evidence: Maximum number of Bayesian Evidence estimations (default 3)

  • -seed: Sets the random seed for initializing parameters. Default is 0, which uses a time-based seed for randomization.

  • -split_sequences: 1 (default): Splits all input sequences into fragments containing only ACGT letters. Non-ACGT characters act as delimiters. 0: Removes all non-ACGT characters from the input sequences.
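
The difference between the two -split_sequences settings is easiest to see on a sequence containing N characters. The sketch below is an illustration, not the program's code; prepare_sequence is a hypothetical name, and it assumes that only uppercase A, C, G and T count as valid letters.

```python
import re


def prepare_sequence(seq, split_sequences=1):
    """Hypothetical sketch of the two -split_sequences behaviours."""
    if split_sequences == 1:
        # Non-ACGT runs act as delimiters; empty fragments are dropped.
        return [frag for frag in re.split(r"[^ACGT]+", seq) if frag]
    # Non-ACGT characters are removed and the remainder is joined,
    # so bases on either side of a gap become adjacent.
    return ["".join(re.findall(r"[ACGT]+", seq))]
```

With the input ACGTNNGGCA, setting 1 produces the two fragments ACGT and GGCA, while setting 0 produces the single sequence ACGTGGCA, in which a motif could span the former gap.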

  • -information_content_threshold: A threshold value between 0 and 1 that defines the minimum information content required for a fitted model to be considered valid. If the maximum information content of the model is below this threshold, it will not be exported. Default: 0.5.

  • -init_zero: Initializes the independent weights close to 0. If not specified, the initial independent weights are set to positive values, typically between 0.5 and 1, depending on the matrix size.

  • -b1_fixed: If set to 1 (default), the chemical potential (b1) will remain fixed when fitting the dependent part of the model in -dependence=2 mode. If set to 0, b1 will be optimized along with the dependent weights.

  • -dinuc_transform: Applies the linear transformation specified in the file to the dinucleotide parameters. This option is only relevant when -dependence=1, 2, or 3 is used. The file must be a tab-separated table with a header, where the first column lists the 16 possible dinucleotides, and the subsequent columns contain corresponding values for each dinucleotide. These values will be standardized. The gradient descent will be performed using the dinucleotide matrix transformed by the given matrix. For example, a DNA shape model can be fit using this mode, if the values in the table are the values of DNA shape features for each dinucleotide.
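
The table layout expected by -dinuc_transform can be checked with a short reader such as the one below. It is a sketch under assumptions: load_dinuc_table is a hypothetical helper, and "standardized" is interpreted here as a per-column z-score using the population standard deviation, which may differ from what the program does.

```python
import csv
from itertools import product
from statistics import mean, pstdev

# The 16 dinucleotides that must appear in the first column.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def load_dinuc_table(handle):
    """Read a tab-separated dinucleotide table and z-score each column."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # first row: column names
    rows = {row[0]: [float(x) for x in row[1:]] for row in reader if row}

    missing = set(DINUCLEOTIDES) - set(rows)
    if missing:
        raise ValueError(f"missing dinucleotides: {sorted(missing)}")

    table = {}
    for j, name in enumerate(header[1:]):
        column = [rows[d][j] for d in DINUCLEOTIDES]
        mu, sigma = mean(column), pstdev(column)
        # Assumption: each feature column is standardized independently.
        table[name] = {d: (v - mu) / sigma
                       for d, v in zip(DINUCLEOTIDES, column)}
    return table
```

Passing an open file handle (or an io.StringIO for testing) returns one dictionary per feature column, keyed by dinucleotide.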

  • -flank_motif: Used in conjunction with the -psam option to add flanking base pairs to the loaded matrix. The total length of the motif will be L + 2 * F, where L is the length of the loaded matrix and F is the value given by this option (default: 0).

  • -flank_sequence: Adds flanking sequences to both sides of the DNA sequence. The default value is 0, which means no flanking nucleotides are added.

  • -methylation_normalize: 0: Z-score transformation of log-transformed expression data (default). 1: Normalization using a cutoff approach. 2: Use raw data without any normalization.

  • -methylation_data: Loads and uses the file containing methylation levels for each nucleotide. The file should have one line per sequence, where each line starts with the sequence name followed by a tab-separated list of real numbers representing the methylation levels of each nucleotide in the sequence. When specified, the model will add an additional weight at each motif position, reflecting the affinity of that position for methylation. These factors will be exported in a file with the “.methylation_factors” extension.
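
The -methylation_data file format described above (sequence name, then tab-separated methylation levels) can be validated before a run with a sketch like the following. The function name load_methylation and the optional length check against the input sequences are assumptions for illustration.

```python
def load_methylation(lines, sequences=None):
    """Parse methylation lines into {sequence name: [levels]}.

    `sequences` is an optional {name: sequence} mapping; when given,
    each line must carry exactly one level per nucleotide.
    """
    data = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        name = fields[0]
        levels = [float(x) for x in fields[1:]]
        if sequences is not None and name in sequences:
            if len(levels) != len(sequences[name]):
                raise ValueError(
                    f"{name}: {len(levels)} levels for "
                    f"{len(sequences[name])} nucleotides"
                )
        data[name] = levels
    return data
```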

  • -methylation_hires: Uses the position-dependent methylation model. By default, the summary methylation model is used.

  • -cutoff_value: Default: NAN. This only affects the data loaded with -methylation_data, and only if -methylation_normalize=1 is set. If specified as a value between 0 and 1, the program will change methylation levels to either 0 or 1 based on the provided cut-off threshold.
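
The cut-off step can be sketched as a simple thresholding function. Two details are assumptions rather than documented behaviour: levels equal to the cut-off are mapped to 1, and a NaN cut-off leaves the levels unchanged. The name apply_cutoff is hypothetical.

```python
import math


def apply_cutoff(levels, cutoff_value=math.nan):
    """Binarize methylation levels against a cut-off (sketch)."""
    if math.isnan(cutoff_value):
        # Assumption: the NAN default means no thresholding is applied.
        return list(levels)
    if not 0.0 <= cutoff_value <= 1.0:
        raise ValueError("cutoff_value must be between 0 and 1")
    # Assumption: values at or above the cut-off become 1.
    return [1.0 if x >= cutoff_value else 0.0 for x in levels]
```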

  • -fermi_dirac: Toggles the Fermi-Dirac function in the mlp_forward calculation for non-chemical potential. Default: off (false), in which case the non-Fermi-Dirac function is used.

  • -parallel: Enables parallel execution. When specified, the program attempts to divide the workload and run it in parallel on the available cores.

  • -user_cores: Number of CPU cores to use for parallel processing. By default, the program uses only one core.

  • -h: Prints this help message.