User manual
Both CoMetGeNe.py (trail finding) and grouping.py (trail grouping) provide
detailed usage explanations.
CoMetGeNe.py usage
Run
CoMetGeNe.py -h
to obtain the user manual:
usage: CoMetGeNe.py [-h] [--delta_G NUMBER] [--delta_D NUMBER]
[--timeout SECONDS] [--output OUTPUT] [--skip-import]
ORG DIR
Determines maximum trails of reactions for the specified organisms such that the
genes encoding the enzymes involved in the trails are neighbors.
A trail of reactions is a sequence of reactions that can repeat reactions
(vertices), but not arcs between reactions.
Metabolic pathways and genomic information are automatically retrieved from the
KEGG knowledge base.
Required arguments:
ORG query organism (three- or four-letter KEGG code, e.g.
'eco' for Escherichia coli K-12 MG1655). See full list
of KEGG organism codes at
http://rest.kegg.jp/list/genome
DIR directory storing metabolic pathways for the query
organism ORG or where metabolic pathways for ORG will
be downloaded
Optional arguments:
-h, --help show this help message and exit
--delta_G NUMBER, -dG NUMBER
the NUMBER of genes that can be skipped (default: 0)
--delta_D NUMBER, -dD NUMBER
the NUMBER of reactions that can be skipped (default: 0)
--timeout SECONDS, -t SECONDS
timeout in SECONDS (default: 300)
--output OUTPUT, -o OUTPUT
output file
--skip-import, -s skips importing metabolic pathways from KEGG,
attempting to use locally stored KGML files if they
are present under the specified directory (DIR)
--both-strands, -b considers neighboring genes on both strands of a given
chromosome (by default, only genes located on a single
strand are considered neighbors)
Example: running
python2 CoMetGeNe.py eco data/ -dG 2 -o eco.out
downloads metabolic pathways for species 'eco' to directory 'data/'. Trail
finding is performed, allowing two genes to be skipped at most (-dG 2).
Reactions cannot be skipped (-dD is 0 by default). Maximum trails of reactions
such that the reactions are catalyzed by products of neighboring genes are saved
in the output file 'eco.out'.
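The help text above defines a trail as a sequence of reactions in which
reactions (vertices) may repeat but arcs between reactions may not. The
following is a minimal sketch of that definition only, not CoMetGeNe's actual
implementation. The function name is_trail and the representation of arcs as a
set of (source, target) pairs are assumptions made for illustration.

```python
def is_trail(reactions, arcs):
    """Return True if `reactions` is a trail in the directed graph `arcs`.

    `arcs` is a set of (source, target) pairs (hypothetical representation).
    A trail may visit a reaction more than once, but may use each arc only once.
    """
    used = set()
    # Walk over consecutive pairs of reactions, i.e. the arcs that are traversed.
    for arc in zip(reactions, reactions[1:]):
        if arc not in arcs or arc in used:
            return False
        used.add(arc)
    return True


arcs = {("R1", "R2"), ("R2", "R3"), ("R3", "R1")}
print(is_trail(["R1", "R2", "R3", "R1"], arcs))        # True: R1 repeats, no arc does
print(is_trail(["R1", "R2", "R3", "R1", "R2"], arcs))  # False: arc (R1, R2) is reused
```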
grouping.py usage
Run
grouping.py -h
to obtain the user manual:
usage: grouping.py [-h] [--output OUTPUT] {genes,reactions} RESULTS KGML ORG
Groups CoMetGeNe trails by either genes or reactions, optionally producing
a CSV file.
Required arguments:
{genes,reactions} type of trail grouping to perform (possible values:
'genes' or 'reactions')
RESULTS directory storing CoMetGeNe results
KGML directory containing input KGML files
ORG reference species (KEGG organism code)
Optional arguments:
-h, --help show this help message and exit
--output OUTPUT, -o OUTPUT
output file (CSV)
KGML needs to contain a subdirectory for every species for which a result file
is present in RESULTS. The subdirectory names need to be the three- or four-
letter KEGG codes for the species in question (e.g., 'bsu', 'eco', 'pae',
etc.). Each species subdirectory is expected to contain metabolic pathways in
KGML format.
Example: running
python2 grouping.py genes results/ data/ eco -o grouping_gene_eco.csv
will perform trail grouping by genes for the reference species 'eco'. The
CoMetGeNe results are stored in 'results/', and the KGML files are available in
'data/'. A CSV file is produced ('grouping_gene_eco.csv').
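Before running grouping.py, it can help to verify that the KGML directory
contains one subdirectory per species, as required above. The following is a
minimal sketch of such a check. The helper missing_kgml_dirs is hypothetical and
not part of CoMetGeNe, and the list of species codes must be supplied by the
caller, since the naming of result files in RESULTS is not described here.

```python
import os


def missing_kgml_dirs(kgml_dir, species_codes):
    """Return the species codes that have no subdirectory under `kgml_dir`.

    Hypothetical helper: it only checks that the subdirectories exist, not
    that they actually contain KGML files.
    """
    return [
        code for code in species_codes
        if not os.path.isdir(os.path.join(kgml_dir, code))
    ]


# Example: report species whose KGML subdirectory is absent under 'data/'.
print(missing_kgml_dirs("data", ["bsu", "eco", "pae"]))
```

An empty list means every listed species has a subdirectory under the given
directory.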
CoMetGeNe_launcher.py
CoMetGeNe_launcher.py does not accept command-line arguments, but its parallel
trail finding can easily be configured by altering a few variables in the
script. Examples:
- If you wish to have CoMetGeNe consider only genes on the same strand of a
chromosome as neighbors, then leave the following variable unchanged:
both_strands = False
If, however, you wish to take into account genes on both strands when
defining gene neighborhoods, then set both_strands to True.
- If you wish to run trail finding for species aae, bbn, eco, and mpn instead
of the default 50 bacterial species, then simply modify the variable
org_codes as follows:
org_codes = ['aae', 'bbn', 'eco', 'mpn']
- If you wish to save metabolic pathways in a directory KGML/ rather than the
default data/, then simply modify the variable kgml_dir as follows:
kgml_dir = 'KGML'
- If you wish to save CoMetGeNe results in a directory tf_20180701/ rather than
the default results/, then simply modify the variable results_dir as follows:
results_dir = 'tf_20180701'
- If you wish to allow CoMetGeNe to skip at most 4 genes instead of the default
3, and at most 2 reactions instead of the default 3, then simply modify the
variables delta_genes_max and delta_reactions_max as follows:
delta_genes_max = 4
delta_reactions_max = 2
Blacklisted pathways
The underlying problem formulation for trail finding is NP-hard (see our
accompanying publication for more details). This is why CoMetGeNe.py sets a
timeout (of 5 minutes, by default) for analyzing a given metabolic pathway. If
this timeout is reached before the analysis finishes, then the pathway in
question is blacklisted, i.e. it is added to a list of exclusions for the
species and combination of gap parameters for which the analysis could not be
finished.
The blacklist is a text file called excluded_pathways.txt, placed in the
CoMetGeNe/ project root. If CoMetGeNe.py is rerun, it will no longer attempt to
analyze the blacklisted pathway. Likewise, if greater values of the gap
parameters than those recorded for a blacklisted pathway are given to
CoMetGeNe.py in a later run, the pathway in question will still not be
analyzed.
For example, if the blacklist contains the entry
cpe 2 2 00500
then the pathway 00500 will not be analyzed for species cpe when CoMetGeNe.py
is run with gap parameters (δD, δG) set to (2, 2), (2, 3), (3, 2), (3, 3), and
so on.
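The exclusion rule described above can be sketched as follows. This is an
illustration under stated assumptions, not CoMetGeNe's actual code. The function
names are hypothetical, and the field order of a blacklist entry is assumed to
be species, δD, δG, pathway. The sketch also assumes that a run with a smaller
value for either gap parameter is not excluded, which the manual does not state
explicitly.

```python
def parse_blacklist(lines):
    """Parse blacklist lines such as 'cpe 2 2 00500' into tuples.

    Assumed field order (not confirmed by the manual):
    species, delta_D, delta_G, pathway.
    """
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        org, delta_d, delta_g, pathway = fields
        entries.append((org, int(delta_d), int(delta_g), pathway))
    return entries


def is_blacklisted(entries, org, delta_d, delta_g, pathway):
    """Return True if some entry for (org, pathway) has gap values that are
    both less than or equal to the requested ones."""
    return any(
        e_org == org and e_pathway == pathway
        and delta_d >= e_dd and delta_g >= e_dg
        for e_org, e_dd, e_dg, e_pathway in entries
    )


entries = parse_blacklist(["cpe 2 2 00500"])
print(is_blacklisted(entries, "cpe", 2, 3, "00500"))  # True
print(is_blacklisted(entries, "cpe", 1, 2, "00500"))  # False (assumed behavior)
```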