Difference between revisions of "PAKAP"
From Applied Bioinformatics Group
Philippbayer (talk | contribs) (→How to run?) |
Philippbayer (talk | contribs) (→How to run?) |
||
Line 34: | Line 34: | ||
[PIPELINE] | [PIPELINE] | ||
# This should be the path to the directory containing the pipeline scripts. | # This should be the path to the directory containing the pipeline scripts. | ||
− | SCRIPTS_PATH = / | + | SCRIPTS_PATH = /scratch/pipeline |
# OPTIONAL: Path to reference | # OPTIONAL: Path to reference | ||
− | REFERENCE_PATH = / | + | REFERENCE_PATH = /scratch/temp/ref.fasta |
# OPTIONAL: Path to gene annotation on reference | # OPTIONAL: Path to gene annotation on reference | ||
− | GFF_PATH = | + | GFF_PATH = /scratch/temp/ref.gff3 |
[JELLYFISH] | [JELLYFISH] | ||
# The path to the local installation of jellyfish (http://www.cbcb.umd.edu/software/jellyfish/) | # The path to the local installation of jellyfish (http://www.cbcb.umd.edu/software/jellyfish/) | ||
− | PATH = / | + | PATH = /scratch/bin |
# The minimum k-mer size will be checked when KMER_SIZE=AUTO. Default is 5 | # The minimum k-mer size will be checked when KMER_SIZE=AUTO. Default is 5 | ||
MIN_KMER_SIZE=5 | MIN_KMER_SIZE=5 | ||
Line 55: | Line 55: | ||
# treatment 2 ID which will be used for naming files | # treatment 2 ID which will be used for naming files | ||
T2_ID=STE | T2_ID=STE | ||
− | OUT_DIR=/ | + | # The output directory of the initial kmer counting, for jellyfish |
+ | OUT_DIR=/scratch/somewhere | ||
# directory storing treatment 1 data files. The data files can be fasta or fastq formats. | # directory storing treatment 1 data files. The data files can be fasta or fastq formats. | ||
− | DATA_DIR_T1=/ | + | DATA_DIR_T1=/scratch/somewhere/whole/whole_2.150.1sd/ |
# directory storing treatment 2 data files. The data files can be fasta or fastq formats. | # directory storing treatment 2 data files. The data files can be fasta or fastq formats. | ||
− | DATA_DIR_T2=/ | + | DATA_DIR_T2=/scratch/somewhere/truncated/truncated_2.150.1sd/ |
[ADVANCED] | [ADVANCED] | ||
Line 74: | Line 75: | ||
# Minimum occurrence count for a k-mer to be considered as a candidate Presence / Absence K-mer. [Default: 4] | # Minimum occurrence count for a k-mer to be considered as a candidate Presence / Absence K-mer. [Default: 4] | ||
MIN_OCC = 4</nowiki> | MIN_OCC = 4</nowiki> | ||
+ | |||
+ | The most important settings are: | ||
+ | |||
+ | * SCRIPTS_PATH - this is where you extracted the pipeline | ||
+ | * REFERENCE_PATH - optional: if you have a reference fasta. SOAPaligner will be run if this is specified. | ||
+ | * GFF_PATH - optional: The SOAP results will be compared against genes if this is specified. | ||
+ | * JELLYFISH / PATH - the directory in which Jellyfish is located | ||
+ | * IDENTIFYKMERSIZE / OUT_DIR - the dirctory in which temporary kmers are stored (might get huge) | ||
+ | * DATA_DIR_T1 and DATA_DIR_T2 - where the fasta files for kmer counting are located | ||
The command is: | The command is: |
Revision as of 05:28, 28 April 2014
INTRODUCTION (from paper?)
We have established a Present/Absent Kmer Analysis Pipeline (PAKAP). By reducing each read to component k-mers and comparing the relative abundance of these sub-sequences, we overcome statistical limitations of whole read comparative analysis.
PAKAP consists of a series of scripts written in Perl, Python and Bash scripts and requires Jellyfish [Marcais 2011] as well as optionally SOAPaligner. The scripts are freely available for non-commercial use.
Contents
What does PAKAP depend on?
- Jellyfish for fast kmer counting
- Some non-standard Perl modules:
- bioperl
- Bio::SeqIO
- Bio::SearchIO
- Parallel::ForkManager
- Statistics::Descriptive
- Config::IniFiles
- GD::Graph::linespoints (for the script identifyKmerSize)
- bioperl
- Optional: SOAPaligner
Download
- Latest Version 1.0:
- INSERT LINK
How to install?
- Download the [link_here DiffKAP package].
How to run?
- Create your project configuration file by using the example config file. Here it is:
[PIPELINE] # This should be the path to the directory containing the pipeline scripts. SCRIPTS_PATH = /scratch/pipeline # OPTIONAL: Path to reference REFERENCE_PATH = /scratch/temp/ref.fasta # OPTIONAL: Path to gene annotation on reference GFF_PATH = /scratch/temp/ref.gff3 [JELLYFISH] # The path to the local installation of jellyfish (http://www.cbcb.umd.edu/software/jellyfish/) PATH = /scratch/bin # The minimum k-mer size will be checked when KMER_SIZE=AUTO. Default is 5 MIN_KMER_SIZE=5 # The maximum k-mer size will be checked when KMER_SIZE=AUTO, jellyfish limits to the maximum k-mer size to 31. Default is 22 MAX_KMER_SIZE=22 [IDENTIFYKMERSIZE] # Number of CPU will be used. Be aware of that the total amount of memory will be shared among all CPUs. NUM_OF_PROCESSOR=4 # treatment 1 ID which will be used for naming files T1_ID=CON # treatment 2 ID which will be used for naming files T2_ID=STE # The output directory of the initial kmer counting, for jellyfish OUT_DIR=/scratch/somewhere # directory storing treatment 1 data files. The data files can be fasta or fastq formats. DATA_DIR_T1=/scratch/somewhere/whole/whole_2.150.1sd/ # directory storing treatment 2 data files. The data files can be fasta or fastq formats. DATA_DIR_T2=/scratch/somewhere/truncated/truncated_2.150.1sd/ [ADVANCED] # Important setting in jellyfish for tuning jellyfish performance. A larger hash size, more memory will be used # but less sub-count files will be generated. Default is 10000000. JELLYFISH_HASH_SIZE=10000000 # Size of the jellyfish hash table. Use a size large enough to contain all of the K-mers such that 80% * s > number of distinct K-mers. [Default: 10000000]. #jellyfish_hash_size = 16G # Length of counter in the hash table. See jellyfish manual for explanation. [Default: 4] JELLYFISH_COUNTER_BITS = 18 # Number of bases used for splitting the k-mer files into sub-files. A larger number reduces RAM usage. Maximum at the moment is 3. [Default: 2] N_SPLIT_BASES = 3 # Minimum occurrence count for a k-mer to be considered as a candidate Presence / Absence K-mer. [Default: 4] MIN_OCC = 4
The most important settings are:
- SCRIPTS_PATH - this is where you extracted the pipeline
- REFERENCE_PATH - optional: if you have a reference fasta. SOAPaligner will be run if this is specified.
- GFF_PATH - optional: The SOAP results will be compared against genes if this is specified.
- JELLYFISH / PATH - the directory in which Jellyfish is located
- IDENTIFYKMERSIZE / OUT_DIR - the dirctory in which temporary kmers are stored (might get huge)
- DATA_DIR_T1 and DATA_DIR_T2 - where the fasta files for kmer counting are located
The command is:
python pipeline.py --s1 ./truncated/truncated_2.150.1sd/truncated_2.150.1sd.R1.fasta ./truncated/truncated_2.150.1sd/truncated_2.150.1sd.R2.fasta --s2 ./whole/whole_2.150.1sd/whole_2.150.1sd.R1.fasta ./whole/whole_2.150.1sd/whole_2.150.1sd.R2.fasta -c default.config -o output_folder/
or, easier:
python pipeline.py --s1 ./truncated/truncated_2.150.1sd/truncated_2.150.1sd.R?.fasta --s2 ./whole/whole_2.150.1sd/whole_2.150.1sd.R?.fasta -c default.config -o output_folder/
This will read the configuration from default.config and generate all output files in output_folder.
How to interpret the results?
- You can download the results of the sample data here.
FAQ
Reference
- Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 764-770.
Back to Main_Page