Difference between revisions of "PAKAP"

From Applied Bioinformatics Group
Jump to: navigation, search
(How to run?)
(How to run?)
Line 34: Line 34:
 
[PIPELINE]
 
[PIPELINE]
 
# This should be the path to the directory containing the pipeline scripts.
 
# This should be the path to the directory containing the pipeline scripts.
SCRIPTS_PATH = /ebi/bscratch/jarrah/uqphilippb/2014_4_17_FixPAV/CSR_pipeline
+
SCRIPTS_PATH = /scratch/pipeline
 
# OPTIONAL: Path to reference
 
# OPTIONAL: Path to reference
REFERENCE_PATH = /ebi/bscratch/jarrah/uqphilippb/2014_4_17_FixPAV/CSR_pipeline/A.thaliana.whole.fasta
+
REFERENCE_PATH = /scratch/temp/ref.fasta
 
# OPTIONAL: Path to gene annotation on reference
 
# OPTIONAL: Path to gene annotation on reference
GFF_PATH =
+
GFF_PATH = /scratch/temp/ref.gff3
  
 
[JELLYFISH]
 
[JELLYFISH]
 
# The path to the local installation of jellyfish (http://www.cbcb.umd.edu/software/jellyfish/)
 
# The path to the local installation of jellyfish (http://www.cbcb.umd.edu/software/jellyfish/)
PATH = /work2/NCISF/MAS/W26/jarrah/apps/bin
+
PATH = /scratch/bin
 
# The minimum k-mer size will be checked when KMER_SIZE=AUTO. Default is 5
 
# The minimum k-mer size will be checked when KMER_SIZE=AUTO. Default is 5
 
MIN_KMER_SIZE=5
 
MIN_KMER_SIZE=5
Line 55: Line 55:
 
# treatment 2 ID which will be used for naming files  
 
# treatment 2 ID which will be used for naming files  
 
T2_ID=STE
 
T2_ID=STE
OUT_DIR=/ebi/bscratch/jarrah/uqphilippb/2014_4_17_FixPAV/CSR_pipeline/
+
# The output directory of the initial kmer counting, for jellyfish
 +
OUT_DIR=/scratch/somewhere
  
 
# directory storing treatment 1 data files. The data files can be fasta or fastq formats.
 
# directory storing treatment 1 data files. The data files can be fasta or fastq formats.
DATA_DIR_T1=/ebi/bscratch/jarrah/uqphilippb/2014_4_17_FixPAV/CSR_pipeline/whole/whole_2.150.1sd/
+
DATA_DIR_T1=/scratch/somewhere/whole/whole_2.150.1sd/
 
# directory storing treatment 2 data files. The data files can be fasta or fastq formats.
 
# directory storing treatment 2 data files. The data files can be fasta or fastq formats.
DATA_DIR_T2=/ebi/bscratch/jarrah/uqphilippb/2014_4_17_FixPAV/CSR_pipeline/truncated/truncated_2.150.1sd/
+
DATA_DIR_T2=/scratch/somewhere/truncated/truncated_2.150.1sd/
  
 
[ADVANCED]
 
[ADVANCED]
Line 74: Line 75:
 
# Minimum occurrence count for a k-mer to be considered as a candidate Presence / Absence K-mer. [Default: 4]
 
# Minimum occurrence count for a k-mer to be considered as a candidate Presence / Absence K-mer. [Default: 4]
 
MIN_OCC = 4</nowiki>
 
MIN_OCC = 4</nowiki>
 +
 +
The most important settings are:
 +
 +
* SCRIPTS_PATH - this is where you extracted the pipeline
 +
* REFERENCE_PATH - optional: if you have a reference fasta. SOAPaligner will be run if this is specified.
 +
* GFF_PATH - optional: The SOAP results will be compared against genes if this is specified.
 +
* JELLYFISH / PATH - the directory in which Jellyfish is located
 +
* IDENTIFYKMERSIZE / OUT_DIR - the dirctory in which temporary kmers are stored (might get huge)
 +
* DATA_DIR_T1 and DATA_DIR_T2 - where the fasta files for kmer counting are located
  
 
The command is:
 
The command is:

Revision as of 05:28, 28 April 2014

INTRODUCTION (from paper?)


We have established a Present/Absent Kmer Analysis Pipeline (PAKAP). By reducing each read to component k-mers and comparing the relative abundance of these sub-sequences, we overcome statistical limitations of whole read comparative analysis.

PAKAP consists of a series of scripts written in Perl, Python and Bash scripts and requires Jellyfish [Marcais 2011] as well as optionally SOAPaligner. The scripts are freely available for non-commercial use.


What does PAKAP depend on?

  • Jellyfish for fast kmer counting
  • Some non-standard Perl modules:
    • bioperl
      • Bio::SeqIO
      • Bio::SearchIO
    • Parallel::ForkManager
    • Statistics::Descriptive
    • Config::IniFiles
    • GD::Graph::linespoints (for the script identifyKmerSize)
  • Optional: SOAPaligner

Download

  • Latest Version 1.0:
    • INSERT LINK

How to install?

  • Download the [link_here DiffKAP package].


How to run?

  • Create your project configuration file by using the example config file. Here it is:
[PIPELINE]
# This should be the path to the directory containing the pipeline scripts.
SCRIPTS_PATH = /scratch/pipeline
# OPTIONAL: Path to reference
REFERENCE_PATH = /scratch/temp/ref.fasta
# OPTIONAL: Path to gene annotation on reference
GFF_PATH = /scratch/temp/ref.gff3

[JELLYFISH]
# The path to the local installation of jellyfish (http://www.cbcb.umd.edu/software/jellyfish/)
PATH = /scratch/bin
# The minimum k-mer size will be checked when KMER_SIZE=AUTO. Default is 5
MIN_KMER_SIZE=5
# The maximum k-mer size will be checked when KMER_SIZE=AUTO, jellyfish limits to the maximum k-mer size to 31. Default is 22
MAX_KMER_SIZE=22

[IDENTIFYKMERSIZE]
# Number of CPU will be used. Be aware of that the total amount of memory will be shared among all CPUs.
NUM_OF_PROCESSOR=4
# treatment 1 ID which will be used for naming files 
T1_ID=CON
# treatment 2 ID which will be used for naming files 
T2_ID=STE
# The output directory of the initial kmer counting, for jellyfish
OUT_DIR=/scratch/somewhere

# directory storing treatment 1 data files. The data files can be fasta or fastq formats.
DATA_DIR_T1=/scratch/somewhere/whole/whole_2.150.1sd/
# directory storing treatment 2 data files. The data files can be fasta or fastq formats.
DATA_DIR_T2=/scratch/somewhere/truncated/truncated_2.150.1sd/

[ADVANCED]
# Important setting in jellyfish for tuning jellyfish performance. A larger hash size, more memory will be used 
#   but less sub-count files will be generated. Default is 10000000.
JELLYFISH_HASH_SIZE=10000000
# Size of the jellyfish hash table. Use a size large enough to contain all of the K-mers such that 80% * s > number of distinct K-mers. [Default: 10000000].
#jellyfish_hash_size = 16G
# Length of counter in the hash table. See jellyfish manual for explanation. [Default: 4]
JELLYFISH_COUNTER_BITS = 18
# Number of bases used for splitting the k-mer files into sub-files. A larger number reduces RAM usage. Maximum at the moment is 3. [Default: 2]
N_SPLIT_BASES = 3
# Minimum occurrence count for a k-mer to be considered as a candidate Presence / Absence K-mer. [Default: 4]
MIN_OCC = 4

The most important settings are:

  • SCRIPTS_PATH - this is where you extracted the pipeline
  • REFERENCE_PATH - optional: if you have a reference fasta. SOAPaligner will be run if this is specified.
  • GFF_PATH - optional: The SOAP results will be compared against genes if this is specified.
  • JELLYFISH / PATH - the directory in which Jellyfish is located
  • IDENTIFYKMERSIZE / OUT_DIR - the dirctory in which temporary kmers are stored (might get huge)
  • DATA_DIR_T1 and DATA_DIR_T2 - where the fasta files for kmer counting are located

The command is:

   python pipeline.py 
        --s1 ./truncated/truncated_2.150.1sd/truncated_2.150.1sd.R1.fasta ./truncated/truncated_2.150.1sd/truncated_2.150.1sd.R2.fasta 
        --s2 ./whole/whole_2.150.1sd/whole_2.150.1sd.R1.fasta ./whole/whole_2.150.1sd/whole_2.150.1sd.R2.fasta 
        -c default.config  
        -o output_folder/

or, easier:

   python pipeline.py 
        --s1 ./truncated/truncated_2.150.1sd/truncated_2.150.1sd.R?.fasta
        --s2 ./whole/whole_2.150.1sd/whole_2.150.1sd.R?.fasta
        -c default.config  
        -o output_folder/

This will read the configuration from default.config and generate all output files in output_folder.

How to interpret the results?

  • You can download the results of the sample data here.


FAQ


Reference

  • Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 764-770.


Back to Main_Page