PAKAP

INTRODUCTION (from paper?)

We have established a Present/Absent Kmer Analysis Pipeline (PAKAP). By reducing each read to component k-mers and comparing the relative abundance of these sub-sequences, we overcome statistical limitations of whole read comparative analysis.

PAKAP consists of a series of scripts written in Perl, Python and Bash scripts and requires Jellyfish [Marcais 2011] as well as optionally SOAPaligner. The scripts are freely available for non-commercial use.

What does PAKAP depend on?

Jellyfish for fast kmer counting
Some non-standard Perl modules:
- bioperl
  - Bio::SeqIO
  - Bio::SearchIO
- Parallel::ForkManager
- Statistics::Descriptive
- Config::IniFiles
- GD::Graph::linespoints (for the script identifyKmerSize)
Optional: SOAPaligner

Download

Latest Version 1.0:
- INSERT LINK

How to install?

Download the [link_here DiffKAP package].

How to run?

Create your project configuration file by using the example config file. Here it is:

[PIPELINE]
# This should be the path to the directory containing the pipeline scripts.
SCRIPTS_PATH = /ebi/bscratch/jarrah/uqphilippb/2014_4_17_FixPAV/CSR_pipeline
# OPTIONAL: Path to reference
REFERENCE_PATH = /ebi/bscratch/jarrah/uqphilippb/2014_4_17_FixPAV/CSR_pipeline/A.thaliana.whole.fasta
# OPTIONAL: Path to gene annotation on reference
GFF_PATH =

[JELLYFISH]
# The path to the local installation of jellyfish (http://www.cbcb.umd.edu/software/jellyfish/)
PATH = /work2/NCISF/MAS/W26/jarrah/apps/bin
# The minimum k-mer size will be checked when KMER_SIZE=AUTO. Default is 5
MIN_KMER_SIZE=5
# The maximum k-mer size will be checked when KMER_SIZE=AUTO, jellyfish limits to the maximum k-mer size to 31. Default is 22
MAX_KMER_SIZE=22

[IDENTIFYKMERSIZE]
# Number of CPU will be used. Be aware of that the total amount of memory will be shared among all CPUs.
NUM_OF_PROCESSOR=4
# treatment 1 ID which will be used for naming files 
T1_ID=CON
# treatment 2 ID which will be used for naming files 
T2_ID=STE
OUT_DIR=/ebi/bscratch/jarrah/uqphilippb/2014_4_17_FixPAV/CSR_pipeline/

# directory storing treatment 1 data files. The data files can be fasta or fastq formats.
DATA_DIR_T1=/ebi/bscratch/jarrah/uqphilippb/2014_4_17_FixPAV/CSR_pipeline/whole/whole_2.150.1sd/
# directory storing treatment 2 data files. The data files can be fasta or fastq formats.
DATA_DIR_T2=/ebi/bscratch/jarrah/uqphilippb/2014_4_17_FixPAV/CSR_pipeline/truncated/truncated_2.150.1sd/

[ADVANCED]
# Important setting in jellyfish for tuning jellyfish performance. A larger hash size, more memory will be used 
#   but less sub-count files will be generated. Default is 10000000.
JELLYFISH_HASH_SIZE=10000000
# Size of the jellyfish hash table. Use a size large enough to contain all of the K-mers such that 80% * s > number of distinct K-mers. [Default: 10000000].
#jellyfish_hash_size = 16G
# Length of counter in the hash table. See jellyfish manual for explanation. [Default: 4]
JELLYFISH_COUNTER_BITS = 18
# Number of bases used for splitting the k-mer files into sub-files. A larger number reduces RAM usage. Maximum at the moment is 3. [Default: 2]
N_SPLIT_BASES = 3
# Minimum occurrence count for a k-mer to be considered as a candidate Presence / Absence K-mer. [Default: 4]
MIN_OCC = 4

The command is:

   python pipeline.py 
        --s1 ./truncated/truncated_2.150.1sd/truncated_2.150.1sd.R1.fasta ./truncated/truncated_2.150.1sd/truncated_2.150.1sd.R2.fasta 
        --s2 ./whole/whole_2.150.1sd/whole_2.150.1sd.R1.fasta ./whole/whole_2.150.1sd/whole_2.150.1sd.R2.fasta 
        -c default.config  
        -o output_folder/

or, easier:

   python pipeline.py 
        --s1 ./truncated/truncated_2.150.1sd/truncated_2.150.1sd.R?.fasta
        --s2 ./whole/whole_2.150.1sd/whole_2.150.1sd.R?.fasta
        -c default.config  
        -o output_folder/

This will read the configuration from default.config and generate all output files in output_folder.

How to interpret the results?

You can download the results of the sample data here.

FAQ

Reference

Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 764-770.

Back to Main_Page

PAKAP

Contents

What does PAKAP depend on?

Download

How to install?

How to run?

How to interpret the results?

FAQ

Reference

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools