Difference between revisions of "PAKAP"
Philippbayer (talk | contribs) (→How to run?) |
|||
(33 intermediate revisions by one other user not shown) | |||
Line 1: | Line 1: | ||
− | INTRODUCTION | + | INTRODUCTION |
+ | Presence/Absence variants (PAVs) have recently been shown to be abundant in plant and fungi genomes. Genome wide PAV detection is usually performed during the construction of pangenomes for species, using either the whole genome assembly and comparison or the iterative mapping and assembly approach. These rely on the presence of at least one reference genome, and can be confounded by variations in assembly and annotation quality. | ||
We have established a Present/Absent Kmer Analysis Pipeline (PAKAP). By reducing each read to component k-mers and comparing the relative abundance of these sub-sequences, we overcome statistical limitations of whole read comparative analysis. | We have established a Present/Absent Kmer Analysis Pipeline (PAKAP). By reducing each read to component k-mers and comparing the relative abundance of these sub-sequences, we overcome statistical limitations of whole read comparative analysis. | ||
Line 7: | Line 8: | ||
− | == What | + | == What PAKAP depends on == |
* [http://www.cbcb.umd.edu/software/jellyfish Jellyfish] for fast kmer counting | * [http://www.cbcb.umd.edu/software/jellyfish Jellyfish] for fast kmer counting | ||
* Some non-standard Perl modules: | * Some non-standard Perl modules: | ||
− | ** | + | ** BioPerl |
*** Bio::SeqIO | *** Bio::SeqIO | ||
*** Bio::SearchIO | *** Bio::SearchIO | ||
Line 18: | Line 19: | ||
** Config::IniFiles | ** Config::IniFiles | ||
** GD::Graph::linespoints (for the script identifyKmerSize) | ** GD::Graph::linespoints (for the script identifyKmerSize) | ||
+ | * Python 2.7. The pipeline runs under Python 3, but is not tested extensively. | ||
+ | ** BioPython | ||
* Optional: [http://soap.genomics.org.cn/soapaligner.html SOAPaligner] | * Optional: [http://soap.genomics.org.cn/soapaligner.html SOAPaligner] | ||
== Download == | == Download == | ||
− | * Latest Version 1. | + | * Latest Version 1.7: |
− | ** | + | ** [http://appliedbioinformatics.com.au/download/PAKAP_1.7.zip PAKAP_1.7.zip] |
− | + | * Example data (244MB) | |
− | * | + | ** [http://appliedbioinformatics.com.au/download/PAKAP_example_data.zip PAKAP_example_data.zip] |
+ | == How to install == | ||
+ | * Download the package from above, then just extract the files to a directory of your choosing. Remember the directory, since you'll need that in the next step. | ||
− | == How to | + | == How to prepare the pipeline == |
− | + | ||
+ | '''IMPORTANT:''' Currently, only single-line FASTA files are accepted by the pipeline, in other words, one line per identifier, one line per sequence. If you have FASTQ files, the included script FixFASTQ.py can transform them into the needed format. Usage: | ||
+ | |||
+ | python FixFASTQ.py your_reads.fastq single_line_reads.fasta | ||
+ | |||
+ | Create your project configuration file by using the example config file. Here it is: | ||
<nowiki> | <nowiki> | ||
[PIPELINE] | [PIPELINE] | ||
# This should be the path to the directory containing the pipeline scripts. | # This should be the path to the directory containing the pipeline scripts. | ||
− | SCRIPTS_PATH = / | + | SCRIPTS_PATH = /scratch/pipeline |
# OPTIONAL: Path to reference | # OPTIONAL: Path to reference | ||
− | REFERENCE_PATH = / | + | REFERENCE_PATH = /scratch/temp/ref.fasta |
# OPTIONAL: Path to gene annotation on reference | # OPTIONAL: Path to gene annotation on reference | ||
− | GFF_PATH = | + | GFF_PATH = /scratch/temp/ref.gff3 |
[JELLYFISH] | [JELLYFISH] | ||
# The path to the local installation of jellyfish (http://www.cbcb.umd.edu/software/jellyfish/) | # The path to the local installation of jellyfish (http://www.cbcb.umd.edu/software/jellyfish/) | ||
− | PATH = / | + | PATH = /scratch/bin |
# The minimum k-mer size will be checked when KMER_SIZE=AUTO. Default is 5 | # The minimum k-mer size will be checked when KMER_SIZE=AUTO. Default is 5 | ||
MIN_KMER_SIZE=5 | MIN_KMER_SIZE=5 | ||
Line 55: | Line 65: | ||
# treatment 2 ID which will be used for naming files | # treatment 2 ID which will be used for naming files | ||
T2_ID=STE | T2_ID=STE | ||
− | OUT_DIR=/ | + | # The output directory of the initial kmer counting, for jellyfish |
+ | OUT_DIR=/scratch/somewhere | ||
# directory storing treatment 1 data files. The data files can be fasta or fastq formats. | # directory storing treatment 1 data files. The data files can be fasta or fastq formats. | ||
− | DATA_DIR_T1=/ | + | DATA_DIR_T1=/scratch/somewhere/whole/whole_2.150.1sd/ |
# directory storing treatment 2 data files. The data files can be fasta or fastq formats. | # directory storing treatment 2 data files. The data files can be fasta or fastq formats. | ||
− | DATA_DIR_T2=/ | + | DATA_DIR_T2=/scratch/somewhere/truncated/truncated_2.150.1sd/ |
[ADVANCED] | [ADVANCED] | ||
Line 73: | Line 84: | ||
N_SPLIT_BASES = 3 | N_SPLIT_BASES = 3 | ||
# Minimum occurrence count for a k-mer to be considered as a candidate Presence / Absence K-mer. [Default: 4] | # Minimum occurrence count for a k-mer to be considered as a candidate Presence / Absence K-mer. [Default: 4] | ||
− | MIN_OCC = 4<nowiki> | + | MIN_OCC = 4</nowiki> |
+ | |||
+ | The most important settings are: | ||
+ | |||
+ | * SCRIPTS_PATH - this is where you extracted the pipeline | ||
+ | * REFERENCE_PATH - optional: if you have a reference fasta. SOAPaligner will be run if this is specified. | ||
+ | * GFF_PATH - optional: The SOAP results will be compared against genes if this is specified. | ||
+ | * JELLYFISH / PATH - the directory in which Jellyfish is located | ||
+ | * IDENTIFYKMERSIZE / OUT_DIR - the dirctory in which temporary kmers are stored (might get huge) | ||
+ | * DATA_DIR_T1 and DATA_DIR_T2 - where the fasta files for kmer counting are located | ||
+ | |||
+ | == How to run the pipeline == | ||
+ | |||
+ | The pipeline takes four arguments: 's1' specifies reads from the first group (or treatment etc.), 's2' specifies reads from the second group, 'c' is the path to the configuration file, and 'o' specifies the path to the directory which will contain all results. Per condition s1 or s2, unlimited files are possible. | ||
The command is: | The command is: | ||
Line 91: | Line 115: | ||
-o output_folder/ | -o output_folder/ | ||
− | This will read the configuration from default.config and generate all output files in output_folder. | + | This will read the configuration from default.config and generate all output files, as well as a log-file, in output_folder. |
+ | |||
+ | == How to interpret the results == | ||
+ | You can download the results of the sample data [link_here here]. | ||
+ | |||
+ | In the given out_dir, between two and four folders will be created. | ||
+ | |||
+ | 1. kmers - these are the k-mers present in s1 only, and in s2 only. K-mers present in s1 only start with s1. | ||
+ | |||
+ | 2. PARs - this folder contains the reads present only in one of the two conditions: | ||
+ | |||
+ | 2.1 ''s1_0_PARs_single.fasta'' - reads only present in s1 | ||
+ | |||
+ | 2.2 ''s2_0_PARs_single.fasta'' - reads only present in s2 | ||
+ | |||
+ | 3. SOAPs - optional, only when reference is specified: The SOAP results for aligning the reads only present in s1 or s2. Can be converted to SAM/BAM using SOAPaligner's soap2sam.pl and samtools view. | ||
+ | |||
+ | 4. Results - optional, only when gff file and reference are specified: A table detailing how many genes there are in the given gff-file, how many reads were produced by the pipeline, how many reads could be aligned, how many reads hit inside the genes in the gff3 file, how many reads hit only with their start or their end inside genes, how many reads hit multiple times etc. Conditions that have no reads produced (like in the example data) are listed like with '-'. Example: | ||
+ | |||
− | = | + | {| class="wikitable" |
− | + | !Name | |
+ | !Number_of_genes | ||
+ | !Aligned_reads | ||
+ | !Complete_hit | ||
+ | !Only_start | ||
+ | !Only_end | ||
+ | !Number_alignments_without_hit_to_gene | ||
+ | !Duplicate_hits | ||
+ | |- | ||
+ | |chr1.232_genes_vs_truncated_10.100.1sd_s2_0 | ||
+ | |232 | ||
+ | |2670 | ||
+ | |2655 | ||
+ | |8 | ||
+ | |7 | ||
+ | |0 | ||
+ | |0 | ||
+ | |- | ||
+ | |non_existing_condition | ||
+ | | - | ||
+ | | - | ||
+ | | - | ||
+ | | - | ||
+ | | - | ||
+ | | - | ||
+ | | - | ||
+ | |} | ||
+ | |||
+ | Here, the large majority of produced reads aligns completely inside the given 232 genes (2655 out of 2670 reads). 8 reads align only with their start and then stretch outside the gene, 7 reads start before the gene and then end inside the gene. 0 of the reads align somewhere else, and all reads align uniquely. | ||
+ | |||
+ | Here's one way to have all tables in one big table: | ||
+ | |||
+ | # first, get the header | ||
+ | head -n 1 some_results/gff_compared_to_soap_results.txt > Sorted_final_results.txt | ||
+ | # Append the sorted results from all other tables | ||
+ | find . -name gff_compared_to_soap_results.txt -exec tail -n +2 {} \; | sort >> Sorted_final_results.txt | ||
== FAQ == | == FAQ == | ||
Line 104: | Line 181: | ||
== Reference == | == Reference == | ||
* Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 764-770. | * Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 764-770. | ||
− | |||
Back to [[Main_Page]] | Back to [[Main_Page]] |
Latest revision as of 07:07, 2 May 2018
INTRODUCTION
Presence/Absence variants (PAVs) have recently been shown to be abundant in plant and fungi genomes. Genome wide PAV detection is usually performed during the construction of pangenomes for species, using either the whole genome assembly and comparison or the iterative mapping and assembly approach. These rely on the presence of at least one reference genome, and can be confounded by variations in assembly and annotation quality.
We have established a Present/Absent Kmer Analysis Pipeline (PAKAP). By reducing each read to component k-mers and comparing the relative abundance of these sub-sequences, we overcome statistical limitations of whole read comparative analysis.
PAKAP consists of a series of scripts written in Perl, Python and Bash scripts and requires Jellyfish [Marcais 2011] as well as optionally SOAPaligner. The scripts are freely available for non-commercial use.
Contents
What PAKAP depends on
- Jellyfish for fast kmer counting
- Some non-standard Perl modules:
- BioPerl
- Bio::SeqIO
- Bio::SearchIO
- Parallel::ForkManager
- Statistics::Descriptive
- Config::IniFiles
- GD::Graph::linespoints (for the script identifyKmerSize)
- BioPerl
- Python 2.7. The pipeline runs under Python 3, but is not tested extensively.
- BioPython
- Optional: SOAPaligner
Download
- Latest Version 1.7:
- Example data (244MB)
How to install
- Download the package from above, then just extract the files to a directory of your choosing. Remember the directory, since you'll need that in the next step.
How to prepare the pipeline
IMPORTANT: Currently, only single-line FASTA files are accepted by the pipeline, in other words, one line per identifier, one line per sequence. If you have FASTQ files, the included script FixFASTQ.py can transform them into the needed format. Usage:
python FixFASTQ.py your_reads.fastq single_line_reads.fasta
Create your project configuration file by using the example config file. Here it is:
[PIPELINE] # This should be the path to the directory containing the pipeline scripts. SCRIPTS_PATH = /scratch/pipeline # OPTIONAL: Path to reference REFERENCE_PATH = /scratch/temp/ref.fasta # OPTIONAL: Path to gene annotation on reference GFF_PATH = /scratch/temp/ref.gff3 [JELLYFISH] # The path to the local installation of jellyfish (http://www.cbcb.umd.edu/software/jellyfish/) PATH = /scratch/bin # The minimum k-mer size will be checked when KMER_SIZE=AUTO. Default is 5 MIN_KMER_SIZE=5 # The maximum k-mer size will be checked when KMER_SIZE=AUTO, jellyfish limits to the maximum k-mer size to 31. Default is 22 MAX_KMER_SIZE=22 [IDENTIFYKMERSIZE] # Number of CPU will be used. Be aware of that the total amount of memory will be shared among all CPUs. NUM_OF_PROCESSOR=4 # treatment 1 ID which will be used for naming files T1_ID=CON # treatment 2 ID which will be used for naming files T2_ID=STE # The output directory of the initial kmer counting, for jellyfish OUT_DIR=/scratch/somewhere # directory storing treatment 1 data files. The data files can be fasta or fastq formats. DATA_DIR_T1=/scratch/somewhere/whole/whole_2.150.1sd/ # directory storing treatment 2 data files. The data files can be fasta or fastq formats. DATA_DIR_T2=/scratch/somewhere/truncated/truncated_2.150.1sd/ [ADVANCED] # Important setting in jellyfish for tuning jellyfish performance. A larger hash size, more memory will be used # but less sub-count files will be generated. Default is 10000000. JELLYFISH_HASH_SIZE=10000000 # Size of the jellyfish hash table. Use a size large enough to contain all of the K-mers such that 80% * s > number of distinct K-mers. [Default: 10000000]. #jellyfish_hash_size = 16G # Length of counter in the hash table. See jellyfish manual for explanation. [Default: 4] JELLYFISH_COUNTER_BITS = 18 # Number of bases used for splitting the k-mer files into sub-files. A larger number reduces RAM usage. Maximum at the moment is 3. [Default: 2] N_SPLIT_BASES = 3 # Minimum occurrence count for a k-mer to be considered as a candidate Presence / Absence K-mer. [Default: 4] MIN_OCC = 4
The most important settings are:
- SCRIPTS_PATH - this is where you extracted the pipeline
- REFERENCE_PATH - optional: if you have a reference fasta. SOAPaligner will be run if this is specified.
- GFF_PATH - optional: The SOAP results will be compared against genes if this is specified.
- JELLYFISH / PATH - the directory in which Jellyfish is located
- IDENTIFYKMERSIZE / OUT_DIR - the dirctory in which temporary kmers are stored (might get huge)
- DATA_DIR_T1 and DATA_DIR_T2 - where the fasta files for kmer counting are located
How to run the pipeline
The pipeline takes four arguments: 's1' specifies reads from the first group (or treatment etc.), 's2' specifies reads from the second group, 'c' is the path to the configuration file, and 'o' specifies the path to the directory which will contain all results. Per condition s1 or s2, unlimited files are possible.
The command is:
python pipeline.py --s1 ./truncated/truncated_2.150.1sd/truncated_2.150.1sd.R1.fasta ./truncated/truncated_2.150.1sd/truncated_2.150.1sd.R2.fasta --s2 ./whole/whole_2.150.1sd/whole_2.150.1sd.R1.fasta ./whole/whole_2.150.1sd/whole_2.150.1sd.R2.fasta -c default.config -o output_folder/
or, easier:
python pipeline.py --s1 ./truncated/truncated_2.150.1sd/truncated_2.150.1sd.R?.fasta --s2 ./whole/whole_2.150.1sd/whole_2.150.1sd.R?.fasta -c default.config -o output_folder/
This will read the configuration from default.config and generate all output files, as well as a log-file, in output_folder.
How to interpret the results
You can download the results of the sample data [link_here here].
In the given out_dir, between two and four folders will be created.
1. kmers - these are the k-mers present in s1 only, and in s2 only. K-mers present in s1 only start with s1.
2. PARs - this folder contains the reads present only in one of the two conditions:
2.1 s1_0_PARs_single.fasta - reads only present in s1
2.2 s2_0_PARs_single.fasta - reads only present in s2
3. SOAPs - optional, only when reference is specified: The SOAP results for aligning the reads only present in s1 or s2. Can be converted to SAM/BAM using SOAPaligner's soap2sam.pl and samtools view.
4. Results - optional, only when gff file and reference are specified: A table detailing how many genes there are in the given gff-file, how many reads were produced by the pipeline, how many reads could be aligned, how many reads hit inside the genes in the gff3 file, how many reads hit only with their start or their end inside genes, how many reads hit multiple times etc. Conditions that have no reads produced (like in the example data) are listed like with '-'. Example:
Name | Number_of_genes | Aligned_reads | Complete_hit | Only_start | Only_end | Number_alignments_without_hit_to_gene | Duplicate_hits |
---|---|---|---|---|---|---|---|
chr1.232_genes_vs_truncated_10.100.1sd_s2_0 | 232 | 2670 | 2655 | 8 | 7 | 0 | 0 |
non_existing_condition | - | - | - | - | - | - | - |
Here, the large majority of produced reads aligns completely inside the given 232 genes (2655 out of 2670 reads). 8 reads align only with their start and then stretch outside the gene, 7 reads start before the gene and then end inside the gene. 0 of the reads align somewhere else, and all reads align uniquely.
Here's one way to have all tables in one big table:
# first, get the header head -n 1 some_results/gff_compared_to_soap_results.txt > Sorted_final_results.txt # Append the sorted results from all other tables find . -name gff_compared_to_soap_results.txt -exec tail -n +2 {} \; | sort >> Sorted_final_results.txt
FAQ
Reference
- Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 764-770.
Back to Main_Page