DiffKAP
Latest revision as of 08:26, 14 August 2017
Next-generation DNA sequencing technologies such as RNA-Seq currently dominate genome-wide gene expression studies. A standard approach to analysing these data is to map sequence reads to a reference and count the number of reads that map to each gene. However, for many transcriptome studies a suitable reference genome is unavailable, especially for meta-transcriptome studies, which assay gene expression from mixed populations of organisms. Where a reference is unavailable, one can be generated by de novo assembly of the sequence reads. However, accurate assembly of such data is challenging, especially for meta-transcriptome data, and the resulting assemblies frequently suffer from collapsed regions or chimeric sequences.
Because the lack of reference assemblies currently limits meta-transcriptome studies, we have established a Differential k-mer Analysis Pipeline (DiffKAP) for gene expression analysis that does not require a reference to be generated for read mapping. By reducing each read to its component k-mers and comparing the relative abundance of these sub-sequences, we overcome statistical limitations of whole-read comparative analysis.
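The reduction of reads to k-mers can be illustrated with a minimal Python sketch. It is not DiffKAP's own code (the pipeline uses Jellyfish for counting); the function names `kmers` and `count_kmers` and the toy reads are hypothetical, and whether DiffKAP counts k-mers strand-specifically or canonically is not stated here, so the sketch simply counts them as written.

```python
from collections import Counter


def kmers(read, k):
    """Return every overlapping sub-sequence of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Count k-mer occurrences across all reads of one treatment."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


if __name__ == "__main__":
    # Toy reads for two treatments (illustrative data only).
    treatment1 = ["ACGTAC", "ACGTTT"]
    treatment2 = ["ACGTAC"]
    k = 3
    counts_t1 = count_kmers(treatment1, k)
    counts_t2 = count_kmers(treatment2, k)
    for kmer in sorted(counts_t1):
        print(kmer, counts_t1[kmer], counts_t2.get(kmer, 0))
```

Comparing the two count tables k-mer by k-mer is what allows abundance to be compared between treatments without mapping reads to a reference.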
The DiffKAP application consists of a series of Perl and Linux shell scripts. It requires Jellyfish [Marçais 2011] and BLASTx, as well as access to a BLAST-formatted protein database. The scripts are freely available for non-commercial use.
What does DiffKAP depend on?
DiffKAP depends on the following:
- Jellyfish (http://www.cbcb.umd.edu/software/jellyfish) for fast k-mer counting
- blastx for sequence alignment
- Some non-standard Perl modules:
  - bioperl
    - Bio::SeqIO
    - Bio::SearchIO
  - Parallel::ForkManager
  - Statistics::Descriptive
  - Config::IniFiles
  - GD::Graph::linespoints (for the script identifyKmerSize)
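Before installing, it can help to confirm that the dependencies listed above are present. The following is a minimal sketch, not part of DiffKAP: it assumes the executables are named `jellyfish`, `blastx` and `perl` on your system (older BLAST installations may expose blastx differently), and it tests each Perl module by asking `perl` to load it. The helper names are hypothetical.

```python
import shutil
import subprocess

# Assumed executable names; adjust them to match your installation.
EXECUTABLES = ["jellyfish", "blastx", "perl"]
PERL_MODULES = [
    "Bio::SeqIO",
    "Bio::SearchIO",
    "Parallel::ForkManager",
    "Statistics::Descriptive",
    "Config::IniFiles",
    "GD::Graph::linespoints",
]


def has_executable(name):
    """True if the executable can be found on $PATH."""
    return shutil.which(name) is not None


def has_perl_module(module):
    """True if `perl -MModule -e 1` succeeds, i.e. the module loads."""
    try:
        result = subprocess.run(
            ["perl", f"-M{module}", "-e", "1"],
            capture_output=True,
            check=False,
        )
    except FileNotFoundError:  # perl itself is not installed
        return False
    return result.returncode == 0


def find_missing(executables, perl_modules,
                 exe_check=has_executable, module_check=has_perl_module):
    """Return the names of all dependencies that could not be found."""
    missing = [exe for exe in executables if not exe_check(exe)]
    missing += [mod for mod in perl_modules if not module_check(mod)]
    return missing


if __name__ == "__main__":
    missing = find_missing(EXECUTABLES, PERL_MODULES)
    print("Missing dependencies:", ", ".join(missing) if missing else "none")
```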
 
Download
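- Latest Version 0.9 (23/09/2013):
  - DiffKAP package: http://appliedbioinformatics.com.au/download/DiffKAP/DiffKAP_0.9.zip
  - Test Data: http://appliedbioinformatics.com.au/download/DiffKAP/DiffKAP_sampleProj_testData.tar.gz
  - Results of the sample data: http://appliedbioinformatics.com.au/download/DiffKAP/sampleProj_results.tar.gz
- Archived Versions: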
 
How to install?
- Download the DiffKAP package (see Download above).
- Uncompress it. The package contains:
  - a DiffKAP setup file
  - a README file
  - a VERSION file
  - an example data folder containing a small subset of a metatranscriptomic data set
- Read the README.
- Install DiffKAP by executing the setup script: DiffKAP_setup
- Optionally, add the DiffKAP path to $PATH; otherwise, use an absolute path when running DiffKAP.
 
How to run?
1. Create your project configuration file, using the example config file in the sample data directory as a template.
2. Run DiffKAP with your config file as the input argument, for example: DiffKAP ~/sampleProj/sampleProj.cfg

- Results are generated in [OUT_DIR]/results, where [OUT_DIR] is defined in the config file.
- By default, the processing log is stored in /tmp/DiffKAP.log.
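As a rough sketch of how a wrapper script might launch the pipeline and locate its output, the Python below builds the command line shown above and derives [OUT_DIR]/results from the config text. It rests on assumptions: that the config file uses INI syntax (suggested only by the Config::IniFiles dependency) and that OUT_DIR appears as a key in some section. The section name `[project]` in the example and both function names are hypothetical; the example config shipped with the sample data remains the authoritative template.

```python
import configparser
from pathlib import Path


def diffkap_command(config_path):
    """Build the command line: DiffKAP <config file>."""
    return ["DiffKAP", str(Path(config_path).expanduser())]


def results_dir(config_text):
    """Return [OUT_DIR]/results, with OUT_DIR read from INI-style text.

    Assumes INI syntax; every section is searched for an OUT_DIR key,
    so no particular section name is required.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names case-sensitive
    parser.read_string(config_text)
    for section in parser.sections():
        if "OUT_DIR" in parser[section]:
            return Path(parser[section]["OUT_DIR"]) / "results"
    raise KeyError("OUT_DIR not found in config")


if __name__ == "__main__":
    # Hypothetical config fragment, for illustration only.
    example = "[project]\nOUT_DIR = /data/sampleProj/out\n"
    print(diffkap_command("~/sampleProj/sampleProj.cfg"))
    print(results_dir(example))
```

The command list could be passed to `subprocess.run` once DiffKAP is installed and on $PATH.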
 
How to interpret the results?
- You can download the results of the sample data from the Download section.
- The script "DiffKAP" generates 4 types of files in the folder [OUT_DIR]/results:
  1. 5 DER files with the word 'AllDER' in the filenames. Explanation of some columns:
     - Median-T1: the median occurrence in Treatment 1 (corresponding to T1_ID in the config file) of all k-mers in the read.
     - Median-T2: as Median-T1, but for Treatment 2.
     - Ratio of Median: the ratio of Median-T1 to Median-T2.
     - CV-T1: the coefficient of variation of the occurrences in Treatment 1 of all k-mers in the read. It indicates how well Median-T1 represents all k-mers in the read.
     - CV-T2: as CV-T1, but for Treatment 2.
  2. 5 annotated DER files with the word 'AnnotatedDER' in the filenames. These are similar to the 5 DER files above but contain only the annotated DER.
  3. A gene-centric summary with the word 'DEG' in the filename:
     - It is a table showing, for each gene, the number of DER in each of the specific files (columns 3-7) annotated to that gene.
     - Each gene is listed once.
     - The 'Total' column is the total number of DER annotated to the gene.
     - It is sorted by the 'Total' column in reverse order.
  4. A result summary file with 'summary.log' in the filename.
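The columns described above can be reproduced on toy data with the following Python sketch. It is an illustration under stated assumptions, not DiffKAP's implementation: the coefficient of variation is taken as sample standard deviation divided by mean, a k-mer absent from a treatment counts as 0, a zero Median-T2 yields an infinite ratio, and "reverse order" in the gene summary is read as largest 'Total' first. All function names and data are hypothetical.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumed definition)."""
    if len(values) < 2 or mean(values) == 0:
        return float("nan")
    return stdev(values) / mean(values)


def read_stats(read, k, counts_t1, counts_t2):
    """Median, ratio of medians and CV of k-mer occurrences for one read."""
    kmer_list = [read[i:i + k] for i in range(len(read) - k + 1)]
    occ_t1 = [counts_t1.get(kmer, 0) for kmer in kmer_list]
    occ_t2 = [counts_t2.get(kmer, 0) for kmer in kmer_list]
    med_t1, med_t2 = median(occ_t1), median(occ_t2)
    if med_t2:
        ratio = med_t1 / med_t2
    else:  # assumption: undefined ratio reported as inf (or nan for 0/0)
        ratio = float("inf") if med_t1 else float("nan")
    return {
        "median_t1": med_t1,
        "median_t2": med_t2,
        "ratio": ratio,
        "cv_t1": coefficient_of_variation(occ_t1),
        "cv_t2": coefficient_of_variation(occ_t2),
    }


def gene_summary(genes_by_file):
    """Tally DER per gene for each file, sorted by total, largest first."""
    labels = list(genes_by_file)
    per_gene = {}
    for index, label in enumerate(labels):
        for gene in genes_by_file[label]:
            per_gene.setdefault(gene, [0] * len(labels))[index] += 1
    rows = [(gene, counts, sum(counts)) for gene, counts in per_gene.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


if __name__ == "__main__":
    t1 = {"ACG": 10, "CGT": 12, "GTA": 14}
    t2 = {"ACG": 5, "CGT": 6, "GTA": 7}
    print(read_stats("ACGTA", 3, t1, t2))
    print(gene_summary({"file1": ["geneA", "geneB", "geneA"], "file2": ["geneA"]}))
```

A low CV means the k-mer occurrences within a read are tightly clustered, so the median is a fair summary of that read; a high CV suggests the median should be treated with more caution.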
 
FAQ
Reference
- Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 764-770.
 
Citation
- Rosic N, Kaniewska P, Chan C-K, Ling E, Edwards D, Dove S, Hoegh-Guldberg O: Early transcriptional changes in the reef-building coral Acropora aspera in response to thermal and nutrient stress. BMC Genomics 2014, 15(1):1052
 