DiffKAP
Latest revision as of 08:26, 14 August 2017
Next-generation DNA sequencing technologies such as RNA-Seq currently dominate genome-wide gene expression studies. A standard approach to analysing these data is to map sequence reads to a reference and count the number of reads that map to each gene. However, for many transcriptome studies a suitable reference genome is unavailable, especially for meta-transcriptome studies, which assay gene expression from mixed populations of organisms. Where a reference is unavailable, one can be generated by de novo assembly of the sequence reads. However, accurate assembly of such data is challenging, especially for meta-transcriptome data, and the resulting assemblies frequently suffer from collapsed regions or chimeric sequences.
With the lack of reference assemblies currently limiting meta-transcriptome studies, we have established a Differential k-mer Analysis Pipeline (DiffKAP) for gene expression analysis, which does not require the generation of a reference for read mapping. By reducing each read to component k-mers and comparing the relative abundance of these sub-sequences, we overcome statistical limitations of whole read comparative analysis.
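The idea of reducing each read to its component k-mers and counting their abundance per treatment can be illustrated with a short Python sketch. This is not DiffKAP's code (the pipeline relies on Jellyfish for counting); the function names, the toy reads and k = 4 are assumptions chosen for illustration, and strand handling is ignored.

```python
from collections import Counter


def kmers(read, k):
    """Return every overlapping k-mer of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Count k-mer occurrences across all reads of one treatment."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Toy data (hypothetical): two reads per treatment, k = 4.
t1_counts = count_kmers(["ACGTACGT", "ACGTTTGA"], 4)
t2_counts = count_kmers(["ACGTACGA", "TTGACCAA"], 4)
```

Comparing `t1_counts[kmer]` with `t2_counts[kmer]` for the k-mers of a read then gives a per-read view of relative abundance without any reference sequence.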
The DiffKAP application consists of a series of Perl and Linux shell scripts. It requires Jellyfish [Marcais 2011] and BLASTx, as well as access to a BLAST-formatted protein database. The scripts are freely available for non-commercial use.
What does DiffKAP depend on?
DiffKAP depends on the following:
- Jellyfish for fast k-mer counting
- blastx for sequence alignment
- Some non-standard Perl modules:
  - bioperl
    - Bio::SeqIO
    - Bio::SearchIO
  - Parallel::ForkManager
  - Statistics::Descriptive
  - Config::IniFiles
  - GD::Graph::linespoints (for the script identifyKmerSize)
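Before installing, it can help to confirm that these dependencies are visible on your system. The sketch below is not part of DiffKAP. It assumes the executables are named jellyfish, blastx and perl on your PATH, and it tests each Perl module by asking perl to load it with the -M switch; the helper names are hypothetical.

```python
import shutil
import subprocess

TOOLS = ["jellyfish", "blastx", "perl"]  # assumed executable names
PERL_MODULES = [
    "Bio::SeqIO",
    "Bio::SearchIO",
    "Parallel::ForkManager",
    "Statistics::Descriptive",
    "Config::IniFiles",
    "GD::Graph::linespoints",
]


def find_tools(names):
    """Map each executable name to its path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


def perl_module_available(module, perl="perl"):
    """Return True if `perl -MModule -e 1` exits successfully."""
    try:
        result = subprocess.run(
            [perl, f"-M{module}", "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False  # perl itself could not be started
    return result.returncode == 0


if __name__ == "__main__":
    for name, path in find_tools(TOOLS).items():
        print(f"{name}: {path or 'NOT FOUND'}")
    for module in PERL_MODULES:
        status = "ok" if perl_module_available(module) else "MISSING"
        print(f"{module}: {status}")
```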
Download
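- Latest Version 0.9 (23/09/2013):
  - DiffKAP package: http://appliedbioinformatics.com.au/download/DiffKAP/DiffKAP_0.9.zip
  - Test Data: http://appliedbioinformatics.com.au/download/DiffKAP/DiffKAP_sampleProj_testData.tar.gz
  - Results of the sample data: http://appliedbioinformatics.com.au/download/DiffKAP/sampleProj_results.tar.gz
- Archived Versions: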
How to install?
- Download the DiffKAP package.
- Uncompress it. The archive contains:
  - a DiffKAP setup file
  - a README file
  - a VERSION file
  - an example data folder containing a small subset of a metatranscriptomic dataset
- Read the README.
- Install DiffKAP by executing the setup script: DiffKAP_setup
- Optionally, add the DiffKAP path to $PATH; otherwise use an absolute path when running DiffKAP.
How to run?
- Create your project configuration file by using the example config file in the sample data directory as a template.
- Run the pipeline by passing your config file to DiffKAP as its input argument, for example: DiffKAP ~/sampleProj/sampleProj.cfg
- Results will be generated in [OUT_DIR]/results, where [OUT_DIR] is defined in the config file.
- The processing log is stored in /tmp/DiffKAP.log by default.
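If you drive the pipeline from another script, the invocation above can be wrapped as follows. This is a sketch under assumptions: it presumes the DiffKAP executable is on your PATH, and the helper names are hypothetical. Only the command form (DiffKAP followed by the config file) and the [OUT_DIR]/results location come from the steps above.

```python
import os
import subprocess


def build_command(config_path, executable="DiffKAP"):
    """Return the argument list for running DiffKAP on one config file."""
    return [executable, os.path.expanduser(config_path)]


def results_dir(out_dir):
    """Return the folder where DiffKAP writes results: [OUT_DIR]/results."""
    return os.path.join(os.path.expanduser(out_dir), "results")


def run_diffkap(config_path, executable="DiffKAP"):
    """Run the pipeline and raise CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(config_path, executable), check=True)
```

Pass an absolute path as `executable` if you chose not to add DiffKAP to $PATH.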
How to interpret the results?
- You can download the results of the sample data from the Download section above.
- The script "DiffKAP" generates 4 types of files in the folder [OUT_DIR]/results:
  - 5 DER (differentially expressed read) files with the word 'AllDER' in the filenames. Explanation of some columns:
    - Median-T1: The median k-mer occurrence in Treatment 1 (corresponding to T1_ID in the config file) over all k-mers in the read.
    - Median-T2: Similar to Median-T1 but for Treatment 2.
    - Ratio of Median: The ratio of Median-T1 to Median-T2.
    - CV-T1: The coefficient of variation of the k-mer occurrences in Treatment 1 over all k-mers in the read. It indicates how well Median-T1 represents all k-mers in the read.
    - CV-T2: Similar to CV-T1 but for Treatment 2.
  - 5 annotated DER files with the word 'AnnotatedDER' in the filenames. These files are similar to the 5 DER files above but contain only the annotated DER.
  - A gene-centric summary with the word 'DEG' in the filename:
    - It is a table showing, for each gene, the number of DER from each of the DER files (columns 3-7) annotated to that gene.
    - It lists each gene once.
    - The 'Total' column is the total number of DER annotated to that gene.
    - It is sorted by the 'Total' column in reverse (descending) order.
  - A result summary file with 'summary.log' in the filename.
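The Median, Ratio of Median and CV columns can be reproduced for a single read from its per-k-mer counts. The sketch below is an illustration, not DiffKAP's code: the counts are made up, the names are hypothetical, and it assumes the CV is the sample standard deviation divided by the mean, which may differ from the pipeline's exact definition.

```python
from statistics import mean, median, stdev


def read_stats(kmer_counts):
    """Return (median, CV) of the k-mer occurrences for one read."""
    avg = mean(kmer_counts)
    cv = stdev(kmer_counts) / avg if avg else float("nan")
    return median(kmer_counts), cv


# Hypothetical occurrences of one read's k-mers in each treatment.
t1 = [8, 10, 12, 10, 10]
t2 = [4, 5, 6, 5, 5]

median_t1, cv_t1 = read_stats(t1)
median_t2, cv_t2 = read_stats(t2)
ratio_of_median = median_t1 / median_t2  # Median-T1 over Median-T2
```

A low CV means the k-mer counts within the read agree with each other, so the median is a reasonable summary of that read's abundance.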
FAQ
Reference
- Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 764-770.
Citation
- Rosic N, Kaniewska P, Chan C-K, Ling E, Edwards D, Dove S, Hoegh-Guldberg O: Early transcriptional changes in the reef-building coral Acropora aspera in response to thermal and nutrient stress. BMC Genomics 2014, 15(1):1052
