DiffKAP
Latest revision as of 08:26, 14 August 2017
Next-generation DNA sequencing technologies such as RNA-Seq currently dominate genome-wide gene expression studies. A standard approach to analysing these data requires mapping sequence reads to a reference and counting the number of reads that map to each gene. However, for many transcriptome studies a suitable reference genome is unavailable, especially for meta-transcriptome studies, which assay gene expression from mixed populations of organisms. Where a reference is unavailable, it is possible to generate one by de novo assembly of the sequence reads. However, the accurate assembly of such data is challenging, especially for meta-transcriptome data, and the resulting assemblies frequently suffer from collapsed regions or chimeric sequences.
With the lack of reference assemblies currently limiting meta-transcriptome studies, we have established a Differential k-mer Analysis Pipeline (DiffKAP) for gene expression analysis, which does not require the generation of a reference for read mapping. By reducing each read to component k-mers and comparing the relative abundance of these sub-sequences, we overcome statistical limitations of whole read comparative analysis.
The DiffKAP application consists of a series of Perl and Linux shell scripts. It requires Jellyfish [Marçais 2011] and BLASTx, as well as access to a copy of a BLAST-formatted protein database. The scripts are freely available for non-commercial use.
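To make the idea of reducing each read to its component k-mers concrete, the following is a minimal Python sketch. It is not DiffKAP's own code (DiffKAP uses Jellyfish for counting); the function names and the toy reads are invented for illustration, and the sketch ignores reverse complements.

```python
from collections import Counter


def kmers(read, k):
    """Return every overlapping k-mer (substring of length k) in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def count_kmers(reads, k):
    """Count k-mer occurrences across all reads of one treatment."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Toy reads, invented for illustration only.
treatment1 = ["ACGTAC", "CGTACG"]
treatment2 = ["ACGTAC"]

t1_counts = count_kmers(treatment1, k=4)
t2_counts = count_kmers(treatment2, k=4)

# Compare how often the same k-mer occurs in each treatment.
print(t1_counts["CGTA"], t2_counts["CGTA"])
```

Comparing the per-treatment counts of the k-mers belonging to a read is what allows abundance to be compared without mapping reads to a reference.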
What does DiffKAP depend on?
DiffKAP depends on the following:
- Jellyfish for fast k-mer counting
- blastx for sequence alignment
- Some non-standard Perl modules:
  - bioperl
    - Bio::SeqIO
    - Bio::SearchIO
  - Parallel::ForkManager
  - Statistics::Descriptive
  - Config::IniFiles
  - GD::Graph::linespoints (for the script identifyKmerSize)
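Before installing, it can help to confirm that the external programs are reachable. The sketch below is one possible check, not part of DiffKAP; the executable names `jellyfish`, `blastx` and `perl` are assumptions and may differ on your system.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Executable names are assumptions; adjust them to match your installation.
missing = missing_tools(["jellyfish", "blastx", "perl"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools were found.")
```

A Perl module can be checked separately from the command line, for example with perl -MConfig::IniFiles -e 1, which exits with an error if the module is not installed.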
Download
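- Latest Version 0.9 (23/09/2013):
  - DiffKAP package: http://appliedbioinformatics.com.au/download/DiffKAP/DiffKAP_0.9.zip
  - Test Data: http://appliedbioinformatics.com.au/download/DiffKAP/DiffKAP_sampleProj_testData.tar.gz
  - Results of the sample data: http://appliedbioinformatics.com.au/download/DiffKAP/sampleProj_results.tar.gz
- Archived Versions: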
How to install?
- Download the DiffKAP package.
- Uncompress it. The archive contains:
  - a DiffKAP setup file
  - a README file
  - a VERSION file
  - an example data folder containing a small subset of a metatranscriptomic data set
- Read the README.
- Install DiffKAP by executing the setup script: DiffKAP_setup
- Optionally, add the DiffKAP path to $PATH; otherwise, use an absolute path when running DiffKAP.
How to run?
- Create your project configuration file by using the example config file in the sample data directory as a template.
- Run the pipeline by passing your config file to DiffKAP as an input argument, for example: DiffKAP ~/sampleProj/sampleProj.cfg
- Results will be generated in [OUT_DIR]/results, where [OUT_DIR] is defined in the config file.
- The processing log is stored in /tmp/DiffKAP.log by default.
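As a rough illustration of how [OUT_DIR]/results relates to the config file, the sketch below reads an INI-style config and derives the results folder. It assumes the config is in INI format (suggested by the Config::IniFiles dependency, but not confirmed here); the section name `[Project]` and the values are hypothetical, and only the key names OUT_DIR and T1_ID come from this page. Use the example config file in the sample data directory as the authoritative template.

```python
import configparser
from pathlib import Path

# Hypothetical config text: the section name and values are invented here;
# only the key names OUT_DIR and T1_ID come from the documentation.
EXAMPLE_CFG = """
[Project]
OUT_DIR = /home/user/sampleProj/out
T1_ID = control
"""


def results_dir(cfg_text):
    """Return the [OUT_DIR]/results path described by an INI-style config."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    # The real section layout is unknown, so search every section for OUT_DIR.
    for section in parser.sections():
        if parser.has_option(section, "OUT_DIR"):
            return Path(parser.get(section, "OUT_DIR")) / "results"
    raise KeyError("OUT_DIR not found in any section")


print(results_dir(EXAMPLE_CFG))
```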
How to interpret the results?
- The results of the sample data can be downloaded from the Download section.
- The script "DiffKAP" generates 4 types of files in the folder [OUT_DIR]/results:
  - 5 DER files with the word 'AllDER' in the filenames. Explanation of some columns:
    - Median-T1: The median k-mer occurrence in Treatment 1 (corresponding to T1_ID in the config file) across all k-mers in the read.
    - Median-T2: Similar to Median-T1 but for Treatment 2.
    - Ratio of Median: The ratio of Median-T1 to Median-T2.
    - CV-T1: The coefficient of variation of the k-mer occurrences in Treatment 1 across all k-mers in the read. It indicates how well Median-T1 represents all k-mers in the read.
    - CV-T2: Similar to CV-T1 but for Treatment 2.
  - 5 annotated DER files with the word 'AnnotatedDER' in the filenames. These files are similar to the 5 DER files above but contain only the annotated DER.
  - A gene-centric summary with the word 'DEG' in the filename:
    - It is a table showing, for each gene, the number of DER in each of the specific files (columns 3-7) annotated to that gene.
    - It shows the unique gene list.
    - The 'Total' column is the total number of DER annotated to that gene.
    - It is sorted by the 'Total' column in reverse order.
  - A result summary file with 'summary.log' in the filename.
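The Median, Ratio of Median and CV columns described above can be illustrated with a small Python sketch. This is not DiffKAP's implementation: the counts are invented, the function name is hypothetical, and the use of the population standard deviation for the coefficient of variation is an assumption of this sketch.

```python
import statistics


def read_stats(kmer_counts):
    """Return (median, coefficient of variation) for one read's k-mer counts."""
    median = statistics.median(kmer_counts)
    mean = statistics.mean(kmer_counts)
    # CV = standard deviation / mean; population SD is an assumption here.
    cv = statistics.pstdev(kmer_counts) / mean if mean else float("nan")
    return median, cv


# Invented occurrence counts for the k-mers of a single read.
t1 = [10, 12, 11, 9, 10]
t2 = [5, 5, 6, 4, 5]

median_t1, cv_t1 = read_stats(t1)
median_t2, cv_t2 = read_stats(t2)

# Ratio of Median: Median-T1 divided by Median-T2 (guarding against zero).
ratio = median_t1 / median_t2 if median_t2 else float("inf")
print(median_t1, median_t2, ratio, round(cv_t1, 3), round(cv_t2, 3))
```

A low CV means the k-mer counts within the read are similar to one another, so the median is a fair summary of the read; a high CV suggests the median should be treated with more caution.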
FAQ
Reference
- Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 764-770.
Citation
- Rosic N, Kaniewska P, Chan C-K, Ling E, Edwards D, Dove S, Hoegh-Guldberg O: Early transcriptional changes in the reef-building coral Acropora aspera in response to thermal and nutrient stress. BMC Genomics 2014, 15(1):1052