Difference between revisions of "DiffKAP"

From Applied Bioinformatics Group
(Download)
 
(3 intermediate revisions by 2 users not shown)
Line 3: Line 3:
 
With the lack of reference assemblies currently limiting meta-transcriptome studies, we have established a Differential k-mer Analysis Pipeline (DiffKAP) for gene expression analysis, which does not require the generation of a reference for read mapping. By reducing each read to component k-mers and comparing the relative abundance of these sub-sequences, we overcome statistical limitations of whole read comparative analysis.  
 
With the lack of reference assemblies currently limiting meta-transcriptome studies, we have established a Differential k-mer Analysis Pipeline (DiffKAP) for gene expression analysis, which does not require the generation of a reference for read mapping. By reducing each read to component k-mers and comparing the relative abundance of these sub-sequences, we overcome statistical limitations of whole read comparative analysis.  
  
The DiffKAP application consists of a series of scripts written in Python, Perl and Linux shell scripts and requires Jellyfish [Marcais 2011] and, optionally, SOAPaligner. The scripts are freely available for non-commercial use.
+
The DiffKAP application consists of a series of scripts written in Perl and Linux shell scripts and requires Jellyfish [Marcais 2011] and BLASTx as well as access to a copy of a blast-formatted protein database. The scripts are freely available for non-commercial use.
 +
 
  
 
== What does DiffKAP depend on? ==
 
== What does DiffKAP depend on? ==
Line 10: Line 11:
 
* blastx for sequence alignment
 
* blastx for sequence alignment
 
* Some non-standard Perl modules:
 
* Some non-standard Perl modules:
** BioPerl
+
** bioperl
 
*** Bio::SeqIO
 
*** Bio::SeqIO
 
*** Bio::SearchIO
 
*** Bio::SearchIO
Line 17: Line 18:
 
** Config::IniFiles
 
** Config::IniFiles
 
** GD::Graph::linespoints  (for the script identifyKmerSize)
 
** GD::Graph::linespoints  (for the script identifyKmerSize)
* Some non-standard Python modules:
+
 
** BioPython
 
*** Bio.SeqIO
 
* optional: SOAPaligner v2.21 for mapping reads to a (again, optional) reference
 
  
 
== Download ==
 
== Download ==
 
* Latest Version 0.9 (23/09/2013):
 
* Latest Version 0.9 (23/09/2013):
** [http://appliedbioinformatics.com.au/download/DiffKAP_0.9.zip DiffKAP package]
+
** [http://appliedbioinformatics.com.au/download/DiffKAP/DiffKAP_0.9.zip DiffKAP package]
** [http://appliedbioinformatics.com.au/download/sampleProj_results.tar.gz Results of the sample data]
+
** [http://appliedbioinformatics.com.au/download/DiffKAP/DiffKAP_sampleProj_testData.tar.gz Test Data]
 +
** [http://appliedbioinformatics.com.au/download/DiffKAP/sampleProj_results.tar.gz Results of the sample data]
 
* Archived Versions:
 
* Archived Versions:
 
**
 
**
Line 37: Line 36:
 
** an example data folder containing a small subset of a metatranscriptomic data
 
** an example data folder containing a small subset of a metatranscriptomic data
 
* read the README
 
* read the README
* Install the DiffKAP setup script by executing: DiffKAP_setup (does this exist? - Philipp)
+
* Install the DiffKAP setup script by executing: DiffKAP_setup
 
* *** If you like, you can add the DiffKAP path to $PATH or just use an absolute path for running DiffKAP ***
 
* *** If you like, you can add the DiffKAP path to $PATH or just use an absolute path for running DiffKAP ***
  
 
== How to run? ==
 
== How to run? ==
# Currently, the pipeline assumes that all input files are in FASTA format, with one line per identifier, and one line per sequence. The script "FixFastq.py" transforms FASTQ reads into the needed FASTA format, usage:
 
<nowiki>python FixFastq.py your_input.fastq your_output.fasta</nowiki>
 
 
# Create your project configuration file by using the example config file in the sample data directory as a template.
 
# Create your project configuration file by using the example config file in the sample data directory as a template.
 
# Run the pipeline: Run DiffKAP with your config file as an input argument, for example: DiffKAP ~/sampleProj/sampleProj.cfg
 
# Run the pipeline: Run DiffKAP with your config file as an input argument, for example: DiffKAP ~/sampleProj/sampleProj.cfg
* Optionally, the pipeline can generate SOAP-alignments if the user supplies a reference in the configuration file
+
* Results will be generated in the [OUT_DIR]/results where [OUT_DIR] is defined in the config file.
* Optionally, the pipeline can check the SOAP-alignments against a given gene annotation if the users supplies a gff3 file in the configuration file
 
* Results will be generated in the [OUT_DIR] where [OUT_DIR] is defined in the config file.
 
** These are the output folders:
 
*** PARs: This folder contains the PARs from both libraries in fasta format
 
*** SOAPs: If a reference genome is supplied by the user, this folder contains the alignments of the PAR-sequences to the reference
 
*** Results: If a gff3 file of genes on the reference genome is supplied by the user, this file contains tables of how many genes are hit by reads in the alignment
 
 
* The processing log is stored in /tmp/DiffKAP.log by default.
 
* The processing log is stored in /tmp/DiffKAP.log by default.
 +
  
 
== How to interpret the results? ==
 
== How to interpret the results? ==
 
* You can download the results of the sample data [http://www.appliedbioinformatics.com.au/index.php/DiffKAP#Download here].
 
* You can download the results of the sample data [http://www.appliedbioinformatics.com.au/index.php/DiffKAP#Download here].
* The script "DiffKAP" generates 4 types of files in folder [OUT_DIR]/PARs:
+
* The script "DiffKAP" generates 4 types of files in folder [OUT_DIR]/results:
 
*# 5 DER files with the word 'AllDER' in the filenames. Explanation of some columns:
 
*# 5 DER files with the word 'AllDER' in the filenames. Explanation of some columns:
 
*#* Median-T1: The median k-mer occurrence represented in Treatment 1 (corresponding to T1_ID in the config file) for all kmers in the read.   
 
*#* Median-T1: The median k-mer occurrence represented in Treatment 1 (corresponding to T1_ID in the config file) for all kmers in the read.   
Line 78: Line 70:
 
== Reference ==
 
== Reference ==
 
* Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 764-770.
 
* Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 764-770.
* A paper for the optimal k-mer count?
+
 
 +
 
 +
 
 +
== Citation ==
 +
* Rosic N, Kaniewska P, Chan C-K, Ling E, Edwards D, Dove S, Hoegh-Guldberg O: Early transcriptional changes in the reef-building coral Acropora aspera in response to thermal and nutrient stress. BMC Genomics 2014, 15(1):1052
 +
 
  
  
 
Back to [[Main_Page]]
 
Back to [[Main_Page]]

Latest revision as of 08:26, 14 August 2017

Next-generation DNA sequencing technologies such as RNA-Seq currently dominate genome-wide gene expression studies. A standard approach to analysing these data requires mapping sequence reads to a reference and counting the number of reads that map to each gene. However, for many transcriptome studies a suitable reference genome is unavailable, especially for meta-transcriptome studies, which assay gene expression from mixed populations of organisms. Where a reference is unavailable, one can be generated by de novo assembly of the sequence reads. However, the accurate assembly of such data is challenging, especially for meta-transcriptome data, and the resulting assemblies frequently suffer from collapsed regions or chimeric sequences.

With the lack of reference assemblies currently limiting meta-transcriptome studies, we have established a Differential k-mer Analysis Pipeline (DiffKAP) for gene expression analysis, which does not require the generation of a reference for read mapping. By reducing each read to component k-mers and comparing the relative abundance of these sub-sequences, we overcome statistical limitations of whole read comparative analysis.
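
The idea of reducing reads to k-mers and comparing their abundance between two treatments can be illustrated with a minimal Python sketch. This is not DiffKAP's own code: the pipeline itself uses Jellyfish for counting, and the function names, toy reads and k value below are assumptions chosen for illustration. The sketch also ignores reverse complements.

```python
from collections import Counter


def kmers(read, k):
    """Yield every overlapping substring of length k in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Toy reads for two hypothetical treatments (not real data).
t1_reads = ["ACGTAC", "CGTACG"]
t2_reads = ["ACGTAC"]
k = 4  # illustrative k-mer size only

t1_counts = count_kmers(t1_reads, k)
t2_counts = count_kmers(t2_reads, k)

# For one read, look up the abundance of each of its k-mers in both treatments.
for kmer in kmers("ACGTAC", k):
    print(kmer, t1_counts[kmer], t2_counts[kmer])
```

Because every read contributes several k-mers, each read is described by a set of counts per treatment rather than by a single whole-read match.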

The DiffKAP application consists of a series of Perl and Linux shell scripts. It requires Jellyfish [Marcais 2011] and BLASTx, as well as access to a BLAST-formatted protein database. The scripts are freely available for non-commercial use.


What does DiffKAP depend on?

DiffKAP depends on the following:

  • Jellyfish for fast k-mer counting
  • blastx for sequence alignment
  • Some non-standard Perl modules:
    • bioperl
      • Bio::SeqIO
      • Bio::SearchIO
    • Parallel::ForkManager
    • Statistics::Descriptive
    • Config::IniFiles
    • GD::Graph::linespoints (for the script identifyKmerSize)


Download

  • Latest Version 0.9 (23/09/2013):
    • DiffKAP package: http://appliedbioinformatics.com.au/download/DiffKAP/DiffKAP_0.9.zip
    • Test Data: http://appliedbioinformatics.com.au/download/DiffKAP/DiffKAP_sampleProj_testData.tar.gz
    • Results of the sample data: http://appliedbioinformatics.com.au/download/DiffKAP/sampleProj_results.tar.gz

How to install?

  • Download the DiffKAP package.
  • Uncompress it. The archive contains:
    • a DiffKAP setup file
    • a README file
    • a VERSION file
    • an example data folder containing a small subset of a metatranscriptomic dataset
  • Read the README.
  • Install DiffKAP by executing the setup script: DiffKAP_setup
  • Optionally, add the DiffKAP path to $PATH; otherwise, use an absolute path when running DiffKAP.

How to run?

  1. Create your project configuration file by using the example config file in the sample data directory as a template.
  2. Run the pipeline by passing your config file to DiffKAP as the input argument, for example: DiffKAP ~/sampleProj/sampleProj.cfg
  • Results are written to [OUT_DIR]/results, where [OUT_DIR] is defined in the config file.
  • The processing log is stored in /tmp/DiffKAP.log by default.
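
To locate the results folder programmatically, the config file can be parsed and "results" appended to OUT_DIR. The Python sketch below assumes the config is an INI-style file (suggested by the Config::IniFiles dependency, but not confirmed on this page). Only the key names OUT_DIR and T1_ID come from this page; the section name [project] and all values are hypothetical, so use the example config in the sample data directory as the authoritative template.

```python
import configparser
from pathlib import PurePosixPath

# Hypothetical config text. OUT_DIR and T1_ID are named on this page;
# the [project] section name and the values are assumptions.
example_cfg = """
[project]
T1_ID = control
OUT_DIR = /data/sampleProj/out
"""


def results_dir(cfg_text, section="project"):
    """Return the [OUT_DIR]/results path described in the config text."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key names case-sensitive, e.g. OUT_DIR
    parser.read_string(cfg_text)
    out_dir = parser[section]["OUT_DIR"]
    return PurePosixPath(out_dir) / "results"


print(results_dir(example_cfg))
```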


How to interpret the results?

  • You can download the results of the sample data here: http://www.appliedbioinformatics.com.au/index.php/DiffKAP#Download
  • The script "DiffKAP" generates 4 types of files in the folder [OUT_DIR]/results:
    1. 5 DER files with the word 'AllDER' in the filenames. Explanation of some columns:
      • Median-T1: The median occurrence in Treatment 1 (corresponding to T1_ID in the config file) of all k-mers in the read.
      • Median-T2: Same as Median-T1 but for Treatment 2.
      • Ratio of Median: The ratio of Median-T1 to Median-T2.
      • CV-T1: The coefficient of variation of the Treatment 1 occurrences of all k-mers in the read. It indicates how well Median-T1 represents all k-mers in the read.
      • CV-T2: Same as CV-T1 but for Treatment 2.
    2. 5 annotated DER files with the word 'AnnotatedDER' in the filenames. These files are similar to the 5 DER files above but contain only the annotated DER.
    3. A gene-centric summary with the word 'DEG' in the filename:
      • It is a table showing, for each gene, the number of DER in each of the specific files (columns 3-7) annotated to that gene.
      • Each gene is listed only once.
      • The 'Total' column is the total number of DER annotated to that gene.
      • It is sorted by the 'Total' column in reverse (descending) order.
    4. A result summary file with 'summary.log' in the filename.
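
The per-read columns described above can be reproduced for a single read with a short Python sketch. It is not taken from the DiffKAP scripts: the function names and toy occurrence values are hypothetical, and the use of the sample standard deviation and the handling of a zero median are assumptions.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumed definition)."""
    mean = statistics.mean(values)
    if len(values) < 2 or mean == 0:
        return float("nan")  # undefined for a single value or a zero mean
    return statistics.stdev(values) / mean


def read_stats(t1_occurrences, t2_occurrences):
    """Summarise the k-mer occurrences of one read in both treatments."""
    median_t1 = statistics.median(t1_occurrences)
    median_t2 = statistics.median(t2_occurrences)
    # Assumption: report infinity when Treatment 2 has a zero median.
    ratio = median_t1 / median_t2 if median_t2 else float("inf")
    return {
        "Median-T1": median_t1,
        "Median-T2": median_t2,
        "Ratio of Median": ratio,
        "CV-T1": coefficient_of_variation(t1_occurrences),
        "CV-T2": coefficient_of_variation(t2_occurrences),
    }


# Toy occurrences of the k-mers of one read (hypothetical values).
stats = read_stats([10, 12, 14], [5, 6, 7])
for column, value in stats.items():
    print(column, value)
```

A low CV means the k-mer occurrences within the read are close to one another, so the median is a good summary of the read. A high CV suggests that some k-mers in the read are far more or less abundant than others.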

FAQ


Reference

  • Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 764-770.


Citation

  • Rosic N, Kaniewska P, Chan C-K, Ling E, Edwards D, Dove S, Hoegh-Guldberg O: Early transcriptional changes in the reef-building coral Acropora aspera in response to thermal and nutrient stress. BMC Genomics 2014, 15(1):1052

