From Applied Bioinformatics Group

Next-generation sequencing applications such as RNA-Seq currently dominate genome-wide gene expression studies. A standard approach to analysing these data is to map sequence reads to a reference and count the number of reads that map to each gene. However, for many transcriptome studies a suitable reference genome is unavailable, especially for meta-transcriptome studies, which assay gene expression from mixed populations of organisms. Where a reference is unavailable, one can be generated by de novo assembly of the sequence reads. However, accurate assembly of such data is challenging, especially for meta-transcriptome data, and the resulting assemblies frequently suffer from collapsed regions or chimeric sequences.

Because the lack of reference assemblies currently limits meta-transcriptome studies, we have established a Differential k-mer Analysis Pipeline (DiffKAP) for gene expression analysis, which does not require the generation of a reference for read mapping. By reducing each read to its component k-mers and comparing the relative abundance of these sub-sequences, we overcome statistical limitations of whole-read comparative analysis.
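The idea of reducing reads to k-mers and comparing their abundance between treatments can be sketched in a few lines of Python. This is only an illustration: DiffKAP itself uses Jellyfish for counting, and the reads, the value of k and the function names below are invented for the example. Normalisation between treatments is omitted.

```python
from collections import Counter


def kmers(read, k):
    """Yield every overlapping k-mer in a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads, k):
    """Count k-mer occurrences over all reads of one treatment."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


# Toy reads, invented for illustration only.
treatment1 = ["ACGTACGT", "ACGTTTGA"]
treatment2 = ["ACGTACGA"]
k = 4  # hypothetical k-mer size

counts_t1 = count_kmers(treatment1, k)
counts_t2 = count_kmers(treatment2, k)

# Compare the abundance of each k-mer of one read between the two treatments.
for kmer in kmers("ACGTACGT", k):
    print(kmer, counts_t1[kmer], counts_t2[kmer])
```

A `Counter` returns 0 for k-mers it has never seen, so k-mers present in only one treatment need no special handling here.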

The DiffKAP application consists of a series of Perl and Linux shell scripts. It requires Jellyfish [Marçais 2011] and BLASTx, as well as access to a copy of a BLAST-formatted protein database. The scripts are freely available for non-commercial use.

What does DiffKAP depend on?

DiffKAP depends on the following:

  • Jellyfish for fast k-mer counting
  • BLASTx for sequence alignment
  • Some non-standard Perl modules:
    • BioPerl
      • Bio::SeqIO
      • Bio::SearchIO
    • Parallel::ForkManager
    • Statistics::Descriptive
    • Config::IniFiles
    • GD::Graph::linespoints (for the script identifyKmerSize)
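Before installing, it can help to check which of these dependencies are already present. The following is a minimal sketch, not part of DiffKAP; it assumes the executables are named `jellyfish` and `blastx` and are on your $PATH, and it tests each Perl module by asking `perl` to load it.

```python
import shutil
import subprocess

# Executable names are assumptions; adjust them to your installation.
REQUIRED_PROGRAMS = ["jellyfish", "blastx", "perl"]

# Perl modules taken from the dependency list above.
REQUIRED_PERL_MODULES = [
    "Bio::SeqIO",
    "Bio::SearchIO",
    "Parallel::ForkManager",
    "Statistics::Descriptive",
    "Config::IniFiles",
    "GD::Graph::linespoints",
]


def missing_programs(names):
    """Return the programs that cannot be found on $PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_perl_modules(modules, perl="perl"):
    """Return the Perl modules that `perl -MModule -e 1` fails to load."""
    if shutil.which(perl) is None:
        return list(modules)  # without perl, nothing can be loaded
    missing = []
    for module in modules:
        result = subprocess.run(
            [perl, f"-M{module}", "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            missing.append(module)
    return missing


if __name__ == "__main__":
    print("Missing programs:", missing_programs(REQUIRED_PROGRAMS))
    print("Missing Perl modules:", missing_perl_modules(REQUIRED_PERL_MODULES))
```

Empty lists mean that everything was found. GD::Graph::linespoints is only needed for the script identifyKmerSize.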


How to install?

  • Download the DiffKAP package.
  • Uncompress it. The package contains:
    • a DiffKAP setup file
    • a README file
    • a VERSION file
    • an example data folder containing a small subset of a metatranscriptomic data set
  • Read the README.
  • Install DiffKAP by executing the setup script: DiffKAP_setup
  • Optionally, add the DiffKAP path to $PATH; otherwise, use an absolute path when running DiffKAP.

How to run?

  1. Create your project configuration file, using the example config file in the sample data directory as a template.
  2. Run DiffKAP with your config file as the input argument, for example: DiffKAP ~/sampleProj/sampleProj.cfg
  • Results are written to [OUT_DIR]/results, where [OUT_DIR] is defined in the config file.
  • The processing log is stored in /tmp/DiffKAP.log by default.
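To locate the results folder from a script, you can read OUT_DIR back out of the config file. The sketch below assumes an INI-style file (DiffKAP uses Config::IniFiles). The section name `[project]`, the key T2_ID and all the values are hypothetical; only OUT_DIR and T1_ID are named on this page. Check the example config file in the sample data directory for the real layout.

```python
import configparser
from pathlib import Path

# Hypothetical config content; section name and values are assumptions.
EXAMPLE_CFG = """
[project]
OUT_DIR = /home/user/sampleProj/output
T1_ID = control
T2_ID = heat
"""


def results_dir(cfg_text, section="project"):
    """Return the [OUT_DIR]/results path described by an INI-style config."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the upper-case key names as written
    parser.read_string(cfg_text)
    return Path(parser[section]["OUT_DIR"]) / "results"


print(results_dir(EXAMPLE_CFG))
```

For a real project, read the file contents first, for example with `Path("sampleProj.cfg").read_text()`, and pass them to `results_dir`.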

How to interpret the results?

  • You can download the results of the sample data here.
  • The script "DiffKAP" generates 4 types of files in the folder [OUT_DIR]/results:
    1. 5 DER (differentially expressed read) files with 'AllDER' in the filenames. Explanation of some columns:
      • Median-T1: The median occurrence in Treatment 1 (corresponding to T1_ID in the config file) of all k-mers in the read.
      • Median-T2: As Median-T1, but for Treatment 2.
      • Ratio of Median: The ratio of Median-T1 to Median-T2.
      • CV-T1: The coefficient of variation of the Treatment 1 occurrences of all k-mers in the read. It indicates how well Median-T1 represents all k-mers in the read.
      • CV-T2: As CV-T1, but for Treatment 2.
    2. 5 annotated DER files with 'AnnotatedDER' in the filenames. These are similar to the 5 DER files above but contain only the annotated DER.
    3. A gene-centric summary with 'DEG' (differentially expressed gene) in the filename:
      • It is a table showing, for each gene, the number of DER in each of the files (columns 3-7) annotated to that gene.
      • Each gene is listed once.
      • The 'Total' column is the total number of DER annotated to the gene.
      • It is sorted by the 'Total' column in reverse order.
    4. A result summary file with 'summary.log' in the filename.
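The per-read columns in the DER files can be reproduced for a single read as follows. This is a sketch of the definitions given above, not DiffKAP's own code. It assumes the sample standard deviation for the coefficient of variation, treats k-mers absent from a treatment as occurring 0 times, and reports an infinite ratio when Median-T2 is 0. The k-mer counts are invented.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (NaN if undefined)."""
    if len(values) < 2:
        return float("nan")
    mean = statistics.mean(values)
    if mean == 0:
        return float("nan")
    return statistics.stdev(values) / mean


def read_stats(read, k, counts_t1, counts_t2):
    """Compute the median, ratio and CV columns for one read."""
    read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    # Occurrence of each k-mer of the read in each treatment (0 if absent).
    occ_t1 = [counts_t1.get(kmer, 0) for kmer in read_kmers]
    occ_t2 = [counts_t2.get(kmer, 0) for kmer in read_kmers]
    median_t1 = statistics.median(occ_t1)
    median_t2 = statistics.median(occ_t2)
    # Assumption: report infinity rather than fail when Median-T2 is 0.
    ratio = median_t1 / median_t2 if median_t2 else float("inf")
    return {
        "Median-T1": median_t1,
        "Median-T2": median_t2,
        "Ratio of Median": ratio,
        "CV-T1": coefficient_of_variation(occ_t1),
        "CV-T2": coefficient_of_variation(occ_t2),
    }


# Invented k-mer counts for a toy read with k = 4.
counts_t1 = {"ACGT": 10, "CGTA": 12, "GTAC": 14}
counts_t2 = {"ACGT": 5, "CGTA": 6, "GTAC": 7}
stats = read_stats("ACGTAC", 4, counts_t1, counts_t2)
print(stats)
```

A low CV means the k-mer occurrences within the read are similar to one another, so the median is a reasonable summary of the whole read.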



  • Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers, Bioinformatics, 27, 764-770.


  • Rosic, N., Kaniewska, P., Chan, C.-K., Ling, E., Edwards, D., Dove, S. and Hoegh-Guldberg, O. (2014) Early transcriptional changes in the reef-building coral Acropora aspera in response to thermal and nutrient stress, BMC Genomics, 15, 1052.
