Difference between revisions of "SGSGeneLoss"

From Applied Bioinformatics Group

Revision as of 01:52, 16 June 2014

What does SGSGeneLoss depend on?

SGSGeneLoss depends on the following:

Download

  • Latest Version 0.1 (29/04/2014):
    • SGSGeneLoss.v0.1.tar.gz should contain
      • three main programs: SGSGeneLoss.v0.1.jar, graph_chromosomes.v0.1.R, graph_circles.v0.1.R
      • readme file
      • folder with source code

From now on in this manual SGSGeneLoss.v0.1.jar, graph_chromosomes.v0.1.R, graph_circles.v0.1.R are referred to as SGSGeneLoss.jar, graph_chromosomes.R, graph_circles.R

To run the programs you have to use full names SGSGeneLoss.v0.1.jar, graph_chromosomes.v0.1.R, graph_circles.v0.1.R

How to install?

  • Download SGSGeneLoss.tar.gz
  • Unpack SGSGeneLoss.tar.gz and place SGSGeneLoss.jar and all the R scripts in a directory of your choice, for example ./my_geneloss
  • Move into ./my_geneloss and create the SGSGeneLoss_lib directory (on Linux: cd ./my_geneloss; mkdir SGSGeneLoss_lib)
    • The name of the lib directory is the name of the .jar file without .jar extension + _lib, so if you are using SGSGeneLoss.v0.1.jar the lib directory is: SGSGeneLoss.v0.1_lib
  • Download picard-tools (SGSGeneLoss was tested with picard-tools 1.89)
  • Place picard-1.89.jar and sam-1.89.jar in ./my_geneloss/SGSGeneLoss_lib
  • Now you are ready to run SGSGeneLoss
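The installation steps above can be sketched as a small shell function. This is only an illustration, not part of the SGSGeneLoss distribution: the function name setup_geneloss_lib and its two arguments are assumptions, and it presumes that picard-1.89.jar and sam-1.89.jar have already been downloaded into one directory.

```shell
# Sketch: create the "<jar name without .jar>_lib" directory next to the jar
# and copy the two picard-tools jars into it.
# The function name and argument order are hypothetical.
setup_geneloss_lib() {
    local jar_path="$1"     # e.g. ./my_geneloss/SGSGeneLoss.v0.1.jar
    local picard_dir="$2"   # directory holding picard-1.89.jar and sam-1.89.jar
    local jar_dir jar_name lib_dir

    jar_dir="$(dirname "$jar_path")"
    jar_name="$(basename "$jar_path" .jar)"   # strips only the .jar extension
    lib_dir="${jar_dir}/${jar_name}_lib"

    mkdir -p "$lib_dir" || return 1
    cp "${picard_dir}/picard-1.89.jar" "${picard_dir}/sam-1.89.jar" "$lib_dir"/ || return 1

    # Print the directory that was prepared so the caller can check it.
    echo "$lib_dir"
}
```

For example, setup_geneloss_lib ./my_geneloss/SGSGeneLoss.v0.1.jar ./picard-tools-1.89 would prepare ./my_geneloss/SGSGeneLoss.v0.1_lib (the picard-tools directory name here is a placeholder).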

Input and output files for SGSGeneLoss.jar

  • Input files:
    • Sorted, indexed .bam file with sequencing reads mapped to the reference genome sequence, multiple .bam files can be provided as comma separated list
    • Gff3 file with reference genome annotation, has to contain gene, mRNA and exon fields
  • Output files
    • Result files for each chromosome separately - .excov
    • File with overall stats - stats.csv
    • File with summary for all the chromosomes used - chrs.csv (this file is used by one of the R scripts)
    • File with list of genes lost for all the chromosomes - graph.csv (this file is used by one of the R scripts)

Command line options for SGSGeneLoss.jar

Required:

bamPath - path to the directory with your .bam file/files, has to end with / or \ (example: bamPath=/home/my_bams/)

bamFileList - a single .bam file or a comma separated list, file names only; the .bam files and the corresponding .bai files have to be in the directory provided in bamPath (example: bamFileList=bam1.bam,bam2.bam)

gffFile - location of the gff3 file (example: gffFile=/home/my_gffs/annot.gff3)

outDirPath - location of the output directory, has to end with / or \ (example: outDirPath=/home/my_results/)

Optional:

minCov - minimal coverage threshold to consider position covered [minCov=1]

chromosomeList - comma separated list of chromosomes to be used for analysis, use "all" for all chromosomes [chromosomeList=all]

lostCutoff - coverage cutoff to consider gene as lost for calculating stats [lostCutoff=0.0]

covCats - coverage categories for visualization [covCats=0,10,20,30,40,70]

extendedFmt - use extended format, additional info is included in output files [default: regular format]

To see help run: java -jar SGSGeneLoss.jar help

Sample command

  • Move into directory where SGSGeneLoss.jar is
  • Please make sure that all your supplied paths end with / or \
java -Xmx4g -jar SGSGeneLoss.jar bamPath=/home/uqagnieszka/bams/ bamFileList=arabidopsis.sorted.bam gffFile=/home/gff_files/Athaliana_167_gene_exons.gff3 outDirPath=/home/uqagnieszka/results/ chromosomeList=all
java -Xmx4g -jar SGSGeneLoss.jar bamPath=/home/uqagnieszka/bams/ bamFileList=arabidopsis.sorted.bam,arabidopsis2.sorted.bam gffFile=/home/gff_files/Athaliana_167_gene_exons.gff3 outDirPath=/home/uqagnieszka/results/ chromosomeList=Chr1,Chr2 minCov=2 lostCutoff=0.05 covCats=0,2,5,10,20 extendedFmt
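Because bamPath and outDirPath have to end with / or \, it can help to normalise the paths before building the command. The following is a minimal sketch under that assumption; the helper names ensure_trailing_slash and geneloss_cmd are hypothetical, and geneloss_cmd only prints the command line (with the required options and the -Xmx4g setting from the sample commands) rather than running it.

```shell
# Append "/" to a path unless it already ends with "/" or "\".
ensure_trailing_slash() {
    case "$1" in
        */|*\\) printf '%s\n' "$1" ;;
        *)      printf '%s/\n' "$1" ;;
    esac
}

# Print (do not run) a SGSGeneLoss command line using the required options.
# Arguments: bam directory, bam file list, gff3 file, output directory.
geneloss_cmd() {
    local bam_path out_dir
    bam_path="$(ensure_trailing_slash "$1")"
    out_dir="$(ensure_trailing_slash "$4")"
    printf 'java -Xmx4g -jar SGSGeneLoss.jar bamPath=%s bamFileList=%s gffFile=%s outDirPath=%s\n' \
        "$bam_path" "$2" "$3" "$out_dir"
}
```

Printing the command first makes it easy to check the paths before starting a long run; optional settings such as chromosomeList or minCov can be appended to the printed line.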

Output files format

All the output files are comma separated text files.

  • .excov files - files with results for each chromosome (file names use the chromosome names from the .bam files); they come in two formats: basic (default) or extended (extendedFmt)
    • basic format: chromosome,ID,is_lost,start_position,end_postion,frac_exons_covered,frac_gene_covered,ave_cov_depth_exons,cov_cat,ave_cove_depth_gene
    • extended format: contains additional columns with information about each of the exons
  • stats.csv - file with summary information about all genes
  • chrs.csv - file with summary information about chromosomes
    • chr,start,end,len
  • graph.csv - file with list of genes lost as determined by lostCutoff
    • chr,id,start,end
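Since graph.csv lists one lost gene per line as chr,id,start,end, a quick per-chromosome count can be obtained with awk. This is a sketch, not part of SGSGeneLoss: the function name count_lost_per_chr is hypothetical, and because the manual does not say whether the file starts with a header row, the code skips a first line whose first field is literally "chr" as a precaution.

```shell
# Count lost genes per chromosome in a graph.csv-style file (chr,id,start,end).
# Output: one "chromosome,count" line per chromosome, sorted by name.
count_lost_per_chr() {
    awk -F',' '
        NR == 1 && $1 == "chr" { next }   # assumption: skip a header row if present
        NF >= 4 { n[$1]++ }               # count only complete records
        END { for (c in n) print c "," n[c] }
    ' "$1" | LC_ALL=C sort
}
```

For example, count_lost_per_chr /home/uqagnieszka/results/graph.csv prints one line per chromosome that has at least one lost gene.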

Plotting results

Results are visualized using R scripts.

Two ways of visualization are possible:

  • results per chromosome
  • results for all chromosomes as a circular graph

Results per chromosome:

What you need:

  • scripts graph_chromosomes.R, graph_main.R in the same directory
  • .excov files (either basic or extended) with results from SGSGeneLoss.jar: Chr1.excov, Chr2.excov etc.
  • directory (location) where files with results from SGSGeneLoss.jar: Chr1.excov, Chr2.excov etc. can be found
  • file listing all the result files for which you want graphs drawn, one per line - for example graph_list.txt file which looks like this:
Chr1.excov
Chr2.excov
Chr3.excov

graph_chromosomes.R takes three arguments in this order:

1. location of the directory where the .excov files are located

2. file listing all the result files for which you want graphs drawn

3. gene loss cutoff

Rscript --vanilla graph_chromosomes.R /home/uqagnieszka/results /home/uqagnieszka/results/graph_list.txt 0.0
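The list file passed as the second argument can be written by hand or generated from the results directory. The sketch below assumes that every .excov file in the directory should be plotted; the function name make_graph_list is hypothetical.

```shell
# Write the names of all .excov files in a results directory to a list file,
# one file name per line (names only, without the directory part).
make_graph_list() {
    local results_dir="$1" list_file="$2" f
    for f in "$results_dir"/*.excov; do
        # Skip the unexpanded pattern when the directory has no .excov files.
        if [ -e "$f" ]; then
            basename "$f"
        fi
    done > "$list_file"
}
```

For example, make_graph_list /home/uqagnieszka/results /home/uqagnieszka/results/graph_list.txt produces the list file used in the Rscript command above. The order of lines follows the shell's glob ordering, so edit the file if only some chromosomes are wanted.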

Summary results for all chromosomes, possibly multiple samples:

What you need:

  • script graph_circles.R
  • graph.csv from SGSGeneLoss.jar run
  • chrs.csv from SGSGeneLoss.jar run
  • file assigning a numeric order to chromosomes (this is needed because some chromosomes have complicated names and sorting in ASCII order does not always work); chromosome names will be replaced by the corresponding numbers, and the file should look like this:
chrs,no
chr1,1
chr2,2
chr10,10

graph_circles.R takes four arguments in this order:

1. file with chromosome info - chrs.csv from SGSGeneLoss.jar run

2. file with chromosome order

3. file with genes lost - graph.csv from SGSGeneLoss.jar run; it can be a comma separated list of multiple files (for example multiple samples). Circles are drawn in the order of the files, starting from the inside:

the first file in the list is the innermost circle, so if you have graph1.csv,graph2.csv,graph3.csv, graph1.csv is drawn innermost and graph3.csv outermost

4. output file

Rscript --vanilla graph_circles.R chrs.csv chrs_order.csv graph1.csv,graph2.csv,graph3.csv out.png
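A starting point for the chromosome order file (chrs_order.csv in the command above) can be derived from chrs.csv. This sketch numbers chromosomes in the order they first appear in chrs.csv, which is an assumption and may not match the numbering you want (for instance chr10 would get the next free number rather than 10), so review the result before plotting. The function name make_chr_order is hypothetical, and a first line whose first field is literally "chr" is treated as a header and skipped.

```shell
# Build a "chrs,no" order file from a chrs.csv-style file (chr,start,end,len).
# Chromosomes are numbered 1, 2, 3, ... in order of first appearance.
make_chr_order() {
    awk -F',' '
        BEGIN { print "chrs,no" }
        NR == 1 && $1 == "chr" { next }   # assumption: skip a header row if present
        NF > 0 && !($1 in seen) {
            seen[$1] = 1
            n++
            print $1 "," n
        }
    ' "$1"
}
```

For example, make_chr_order chrs.csv > chrs_order.csv writes the file, which can then be edited by hand to change the numbering.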

FAQ

  • If memory consumption is a problem please consider increasing -Xmx or splitting your .bam files

