Difference between revisions of "CoReFinder"

From Applied Bioinformatics Group
Jump to: navigation, search
(Defining the regions of interest)
(Prerequisites)
 
(9 intermediate revisions by the same user not shown)
Line 1: Line 1:
CoReFinder
+
CoReFinder - a pipeline to find collapsed and repetitive regions in genome assemblies
  
 
=Methodology=
 
=Methodology=
Line 13: Line 13:
  
 
==Repeated/non-collapsed region==
 
==Repeated/non-collapsed region==
 +
 
If we map reads to the reference with –r0, there will be regions where reads do not map. This could happen because of a number of reasons: the region is not in the raw data, the region in the genome where we want to map the reads contain N's or tiny repetitive regions or more likely that they represent large duplicated regions in the assembly. Because the reads can map to more than one place, using -r0 will discard them. If we map the same reads to Darmor with –r1, and identify regions empty with –r0 but covered with –r1 then these are duplicated in the assembly. Also, if we map reads with -r2, a duplicated, non-collapsed region should have roughly half of the reads in -r1 compared to -r2, a triplicated region should have roughly a third etc.
 
If we map reads to the reference with –r0, there will be regions where reads do not map. This could happen because of a number of reasons: the region is not in the raw data, the region in the genome where we want to map the reads contain N's or tiny repetitive regions or more likely that they represent large duplicated regions in the assembly. Because the reads can map to more than one place, using -r0 will discard them. If we map the same reads to Darmor with –r1, and identify regions empty with –r0 but covered with –r1 then these are duplicated in the assembly. Also, if we map reads with -r2, a duplicated, non-collapsed region should have roughly half of the reads in -r1 compared to -r2, a triplicated region should have roughly a third etc.
  
 
==Repetitive/collapsed region==
 
==Repetitive/collapsed region==
 +
 
Coverage higher than the average coverage with -r0, -r1 and -r2 will be repetitive/collapsed regions of the assembly. This is because these reads have nowhere else to map in the genome, so they will pile up at the collapsed portion of the assembly.
 
Coverage higher than the average coverage with -r0, -r1 and -r2 will be repetitive/collapsed regions of the assembly. This is because these reads have nowhere else to map in the genome, so they will pile up at the collapsed portion of the assembly.
  
 
==Zero or low coverage regions==
 
==Zero or low coverage regions==
 +
 
Zero or very low coverage with -r0, -r1 and -r2.
 
Zero or very low coverage with -r0, -r1 and -r2.
  
 
=Pipeline=
 
=Pipeline=
 +
==Prerequisites==
 +
 +
1. bedtools from https://github.com/arq5x/bedtools2 - at least version v2.25.0, so that genomecov -d reports positions with 0 reads aligning
 +
 +
2. SOAPaligner from http://soap.genomics.org.cn/soapaligner.html
 +
 +
3. samtools
 +
 +
4. R
 +
 +
5. CoRefinder from Bhavna Hurgobin [http://appliedbioinformatics.com.au/download/compareCoverage_V9_BH.R download]
 +
 
==Mapping==
 
==Mapping==
 +
 
Align reads to the reference genome (indexed using 2bwt-builder) using each of -r0, -r1 and -r2:
 
Align reads to the reference genome (indexed using 2bwt-builder) using each of -r0, -r1 and -r2:
  
   soap -p4 -r0 -v 2 -m 0 -x 1000 -a /scratch/y82/bhurgobin/Brassica/DarmorV8/raw_reads/10x/Darmor_10x_R1_aa.fastq -b /scratch/y82/bhurgobin/Brassica/DarmorV8/raw_reads/10x/Darmor_10x_R2_aa.fastq -o /scratch/y82/bhurgobin/Brassica/DarmorV8/raw_reads/10x/Darmor_10x_R1_aa.fastq.paired.soap -2 /scratch/y82/bhurgobin/Brassica/DarmorV8/raw_reads/10x/Darmor_10x_R1_aa.fastq.single.soap -u /scratch/y82/bhurgobin/Brassica/DarmorV8/raw_reads/10x/Darmor_10x_R1_aa.fastq.unmapped.soap
+
   soap -p 4 -r 0 -v 2 -m 0 -x 1000 -a R1_library.fastq -b R2_library.fastq -o Output.paired.soap -2 Output.single.soap -u Output.unmapped.soap
 +
 
 +
This will run SOAPaligner with 4 CPUs, an insert size from 0 to 1000, and the two files R1_library.fastq and R2_library.fastq
  
 
Three types of outputs are produced from the SOAP mappings: paired mapped reads, single/unpaired mapped reads, unmapped reads. We only use the paired mapped reads for the coverage analysis.
 
Three types of outputs are produced from the SOAP mappings: paired mapped reads, single/unpaired mapped reads, unmapped reads. We only use the paired mapped reads for the coverage analysis.
Line 31: Line 49:
 
==Convert .soap to sorted .bam==
 
==Convert .soap to sorted .bam==
  
Use SOAP's soap2sam.pl and samtools view -bS to convert to bam, and sort using samtools sort, then index bam file (samtools index)
+
Use SOAP's soap2sam.pl:
 +
 
 +
    soap2sam.pl Output_paired.soap > Output_paired.sam
 +
 
 +
and samtools view to convert to bam,  
 +
 
 +
    samtools faidx your_reference_genome.fasta
 +
    samtools view -bt your_reference_genome.fasta.fai Output_paired.sam > Output_paired.bam
 +
 
 +
sort using samtools sort, then index bam file (samtools index)
 +
 +
    samtools sort Output_paired.bam Output_paired_sorted
 +
    samtools index Output_paired_sorted.bam
  
 
==Finding per base coverage using Bedtools==
 
==Finding per base coverage using Bedtools==
Line 52: Line 82:
 
==Custom script for coverage analysis - CoReFinder ==
 
==Custom script for coverage analysis - CoReFinder ==
  
[path DOWNLOAD]
+
[http://appliedbioinformatics.com.au/download/compareCoverage_V7_BH.R DOWNLOAD]
  
 
R script to categorise regions of interest based on various statistics:
 
R script to categorise regions of interest based on various statistics:
Line 88: Line 118:
  
 
The above regions could be merged into a single region (9101920 to 9103794), and in this way when we are looking for genes in this entire region, we are less likely to miss them compared to if we were looking at smaller regions.
 
The above regions could be merged into a single region (9101920 to 9103794), and in this way when we are looking for genes in this entire region, we are less likely to miss them compared to if we were looking at smaller regions.
 +
 +
 +
= FAQ =
 +
 +
To come
 +
 +
= Questions =
 +
 +
Please feel free to ask Philipp at philipp.bayer@uwa.edu.au if anything comes up!

Latest revision as of 04:10, 7 April 2016

CoReFinder - a pipeline to find collapsed and repetitive regions in genome assemblies

Methodology

The coverage analysis is based on SOAP mappings (-r0, -r1 and -r2) of PE reads:

  1. map uniquely to the genome (-r 0)
  2. map to more that one location in the genome, but only one hit is reported at random (-r 1)
  3. map to all locations in the genome, i.e. all hits (-r 2)

and for each of -r0, -r1 and -r2 mapping results, we compare the coverage at each position across the assembly using custom scripts.

Defining the regions of interest

Repeated/non-collapsed region

If we map reads to the reference with –r0, there will be regions where reads do not map. This could happen because of a number of reasons: the region is not in the raw data, the region in the genome where we want to map the reads contain N's or tiny repetitive regions or more likely that they represent large duplicated regions in the assembly. Because the reads can map to more than one place, using -r0 will discard them. If we map the same reads to Darmor with –r1, and identify regions empty with –r0 but covered with –r1 then these are duplicated in the assembly. Also, if we map reads with -r2, a duplicated, non-collapsed region should have roughly half of the reads in -r1 compared to -r2, a triplicated region should have roughly a third etc.

Repetitive/collapsed region

Coverage higher than the average coverage with -r0, -r1 and -r2 will be repetitive/collapsed regions of the assembly. This is because these reads have nowhere else to map in the genome, so they will pile up at the collapsed portion of the assembly.

Zero or low coverage regions

Zero or very low coverage with -r0, -r1 and -r2.

Pipeline

Prerequisites

1. bedtools from https://github.com/arq5x/bedtools2 - at least version v2.25.0, so that genomecov -d reports positions with 0 reads aligning

2. SOAPaligner from http://soap.genomics.org.cn/soapaligner.html

3. samtools

4. R

5. CoRefinder from Bhavna Hurgobin download

Mapping

Align reads to the reference genome (indexed using 2bwt-builder) using each of -r0, -r1 and -r2:

 soap -p 4 -r 0 -v 2 -m 0 -x 1000 -a R1_library.fastq -b R2_library.fastq -o Output.paired.soap -2 Output.single.soap -u Output.unmapped.soap

This will run SOAPaligner with 4 CPUs, an insert size from 0 to 1000, and the two files R1_library.fastq and R2_library.fastq

Three types of outputs are produced from the SOAP mappings: paired mapped reads, single/unpaired mapped reads, unmapped reads. We only use the paired mapped reads for the coverage analysis.

Convert .soap to sorted .bam

Use SOAP's soap2sam.pl:

   soap2sam.pl Output_paired.soap > Output_paired.sam

and samtools view to convert to bam,

   samtools faidx your_reference_genome.fasta
   samtools view -bt your_reference_genome.fasta.fai Output_paired.sam > Output_paired.bam

sort using samtools sort, then index bam file (samtools index)

   samtools sort Output_paired.bam Output_paired_sorted
   samtools index Output_paired_sorted.bam

Finding per base coverage using Bedtools

1.Run genome cov for each of the bam files from SOAP -r 0, -r 1, and -r 2: bedtools genomecov computes histograms per-base. The -d option reports an output line describing the observed coverage at each and every position in the genome.

 bedtools genomecov -d -ibam your_bam_file > your_output

2.Union Coverage To merge the results of the three coverage files. It takes about 10 min per chromosome.

 bedtools unionbedg -i ../Soap_r0/genomecov/${chro}r0.txt ../Soap_r1/genomecov/${chro}r1.txt ../Soap_r2/genomecov/${chro}r2.txt -header -names r0 r1 r2 > ${chro}_all.cov

In case that doesn't work, use bash paste:

 paste r0.cov r1.cov r2.cov | cut -f1-3,6,9 > all.cov

Custom script for coverage analysis - CoReFinder

DOWNLOAD

R script to categorise regions of interest based on various statistics:

 Rscript compareCoverage.R chrA01_all.cov chrA01

The script outputs a single file for each chromosome/scaffold, for e.g. chrA01_all_regions.out

Additional Utility programs

1.Bedtools can be used to compare annotation files, for e.g. if you want to compare the output from the coverage analysis with your genome annotation file, you can just do:

bedtools intersect -a genome_annotation.gff -b coverage_analysis.out -wa > interesting_genes.gff

The -wa option is used to write the original entry in genome_annotation.gff for each overlap, i.e. you will know if any of the collapsed regions contain genes, for e.g. You need to make sure that your genome_annotation.gff file is sorted by start position, and that the chromosome names are consistent between files, for e.g. if you have 'chrA01' in one file, and 'chrA01_contigs_placed' in the other, bedtools intersect will not work. It's better to split the genome_annotation.gff file into chromosome.

If you want to know what percentage of your gene is covered by your region of interest you can do:

bedtools intersect -a genome_annotation.gff -b coverage_analysis.gff -wa -wb -f 0.5 -r > interesting_genes.out

'-f 0.5' will report genes that are covered by at least 50%. This threshold can be changed if required. The complete usage of 'bedtools intersect' can be found here: http://bedtools.readthedocs.org/en/latest/content/example-usage.html

2.Bedtools can also be used to merge regions that are within a number of base pairs from each other:

 bedtools merge -i coverage_analysis.gff -d 1000 > coverage_analysis_merged_1kb.gff

This option may be useful in cases where the read mapping coverage in the regions of interest is not uniform, thereby creating 'gaps' in the output like this:

start  end  region.length r0  r1  r2
9101920 9102302 383   1170    1170    1170
9102342 9102732 391   1173    1173    1173
9102775 9103018 244   1137.5  1137.5  1137.5
9103093 9103794 702   1259.5  1259.5  1259.5

The above regions could be merged into a single region (9101920 to 9103794), and in this way when we are looking for genes in this entire region, we are less likely to miss them compared to if we were looking at smaller regions.


FAQ

To come

Questions

Please feel free to ask Philipp at philipp.bayer@uwa.edu.au if anything comes up!