From Applied Bioinformatics Group
Jump to: navigation, search

CoReFinder - a pipeline to find collapsed and repetitive regions in genome assemblies


The coverage analysis is based on SOAP mappings (-r0, -r1 and -r2) of PE reads:

  1. map uniquely to the genome (-r 0)
  2. map to more that one location in the genome, but only one hit is reported at random (-r 1)
  3. map to all locations in the genome, i.e. all hits (-r 2)

and for each of -r0, -r1 and -r2 mapping results, we compare the coverage at each position across the assembly using custom scripts.

Defining the regions of interest

Repeated/non-collapsed region

If we map reads to the reference with –r0, there will be regions where reads do not map. This could happen because of a number of reasons: the region is not in the raw data, the region in the genome where we want to map the reads contain N's or tiny repetitive regions or more likely that they represent large duplicated regions in the assembly. Because the reads can map to more than one place, using -r0 will discard them. If we map the same reads to Darmor with –r1, and identify regions empty with –r0 but covered with –r1 then these are duplicated in the assembly. Also, if we map reads with -r2, a duplicated, non-collapsed region should have roughly half of the reads in -r1 compared to -r2, a triplicated region should have roughly a third etc.

Repetitive/collapsed region

Coverage higher than the average coverage with -r0, -r1 and -r2 will be repetitive/collapsed regions of the assembly. This is because these reads have nowhere else to map in the genome, so they will pile up at the collapsed portion of the assembly.

Zero or low coverage regions

Zero or very low coverage with -r0, -r1 and -r2.



1. bedtools from - at least version v2.25.0, so that genomecov -d reports positions with 0 reads aligning

2. SOAPaligner from

3. samtools

4. R

5. CoRefinder from Bhavna Hurgobin download


Align reads to the reference genome (indexed using 2bwt-builder) using each of -r0, -r1 and -r2:

 soap -p 4 -r 0 -v 2 -m 0 -x 1000 -a R1_library.fastq -b R2_library.fastq -o Output.paired.soap -2 Output.single.soap -u Output.unmapped.soap

This will run SOAPaligner with 4 CPUs, an insert size from 0 to 1000, and the two files R1_library.fastq and R2_library.fastq

Three types of outputs are produced from the SOAP mappings: paired mapped reads, single/unpaired mapped reads, unmapped reads. We only use the paired mapped reads for the coverage analysis.

Convert .soap to sorted .bam

Use SOAP's Output_paired.soap > Output_paired.sam

and samtools view to convert to bam,

   samtools faidx your_reference_genome.fasta
   samtools view -bt your_reference_genome.fasta.fai Output_paired.sam > Output_paired.bam

sort using samtools sort, then index bam file (samtools index)

   samtools sort Output_paired.bam Output_paired_sorted
   samtools index Output_paired_sorted.bam

Finding per base coverage using Bedtools

1.Run genome cov for each of the bam files from SOAP -r 0, -r 1, and -r 2: bedtools genomecov computes histograms per-base. The -d option reports an output line describing the observed coverage at each and every position in the genome.

 bedtools genomecov -d -ibam your_bam_file > your_output

2.Union Coverage To merge the results of the three coverage files. It takes about 10 min per chromosome.

 bedtools unionbedg -i ../Soap_r0/genomecov/${chro}r0.txt ../Soap_r1/genomecov/${chro}r1.txt ../Soap_r2/genomecov/${chro}r2.txt -header -names r0 r1 r2 > ${chro}_all.cov

In case that doesn't work, use bash paste:

 paste r0.cov r1.cov r2.cov | cut -f1-3,6,9 > all.cov

Custom script for coverage analysis - CoReFinder


R script to categorise regions of interest based on various statistics:

 Rscript compareCoverage.R chrA01_all.cov chrA01

The script outputs a single file for each chromosome/scaffold, for e.g. chrA01_all_regions.out

Additional Utility programs

1.Bedtools can be used to compare annotation files, for e.g. if you want to compare the output from the coverage analysis with your genome annotation file, you can just do:

bedtools intersect -a genome_annotation.gff -b coverage_analysis.out -wa > interesting_genes.gff

The -wa option is used to write the original entry in genome_annotation.gff for each overlap, i.e. you will know if any of the collapsed regions contain genes, for e.g. You need to make sure that your genome_annotation.gff file is sorted by start position, and that the chromosome names are consistent between files, for e.g. if you have 'chrA01' in one file, and 'chrA01_contigs_placed' in the other, bedtools intersect will not work. It's better to split the genome_annotation.gff file into chromosome.

If you want to know what percentage of your gene is covered by your region of interest you can do:

bedtools intersect -a genome_annotation.gff -b coverage_analysis.gff -wa -wb -f 0.5 -r > interesting_genes.out

'-f 0.5' will report genes that are covered by at least 50%. This threshold can be changed if required. The complete usage of 'bedtools intersect' can be found here:

2.Bedtools can also be used to merge regions that are within a number of base pairs from each other:

 bedtools merge -i coverage_analysis.gff -d 1000 > coverage_analysis_merged_1kb.gff

This option may be useful in cases where the read mapping coverage in the regions of interest is not uniform, thereby creating 'gaps' in the output like this:

start  end  region.length r0  r1  r2
9101920 9102302 383   1170    1170    1170
9102342 9102732 391   1173    1173    1173
9102775 9103018 244   1137.5  1137.5  1137.5
9103093 9103794 702   1259.5  1259.5  1259.5

The above regions could be merged into a single region (9101920 to 9103794), and in this way when we are looking for genes in this entire region, we are less likely to miss them compared to if we were looking at smaller regions.


To come


Please feel free to ask Philipp at if anything comes up!