CoReFinder
CoReFinder - a pipeline to find collapsed and repetitive regions in genome assemblies
Methodology
The coverage analysis is based on SOAP mappings (-r0, -r1 and -r2) of PE reads:
- map uniquely to the genome (-r 0)
- map to more that one location in the genome, but only one hit is reported at random (-r 1)
- map to all locations in the genome, i.e. all hits (-r 2)
and for each of -r0, -r1 and -r2 mapping results, we compare the coverage at each position across the assembly using custom scripts.
Defining the regions of interest
Repeated/non-collapsed region
If we map reads to the reference with –r0, there will be regions where reads do not map. This could happen because of a number of reasons: the region is not in the raw data, the region in the genome where we want to map the reads contain N's or tiny repetitive regions or more likely that they represent large duplicated regions in the assembly. Because the reads can map to more than one place, using -r0 will discard them. If we map the same reads to Darmor with –r1, and identify regions empty with –r0 but covered with –r1 then these are duplicated in the assembly. Also, if we map reads with -r2, a duplicated, non-collapsed region should have roughly half of the reads in -r1 compared to -r2, a triplicated region should have roughly a third etc.
Repetitive/collapsed region
Coverage higher than the average coverage with -r0, -r1 and -r2 will be repetitive/collapsed regions of the assembly. This is because these reads have nowhere else to map in the genome, so they will pile up at the collapsed portion of the assembly.
Zero or low coverage regions
Zero or very low coverage with -r0, -r1 and -r2.
Pipeline
Prerequisites
1. bedtools from https://github.com/arq5x/bedtools2 - at least version v2.25.0, so that genomecov -d reports positions with 0 reads aligning
2. SOAPaligner from http://soap.genomics.org.cn/soapaligner.html
3. samtools
4. R
5. CoRefinder from Bhavna Hurgobin download
Mapping
Align reads to the reference genome (indexed using 2bwt-builder) using each of -r0, -r1 and -r2:
soap -p 4 -r 0 -v 2 -m 0 -x 1000 -a R1_library.fastq -b R2_library.fastq -o Output.paired.soap -2 Output.single.soap -u Output.unmapped.soap
This will run SOAPaligner with 4 CPUs, an insert size from 0 to 1000, and the two files R1_library.fastq and R2_library.fastq
Three types of outputs are produced from the SOAP mappings: paired mapped reads, single/unpaired mapped reads, unmapped reads. We only use the paired mapped reads for the coverage analysis.
Convert .soap to sorted .bam
Use SOAP's soap2sam.pl:
soap2sam.pl Output_paired.soap > Output_paired.sam
and samtools view to convert to bam,
samtools faidx your_reference_genome.fasta samtools view -bt your_reference_genome.fasta.fai Output_paired.sam > Output_paired.bam
sort using samtools sort, then index bam file (samtools index)
samtools sort Output_paired.bam Output_paired_sorted samtools index Output_paired_sorted.bam
Finding per base coverage using Bedtools
1.Run genome cov for each of the bam files from SOAP -r 0, -r 1, and -r 2: bedtools genomecov computes histograms per-base. The -d option reports an output line describing the observed coverage at each and every position in the genome.
bedtools genomecov -d -ibam your_bam_file > your_output
2.Union Coverage To merge the results of the three coverage files. It takes about 10 min per chromosome.
bedtools unionbedg -i ../Soap_r0/genomecov/${chro}r0.txt ../Soap_r1/genomecov/${chro}r1.txt ../Soap_r2/genomecov/${chro}r2.txt -header -names r0 r1 r2 > ${chro}_all.cov
In case that doesn't work, use bash paste:
paste r0.cov r1.cov r2.cov | cut -f1-3,6,9 > all.cov
Custom script for coverage analysis - CoReFinder
R script to categorise regions of interest based on various statistics:
Rscript compareCoverage.R chrA01_all.cov chrA01
The script outputs a single file for each chromosome/scaffold, for e.g. chrA01_all_regions.out
Additional Utility programs
1.Bedtools can be used to compare annotation files, for e.g. if you want to compare the output from the coverage analysis with your genome annotation file, you can just do:
bedtools intersect -a genome_annotation.gff -b coverage_analysis.out -wa > interesting_genes.gff
The -wa option is used to write the original entry in genome_annotation.gff for each overlap, i.e. you will know if any of the collapsed regions contain genes, for e.g. You need to make sure that your genome_annotation.gff file is sorted by start position, and that the chromosome names are consistent between files, for e.g. if you have 'chrA01' in one file, and 'chrA01_contigs_placed' in the other, bedtools intersect will not work. It's better to split the genome_annotation.gff file into chromosome.
If you want to know what percentage of your gene is covered by your region of interest you can do:
bedtools intersect -a genome_annotation.gff -b coverage_analysis.gff -wa -wb -f 0.5 -r > interesting_genes.out
'-f 0.5' will report genes that are covered by at least 50%. This threshold can be changed if required. The complete usage of 'bedtools intersect' can be found here: http://bedtools.readthedocs.org/en/latest/content/example-usage.html
2.Bedtools can also be used to merge regions that are within a number of base pairs from each other:
bedtools merge -i coverage_analysis.gff -d 1000 > coverage_analysis_merged_1kb.gff
This option may be useful in cases where the read mapping coverage in the regions of interest is not uniform, thereby creating 'gaps' in the output like this:
start end region.length r0 r1 r2 9101920 9102302 383 1170 1170 1170 9102342 9102732 391 1173 1173 1173 9102775 9103018 244 1137.5 1137.5 1137.5 9103093 9103794 702 1259.5 1259.5 1259.5
The above regions could be merged into a single region (9101920 to 9103794), and in this way when we are looking for genes in this entire region, we are less likely to miss them compared to if we were looking at smaller regions.
FAQ
To come
Questions
Please feel free to ask Philipp at philipp.bayer@uwa.edu.au if anything comes up!