Skip Menu |
 

This queue is for tickets about the Bio-SamTools CPAN distribution.

Report information
The Basics
Id: 69903
Status: open
Priority: 0/
Queue: Bio-SamTools

People
Owner: Nobody in particular
Requestors: kr2 [...] sanger.ac.uk
Cc:
AdminCc:

Bug Information
Severity: Critical
Broken in: 1.29
Fixed in: (no value)



Subject: Coverage and pileup functions corrupted by samtools core
Download (untitled) / with headers
text/plain 1003b
Hi Lincoln, In the core of the pileup function a default max depth of 8000 reads is set. Can you expose the function that can override this value (bam_pileup.c bam_plp_set_maxcnt) so that accurate coverage stats can be generated please. The case that I found to expose this issue most dramatically is when a file contains split reads: 1. You have a position spanned by 8000 split reads that start at ~1:2000. 2. This position has 500 unsplit reads starting at 1:2500. 3. You do a mpileup on the position 1:2600. 4. None of the unsplit reads are seen by the function as the reads with the lowest start coord will be parsed first even though they do not cover the position. I've attached the following: 1. A BAM that exhibits this issue containing 8001 records, this also shows that the cutoff is actually 8000-1. 2. A dump of commands showing how the result changes from a bam with the last 7999 records up to the one I've attached. 3. The script used to produce the output in (2). Regards, Keiran
Subject: read_depth_at_pos.pl
#!/usr/bin/env perl use strict; use Data::Dumper; use Carp; use English qw( -no_match_vars ); use warnings FATAL => 'all'; use Bio::DB::Sam; print 'Bio::DB::Sam version: ',Bio::DB::Sam->VERSION,"\n"; my $d_count = 0; my $bam = shift @ARGV; my $chr = shift @ARGV; my $pos = shift @ARGV; my $sam = Bio::DB::Sam->new(-bam => $bam, -split_splices => 1); my @alignments = $sam->get_features_by_location(-seq_id => $chr, -start => $pos, -end => $pos); print "Raw feature_by_location: "; print (scalar @alignments); print "\n"; my $segment = $sam->segment(-seq_id=>$chr,-start=>$pos,-end=>$pos); my ($coverage) = $segment->features('coverage'); print "coverage: ",$coverage->coverage,"\n"; my $pos_str = $chr.':'.$pos.'-'.$pos; $sam->pileup($pos_str,\&depth_callback); print "pileup: $d_count\n"; $d_count = 0; $sam->fast_pileup($pos_str,\&fast_depth_callback); print "fast_pileup: $d_count\n"; print $d_count,"\n"; sub depth_callback { my ($seqid,$pos_l,$pileup) = @_; if($pos_l == $pos) { foreach my $p(@{$pileup}) { next if($p->is_refskip); $d_count++; } } } sub fast_depth_callback { my ($seqid,$pos_l,$pileup, $t_sam) = @_; if($pos_l == $pos) { foreach my $p(@{$pileup}) { next if($p->is_refskip); $d_count++; } } }
Subject: l8001.bam
Download l8001.bam
application/octet-stream 419.6k

Message body not shown because it is not plain text.

Subject: dump_of_result.txt
Download dump_of_result.txt
text/plain 525b
cgp-1-1-01[kr2]5: ~kr2/noddy/read_depth_at_pos.pl l7999.bam 2 85133126 Bio::DB::Sam version: 1.29 Raw feature_by_location: 7999 coverage: 282 pileup: 282 fast_pileup: 282 282 cgp-1-1-01[kr2]6: ~kr2/noddy/read_depth_at_pos.pl l8000.bam 2 85133126 Bio::DB::Sam version: 1.29 Raw feature_by_location: 8000 coverage: 281 pileup: 281 fast_pileup: 281 281 cgp-1-1-01[kr2]7: ~kr2/noddy/read_depth_at_pos.pl l8001.bam 2 85133126 Bio::DB::Sam version: 1.29 Raw feature_by_location: 8001 coverage: 280 pileup: 280 fast_pileup: 280 280
Subject: Re: [rt.cpan.org #69903] Coverage and pileup functions corrupted by samtools core
Date: Wed, 31 Aug 2011 18:40:14 +0100
To: bug-Bio-SamTools [...] rt.cpan.org
From: John Marshall <jm18 [...] sanger.ac.uk>
Download (untitled) / with headers
text/plain 1016b
It would be good to expose this function (though I guess it would either limit Bio-SamTools to recent enough samtools, or require #ifdef trickery), but I think the real fix is to change samtools to not arbitrarily limit depth in single-sample pileups. The depth limit appears to have been introduced to avoid running out of memory in extreme cases, which is more likely with multiple samples in "samtools mpileup". Running out of memory with a single sample is surely exceptional, and I think producing incorrect results as Keiran has seen is a worse failure mode. See also this samtools-help message and surrounding thread: http://sourceforge.net/mailarchive/message.php?msg_id=28024517 Hopefully this samtools patch will be able to be applied. John -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.


This service is sponsored and maintained by Best Practical Solutions and runs on Perl.org infrastructure.

Please report any issues with rt.cpan.org to rt-cpan-admin@bestpractical.com.