This queue is for tickets about the Compress-Bzip2 CPAN distribution.

Report information
The Basics
Id: 126269
Status: new
Priority: Low/Low

People
Owner: Nobody in particular
Requestors: standley [...] biken.osaka-u.ac.jp
Cc:
AdminCc:

BugTracker
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: truncated lines in bzreadline?
Date: Tue, 14 Aug 2018 17:08:24 +0900
To: bug-Compress-Bzip2@rt.cpan.org
From: Daron Standley <standley@biken.osaka-u.ac.jp>
Hi, I have been playing around with Perl for a few hours, and I am very impressed with how fast it reads a huge bz2-compressed file. Just to give some numbers:

Time required to read a space-delimited bz2 file with 1000 lines, each 557780 characters long (278890 single-digit integers (0-9) separated by whitespace):

python pd.read_csv(file, compression='bz2', header=0): 14 min
python subprocess('bunzip2 -c ' + file): 7 min
perl open('bunzip2 -c $file |'): 66 sec!!

So, I next started trying to use the Compress::Bzip2 module. However, I noticed that the bzreadline function was returning only 4096 characters per call for these files.

So, for example, I get the following when using bunzip2:

my $cmd = "bunzip2 -c $fbz2 |";
open(FBZ, $cmd) or die "Cannot run bunzip2 on $fbz2: $!\n";
while (<FBZ>) {
  my @line = split(/\s+/);   # split the current line ($_) into fields
  printf("len %d\n", scalar(@line));
}
close(FBZ);

len 278890
len 278890
.
.
.
But when I use bzreadline as follows:

my $bz = bzopen($fbz, "rb")
  or die "Cannot open $fbz: $bzerrno\n";
while ($bz->bzreadline($_) > 0) {
  my @line = split(/\s+/);
  printf("len %d\n", scalar(@line));
}

$bz->bzclose();

I get:

len 2048
len 2048
.
.

I am guessing there is a buffer I can set somewhere, but I couldn't figure this out by myself. If you have any clues, I would be grateful.

Thanks a lot

DMS
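
For comparison, here is a minimal sketch of one possible alternative for reading whole lines without shelling out to bunzip2. It does not explain or change the bzreadline behaviour described above. It assumes the IO::Compress::Bzip2 and IO::Uncompress::Bunzip2 modules are installed (they ship with recent Perl releases). The field_counts helper and the generated sample file are hypothetical and exist only to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);

# Hypothetical helper: count the whitespace-separated fields on each
# line of a bz2 file. getline() returns one complete line per call,
# or undef at end of file.
sub field_counts {
    my ($path) = @_;
    my $z = IO::Uncompress::Bunzip2->new($path)
        or die "Cannot open $path: $Bunzip2Error\n";
    my @counts;
    while (defined(my $line = $z->getline())) {
        my @fields = split ' ', $line;
        push @counts, scalar @fields;
    }
    $z->close();
    return @counts;
}

# Hypothetical test data: 3 lines of 5000 single-digit integers each,
# so every line is well over 4096 characters long.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/sample.txt.bz2";
my $line = join(' ', (7) x 5000) . "\n";
my $text = $line x 3;
bzip2(\$text => $path) or die "bzip2 failed: $Bzip2Error\n";

printf("len %d\n", $_) for field_counts($path);
```

With the sample data above, each line should report 5000 fields. Whether this approach matches the speed of the piped bunzip2 command on the real file has not been measured here.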




