This queue is for tickets about the PDF-API2 CPAN distribution.

Report information

The Basics
Id: 66341
Status: resolved
Priority: Low/Low
Queue: PDF-API2

People
Owner: Nobody in particular
Requestors: jmcgowan [...] inch.com
Cc:
AdminCc:

BugTracker
Severity: (no value)
Broken in: (no value)
Fixed in: 2.026

Subject: BUGs in PDF-API2/Filter.pm
Date: Wed, 2 Mar 2011 20:54:35 -0500
To: bug-PDF-API2@rt.cpan.org
From: John McGowan <jmcgowan@inch.com>
Filter.pm is pretty bad.

The Run Length Encoder does no run length encoding (wrong back reference).

The Run Length Encoder does no run length decoding (misinterprets the counter byte).

The Base85 encoder outputs the base85 digits in the wrong order. The Base85 decoder has problems cleaning up the end for padded data.

The LZW decompressor is set for NO early-change (the default in many ADOBE files is WITH early-change). There is an infilt2 filter (early-change) not mentioned in the doc but it expects a 13 bit reset when the dictionary is full instead of a 12 bit reset. It can handle a very short early-change file but quite quickly gets out of sync.

Etc.

In short ... Filter.pm is pretty bad.

I had sent a report and my working version before but apparently it was not noticed.
Hi John,

I recently started maintaining PDF::API2 after a couple of years of the project not having a maintainer.

I'm working to create a test suite covering as much of the code as possible, and updating it to be more consistent and follow some best practices (the current code was written by multiple people over many years).
> The Run Length Encoder does no run length encoding (wrong back
> reference).
>
> The Run Length Encoder does no run length decoding (misinterprets
> the counter byte).
>
> The Base85 encoder outputs the base85 digits in the wrong order.
> The Base85 decoder has problems cleaning up the end for padded data.
>
> The LZW decompressor is set for NO early-change (the default in many
> ADOBE files is WITH early-change). There is an infilt2 filter
> (early-change) not mentioned in the doc but it expects a 13 bit
> reset when the dictionary is full instead of a 12 bit reset. It can
> handle a very short early-change file but quite quickly gets out of
> sync.
Would you be willing to write some test cases demonstrating where the current code is broken, along with your fixed version? That would be tremendously helpful.

If you're comfortable with Mercurial, you can get the most up to date code here:

http://deefs.net/hg/pdfapi2
or
http://bitbucket.org/ssimms/pdfapi2

Otherwise, attachments to this ticket will also be fine.

Thanks,
Steve Simms
Subject: Re: [rt.cpan.org #66341] BUGs in PDF-API2/Filter.pm
Date: Fri, 4 Mar 2011 14:52:47 -0500
To: bug-PDF-API2@rt.cpan.org
From: John McGowan <jmcgowan@inch.com>
On Thu, Mar 03, 2011 at 12:44:12PM -0500, Steve Simms via RT wrote:
> Hi John,
>
> I recently started maintaining PDF::API2 after a couple of years of
> the project not having a maintainer.
NOTE: I had sent this yesterday ... but was using my alternate ISP and tried to send via my usual ISP's mail server. I noticed it in my mail queue today. I see old PDF-API's (and even PDF-API3) and guess that they all have used the same Filter.pm module. If that is the case it has been broken for years and no one has noticed!
> I'm working to create a test suite covering as much of the code as
> possible, and updating it to be more consistent and follow some best
> practices (the current code was written by multiple people over many
> years).
>
> > The Run Length Encoder does no run length encoding (wrong back
> > reference).
> >
> > The Run Length Encoder does no run length decoding (misinterprets
> > the counter byte).
> >
> > The Base85 encoder outputs the base85 digits in the wrong order.
> > The Base85 decoder has problems cleaning up the end for padded data.
> >
> > The LZW decompressor is set for NO early-change (the default in many
> > ADOBE files is WITH early-change). There is an infilt2 filter
> > (early-change) not mentioned in the doc but it expects a 13 bit
> > reset when the dictionary is full instead of a 12 bit reset. It can
> > handle a very short early-change file but quite quickly gets out of
> > sync.
>
> Would you be willing to write some test cases demonstrating where the
> current code is broken, along with your fixed version? That would be
> tremendously helpful.
Well, test cases are not really necessary, as (at least some of) the errors are, well, actually blindingly obvious (if they weren't I would not have been able to handle them!).

For example, in the RLE encoder:

    m/^(.*?)((.)\2{2,127})(.*?)$/so)

Look at the back reference, \2, which is

    ((.)\2{2,127})

It looks for a string which contains itself repeated multiple times! It should have been \3. The RLE encoder does absolutely NO RLE encoding (a slight bug?).

In the base85 encoder:

    for ($j = 0; $j < 4; $j++) {
        $c[$j] = $b - int($b / 85) * 85 + 33;
        $b /= 85;
    }
    $res .= pack("C5", @c, $b + 33);

it converts to base85 by finding remainders (why not "%"?). The first one it gets is the last (least significant) base85 digit, and it lists them in order from least significant to most significant digit. That is backwards (most significant to least significant is the proper order, as the decoder properly handles them, the same as in base64 encoding).

Looking through the code, it took me no time to notice those two errors (I may not be a programmer, but I *am* a mathematician).

The LZW is a slick implementation but changes bit size without putting a limit on it (hoping it never goes over 12 bits). With 'early change' (Adobe's default), one has to limit it when the dictionary is full and the next indicator is a reset, for that should (in the cases I have seen) be 12 bits, not 13. It took me a while to track that down, for I knew nothing about the LZW compression algorithm or Adobe's implementation (and had to get a copy of their documentation, etc.).

I am no programmer (but I can manage to battle through some perl or javascript or C or Fortran II(!)) but here is what I have (I think this was sent about a bug to the email address for reporting bugs for PDF-API3, not PDF-API2; perhaps that explains why the bugs were not addressed).
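Editorial note: the back-reference point above can be checked with a few lines of Perl. This is only a sketch; `first_run` and the sample string are hypothetical and not part of PDF::API2. It compares the pattern as quoted in the ticket with the proposed `\3` variant.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PDF::API2): returns whatever the
# pattern captures in group 2, or undef when the pattern does not match.
sub first_run {
    my ($pattern, $data) = @_;
    return $data =~ $pattern ? $2 : undef;
}

# Pattern as quoted in the ticket: \2 sits inside the group it refers to.
my $reported = qr/^(.*?)((.)\2{2,127})(.*?)$/s;

# Proposed correction: \3 refers to the single captured byte.
my $proposed = qr/^(.*?)((.)\3{2,127})(.*?)$/s;

for my $case ([reported => $reported], [proposed => $proposed]) {
    my ($name, $pattern) = @$case;
    my $run = first_run($pattern, "xyaaaab");
    printf "%-8s %s\n", $name, defined $run ? "run '$run'" : 'no run found';
}
```

With the `\3` form, group 2 holds the whole run and group 4 holds the remainder of the input, so an encoder built on it would continue from `$4` rather than `$3`.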
> If you're comfortable with Mercurial, you can get the most up to date
I am not a programmer but ... well, I needed to be able to decode some old encodings in some malicious PDF files (some malware authors are now using such old encodings).

I have not tested the following except on a few files on which I have used them (they may still screw up a bit at the end of streams which have padding). I will put the text here and add a ZIP file (.net EOLs) containing it as an attachment.

[MY VERSION]

#=======================================================================
#
# THIS IS A REUSED PERL MODULE, FOR PROPER LICENCING TERMS SEE BELOW:
#
# Copyright Martin Hosken <Martin_Hosken@sil.org>
#
# No warranty or expression of effectiveness, least of all regarding
# anyone's safety, is implied in this software or documentation.
#
# This specific module is licensed under the Perl Artistic License.
#
#=======================================================================
package PDF::API2::Basic::PDF::Filter;

our $VERSION = '2.018';

no warnings qw[ deprecated recursion uninitialized ];

=head1 NAME

PDF::API2::Basic::PDF::Filter - Abstract superclass for PDF stream filters

=head1 SYNOPSIS

    $f = PDF::API2::Basic::PDF::Filter->new;
    $str = $f->outfilt($str, 1);
    print OUTFILE $str;

    while (read(INFILE, $dat, 4096))
    { $store .= $f->infilt($dat, 0); }
    $store .= $f->infilt("", 1);

=head1 DESCRIPTION

A Filter object contains state information for the process of outputting
and inputting data through the filter. The precise state information stored
is up to the particular filter and may range from nothing to whole objects
created and destroyed.

Each filter stores different state information for input and output and thus
may handle one input filtering process and one output filtering process at
the same time.

=head1 METHODS

=head2 PDF::API2::Basic::PDF::Filter->new

Creates a new filter object with empty state information ready for processing
data both input and output.

=head2 $dat = $f->infilt($str, $isend)

Filters from output to input the data. Notice that $isend == 0 implies that there
is more data to come and so following it $f may contain state information
(usually due to the break-off point of $str not being tidy). Subsequent calls
will incorporate this stored state information.

$isend == 1 implies that there is no more data to follow. The
final state of $f will be that the state information is empty. Error messages
are most likely to occur here since if there is required state information to
be stored following this data, then that would imply an error in the data.

=head2 $str = $f->outfilt($dat, $isend)

Filter stored data ready for output. Parallels C<infilt>.

=cut

sub new {
    my ($class) = @_;
    my ($self) = {};

    bless $self, $class;
}

sub release {
    my ($self) = @_;

    return($self) unless(ref $self);
    # delete stuff that we know we can, here
    my @tofree = map { delete $self->{$_} } keys %{$self};

    while (my $item = shift @tofree) {
        my $ref = ref($item);
        if (UNIVERSAL::can($item, 'release')) {
            $item->release();
        }
        elsif ($ref eq 'ARRAY') {
            push( @tofree, @{$item} );
        }
        elsif (UNIVERSAL::isa($ref, 'HASH')) {
            release($item);
        }
    }

    # check that everything has gone - it better had!
    foreach my $key (keys %{$self}) {
        # warn ref($self) . " still has '$key' key left after release.\n";
        $self->{$key}=undef;
        delete($self->{$key});
    }
}

package PDF::API2::Basic::PDF::ASCII85Decode;

our $VERSION = '2.018';

use base 'PDF::API2::Basic::PDF::Filter';

use strict;
no warnings qw[ deprecated recursion uninitialized ];

=head1 NAME

PDF::API2::Basic::PDF::ASCII85Decode - Ascii85 filter for PDF streams. Inherits from
L<PDF::API2::Basic::PDF::Filter>

=cut

sub outfilt {
    my ($self, $str, $isend) = @_;
    my ($res, $i, $j, $b, @c);

    if ($self->{'outcache'} ne "") {
        $str = $self->{'outcache'} . $str;
        $self->{'outcache'} = "";
    }
    for ($i = 0; $i < length($str); $i += 4) {
        $b = unpack("N", substr($str, $i, 4));
        if ($b == 0) {
            $res .= "z";
            next;
        }
        for ($j = 0; $j < 4; $j++) {
            $c[$j] = $b - int($b / 85) * 85 + 33;
            $b /= 85;
        }
        $res .= pack("C5", @c, $b + 33);
        $res .= "\n" if ($i % 60 == 56);
    }
    if ($isend && $i > length($str)) {
        $b = unpack("N", substr($str, $i - 4) . "\000\000\000");
        for ($j = 0; $j < 4; $j++) {
            $c[$j] = $b - int($b / 85) * 85 + 33;
            $b /= 85;
        }
        $res .= substr(pack("C5", @c, $b), 0, $i - length($str) + 1) . "->";
    }
    elsif ($i > length($str)) {
        $self->{'outcache'} = substr($str, $i - 4);
    }
    $res;
}

sub infilt {
    my ($self, $str, $isend) = @_;
    my ($res, $i, $j, @c, $b, $num);
    $num=0;
    if (exists($self->{'incache'}) && $self->{'incache'} ne "") {
        $str = $self->{'incache'} . $str;
        $self->{'incache'} = "";
    }
    $str =~ s/(\r|\n)\n?//og;
    for ($i = 0; $i < length($str); $i += 5) {
        $b = 0;
        if (substr($str, $i, 1) eq "z") {
            $i -= 4;
            $res .= pack("N", 0);
            next;
        }
        elsif ($isend && substr($str, $i, 6) =~ m/^(.{2,4})\~\>$/o) {
            $num = 5 - length($1);
            @c = unpack("C5", $1 . ("u" x (4 - $num))); # pad with 84 to sort out rounding
            $i = length($str);
        }
        else {
            @c = unpack("C5", substr($str, $i, 5));
        }

        for ($j = 0; $j < 5; $j++) {
            $b *= 85;
            $b += $c[$j] - 33;
        }
        $res .= substr(pack("N", $b), 0, 4 - $num);
    }
    if (!$isend && $i > length($str)) {
        $self->{'incache'} = substr($str, $i - 5);
    }
    $res;
}

package PDF::API2::Basic::PDF::RunLengthDecode;

our $VERSION = '2.018';

use base 'PDF::API2::Basic::PDF::Filter';

use strict;
no warnings qw[ deprecated recursion uninitialized ];

=head1 NAME

PDF::API2::Basic::PDF::RunLengthDecode - Run Length encoding filter for PDF streams. Inherits from
L<PDF::API2::Basic::PDF::Filter>

=cut

sub outfilt {
    my ($self, $str, $isend) = @_;
    my ($res, $s, $r);

    # no state information, just slight inefficiency at block boundaries
    while ($str ne "") {
        if ($str =~ m/^(.*?)((.)\2{2,127})(.*?)$/so) {
            $s = $1;
            $r = $2;
            $str = $3;
        }
        else {
            $s = $str;
            $r = '';
            $str = '';
        }

        while (length($s) > 127) {
            $res .= pack("C", 127) . substr($s, 0, 127);
            substr($s, 0, 127) = '';
        }
        $res .= pack("C", length($s)) . $s if length($s) > 0;
        $res .= pack("C", 257 - length($r));
    }
    $res .= "\x80" if ($isend);
    $res;
}

sub infilt {
    my ($self, $str, $isend) = @_;
    my ($res, $l, $d);

    if ($self->{'incache'} ne "") {
        $str = $self->{'incache'} . $str;
        $self->{'incache'} = "";
    }
    while ($str ne "") {
        $l = unpack("C", $str);
        if ($l == 128) {
            $isend = 1;
            return $res;
        }
        if ($l > 128) {
            if (length($str) < 2) {
                warn "Premature end to data in RunLengthEncoded data" if $isend;
                $self->{'incache'} = $str;
                return $res;
            }
            $res .= substr($str, 1, 1) x (257 - $l);
            substr($str, 0, 2) = "";
        }
        else {
            if (length($str) < $l + 1) {
                warn "Premature end to data in RunLengthEncoded data" if $isend;
                $self->{'incache'} = $str;
                return $res;
            }
            $res .= substr($str, 1, $l);
            substr($str, 0, $l + 1) = "";
        }
    }
    $res;
}

package PDF::API2::Basic::PDF::ASCIIHexDecode;

our $VERSION = '2.018';

use base 'PDF::API2::Basic::PDF::Filter';

use strict;
no warnings qw[ deprecated recursion uninitialized ];

=head1 NAME

PDF::API2::Basic::PDF::ASCIIHexDecode - Ascii Hex encoding (very inefficient) for PDF streams.
Inherits from L<PDF::API2::Basic::PDF::Filter>

=cut

sub outfilt {
    my ($self, $str, $isend) = @_;

    $str =~ s/(.)/sprintf("%02x", ord($1))/oge;
    $str .= ">" if $isend;
    $str;
}

sub infilt {
    my ($self, $str, $isend) = @_;

    $isend = ($str =~ s/>$//og);
    $str =~ s/\s//oig;
    $str =~ s/([0-9a-z])/pack("C", hex($1 . "0"))/oige if ($isend && length($str) & 1);
    $str =~ s/([0-9a-z]{2})/pack("C", hex($1))/oige;
    $str;
}

package PDF::API2::Basic::PDF::FlateDecode;

our $VERSION = '2.018';

use base 'PDF::API2::Basic::PDF::Filter';

use strict;
no warnings qw[ deprecated recursion uninitialized ];

our $havezlib;

BEGIN {
    eval {require "Compress/Zlib.pm";};
    $havezlib = !$@;
}

sub new {
    return undef unless $havezlib;
    my ($class) = @_;
    my ($self) = {};

    $self->{'outfilt'} = Compress::Zlib::deflateInit(
        -Level=>9,
        -Bufsize=>32768,
    );
    $self->{'infilt'} = Compress::Zlib::inflateInit();
    bless $self, $class;
}

sub outfilt {
    my ($self, $str, $isend) = @_;
    my ($res);

    $res = $self->{'outfilt'}->deflate($str);
    $res .= $self->{'outfilt'}->flush() if ($isend);
    $res;
}

sub infilt {
    my ($self, $dat, $last) = @_;
    my ($res, $status) = $self->{'infilt'}->inflate("$dat");
    $res;
}

package PDF::API2::Basic::PDF::LZWDecode;

our $VERSION = '2.018';

use base 'PDF::API2::Basic::PDF::FlateDecode';

no warnings qw[ deprecated recursion uninitialized ];

our @basedict = map {pack("C", $_)} (0 .. 255, 0, 0);

sub new {
    my ($class) = @_;
    my ($self) = {};

    $self->{indict} = [@basedict];
    $self->{bits} = 9;
    $self->{insize} = $self->{bits};
    $self->{resetcode}=1<<($self->{insize}-1);
    $self->{endcode}=$self->{resetcode}+1;
    $self->{nextcode}=$self->{endcode}+1;

    bless $self, $class;
}

sub infilt {
    my ($self, $dat, $last) = @_;
    my ($num, $cache, $cache_size, $res);

    while ($dat ne '' || $cache_size > 0) {
        ($num, $cache, $cache_size) = $self->read_dat(\$dat, $cache, $cache_size, $self->{'insize'});

        # this was a little arkward to comprehand
        # here is a better version -- fredo
        $self->{'insize'}++ if($self->{nextcode} == (1<<$self->{'insize'}));
        if($num==$self->{resetcode}) {
            $self->{'insize'}=$self->{bits};
            $self->{nextcode}=$self->{endcode}+1;
            next;
        }
        elsif($num==$self->{endcode}) {
            last;
        }
        elsif($num<$self->{resetcode}) {
            $self->{'indict'}[$self->{nextcode}] = $self->{'indict'}[$num];
            $res.=$self->{'indict'}[$self->{nextcode}];
            $self->{nextcode}++;
        }
        elsif($num>$self->{endcode}) {
            $self->{'indict'}[$self->{nextcode}] = $self->{'indict'}[$num];
            $self->{'indict'}[$self->{nextcode}].= substr($self->{'indict'}[$num+1],0,1);
            $res.=$self->{'indict'}[$self->{nextcode}];
            $self->{nextcode}++;
        }
        else {
            die "we shouldn't be here !";
        }
    }
    return $res;
}

sub infilt2 {
    my ($self, $dat, $last) = @_;
    my ($num, $cache, $cache_size, $res);

    while ($dat ne '' || $cache_size > 0) {
        ($num, $cache, $cache_size) = $self->read_dat(\$dat, $cache, $cache_size, $self->{'insize'});

        # this was a little arkward to comprehand
        # here is a better version -- fredo
        if($num==$self->{resetcode}) {
            $self->{'insize'}=$self->{bits};
            $self->{nextcode}=$self->{endcode}+1;
            next;
        }
        elsif($num==$self->{endcode}) {
            last;
        }
        elsif($num<$self->{resetcode}) {
            $self->{'indict'}[$self->{nextcode}] = $self->{'indict'}[$num];
            $res.=$self->{'indict'}[$self->{nextcode}];
            $self->{nextcode}++;
        }
        elsif($num>$self->{endcode}) {
            $self->{'indict'}[$self->{nextcode}] = $self->{'indict'}[$num];
            $self->{'indict'}[$self->{nextcode}].= substr($self->{'indict'}[$num+1],0,1);
            $res.=$self->{'indict'}[$self->{nextcode}];
            $self->{nextcode}++;
        }
        else {
            die "we shouldn't be here !";
        }
        $self->{'insize'}++ if($self->{nextcode} == (1<<$self->{'insize'}));
    }
    return $res;
}

sub read_dat {
    my ($self, $rdat, $cache, $size, $len) = @_;
    my ($res);

    while ($size < $len) {
        $cache = ($cache << 8) + unpack("C", $$rdat);
        substr($$rdat, 0, 1) = '';
        $size += 8;
    }

    $res = $cache >> ($size - $len);
    $cache &= (1 << ($size - $len)) - 1;
    $size -= $len;
    ($res, $cache, $size);
}

1;

Message body not shown because it is not plain text.

Subject: Re: [rt.cpan.org #66341] BUGs in PDF-API2/Filter.pm
Date: Fri, 4 Mar 2011 14:59:58 -0500
To: bug-PDF-API2@rt.cpan.org
From: John McGowan <jmcgowan@inch.com>
Re: the base85 encoder.

Looking at my code when the data has to be padded:

    $b = unpack("N", substr($str, $i) . "\000\000\000")

If only one or two nulls should be added to pad, this may make the string too long and throw off the value of $b (too large by a factor of 256^n).

Well ... as I said, I am not a programmer!
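Editorial note: a minimal sketch of an Ascii85 encoder that emits digits most significant first and pads a short final group to exactly four bytes. `ascii85_encode` is a hypothetical whole-string helper, not the module's streaming `outfilt`. As far as the padding worry goes, `unpack("N", ...)` reads only the first four bytes of its input, so extra trailing nulls would be ignored rather than inflate the value; padding to exactly four bytes just makes the intent explicit.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's outfilt): encodes a complete
# string in one call and appends the "~>" end-of-data marker.
sub ascii85_encode {
    my ($data) = @_;
    my $out = '';

    for (my $i = 0; $i < length $data; $i += 4) {
        my $chunk = substr($data, $i, 4);
        my $n     = length $chunk;                 # 1..4 real bytes

        # Pad the final short chunk to exactly four bytes.
        my $value = unpack('N', $chunk . ("\0" x (4 - $n)));

        # Only a full group of four zero bytes may use the "z" shortcut.
        if ($n == 4 && $value == 0) {
            $out .= 'z';
            next;
        }

        # Remainders come out least significant first, so unshift them
        # to build the digit list most significant first.
        my @digits;
        for (1 .. 5) {
            unshift @digits, $value % 85;
            $value = int($value / 85);
        }

        # A partial group of n bytes keeps only its first n + 1 digits.
        $out .= join '', map { chr($_ + 33) } @digits[0 .. $n];
    }
    return $out . '~>';
}

print ascii85_encode("sure."), "\n";
```

This sketch omits the line breaks that the module inserts into its output and assumes a Perl built with 64-bit integers.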
Update on this issue:

I rewrote the RunLengthDecode filter last night (complete with tests, see changeset 4853928), and it should be working properly now. It will be included in release 2.021.

ASCII85Decode and LZWDecode haven't been touched yet.
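Editorial note: for reference, a sketch of how the counter byte is interpreted under the PDF specification's RunLengthDecode rules. This is not the code from the changeset mentioned above; `run_length_decode` is a hypothetical helper that assumes the whole stream is available at once and does not report truncated input.

```perl
use strict;
use warnings;

# Hypothetical helper: decodes a complete RunLengthDecode stream.
sub run_length_decode {
    my ($data) = @_;
    my ($out, $pos) = ('', 0);

    while ($pos < length $data) {
        my $len = ord substr($data, $pos++, 1);
        last if $len == 128;                       # end-of-data marker

        if ($len < 128) {
            # 0..127: copy the next $len + 1 bytes literally.
            $out .= substr($data, $pos, $len + 1);
            $pos += $len + 1;
        }
        else {
            # 129..255: repeat the next single byte 257 - $len times.
            $out .= substr($data, $pos, 1) x (257 - $len);
            $pos += 1;
        }
    }
    return $out;
}

print run_length_decode("\x02abc\xFEz\x80"), "\n";
```

Note that a literal counter of n means n + 1 bytes follow, whereas the `infilt` in the version pasted earlier in this ticket copies n bytes.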
Subject: Re: [rt.cpan.org #66341] BUGs in PDF-API2/Filter.pm
Date: Mon, 21 Jan 2013 21:06:46 -0500 (EST)
To: Steve Simms via RT <bug-PDF-API2@rt.cpan.org>
From: John McGowan <jmcgowan@inch.com>
On Mon, 21 Jan 2013, Steve Simms via RT wrote:
> <URL: https://rt.cpan.org/Ticket/Display.html?id=66341 >
>
> Update on this issue:
>
> I rewrote the RunLengthDecode filter last night (complete with tests, see
> changeset 4853928), and it should be working properly now. It will be
> included in release 2.021.
>
> ASCII85Decode and LZWDecode haven't been touched yet.
The base85 encoder puts the data in the wrong order! Apparently no one tried to use the base85 encoder and then the decoder to see if they worked.

The RunLengthEncode doesn't find any runs of data to compress.

The one that caught me (in examining a malicious PDF file that used LZW compression) was the LZW decompressor with early change. That's what got me to look at the file (at one time it seems that malware authors would obfuscate malicious Javascript in PDFs with chains of old filters: first dehex, then remove LZW compression (with "early change"), then base 85 decode that and finally use the runlength decoder to see the malicious code). I haven't seen that done in a while (but I don't see all the malicious PDFs out there).

Have fun with it!

Regards from:
John McGowan | jmcgowan@inch.com [Internet Channel]
--------------+-----------------------------------------------------
The ASCII85Decode filter should now encode and decode properly. If you find any exceptions, please let me know. The LZWDecode now checks for the EarlyChange parameter (default on, per the PDF spec) and has a bunch of fixes that should result in it working properly, as long as a predictor algorithm isn't being used. Both of these fixes can be found at GitHub (https://github.com/ssimms/pdfapi2) now, and will be in the upcoming 2.026 release.
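Editorial note: a sketch of how EarlyChange and the 12-bit ceiling interact when a decoder chooses how many bits to read next. This is not the code from the 2.026 release; `lzw_code_width` is a hypothetical helper, and it assumes one common decoder-side convention in which `$next_code` is the index the decoder will assign to its next dictionary entry (258 right after a clear-table code). Other implementations count from the encoder's side and shift the thresholds by one.

```perl
use strict;
use warnings;

# Hypothetical helper: number of bits to read for the next LZW code.
#   $next_code    - index the decoder will assign to its next entry
#   $early_change - 1 (the PDF default) or 0
sub lzw_code_width {
    my ($next_code, $early_change) = @_;

    # With EarlyChange the width grows one code sooner.
    my $limit = $next_code + ($early_change ? 1 : 0);

    my $bits = 9;
    # Grow the width as the table fills, but never beyond 12 bits.
    $bits++ while $bits < 12 && $limit >= (1 << $bits);
    return $bits;
}

for my $next (510, 511, 512, 4095, 4096) {
    printf "next=%4d early=%d bits, late=%d bits\n",
        $next, lzw_code_width($next, 1), lzw_code_width($next, 0);
}
```

The cap is the point raised in this ticket: once the table is full, the following code (normally a clear-table code) is still read with 12 bits, not 13.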

