This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 119186
Status: resolved
Priority: Low/Low
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: jon.rubin [...] grantstreet.com
Cc:
AdminCc:

BugTracker
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: HTML::TreeBuilder parses text-only HTML improperly without trailing whitespace
Date: Thu, 8 Dec 2016 14:29:53 -0500
To: bug-HTML-Tree@rt.cpan.org
From: Jon Rubin <jon.rubin@grantstreet.com>
When attempting to parse HTML consisting of only text, and no trailing whitespace, HTML::TreeBuilder returns incorrect results:

# No whitespace
1. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b = HTML::TreeBuilder->new; $b->parse("text"); dd $b->guts;'
()
# Trailing whitespace
2. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b = HTML::TreeBuilder->new; $b->parse("text "); dd $b->guts;'
"text"
# Leading whitespace
3. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b = HTML::TreeBuilder->new; $b->parse(" text"); dd $b->guts;'
()
# Middle whitespace
4. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b = HTML::TreeBuilder->new; $b->parse("text more"); dd $b->guts;'
"text"
# Middle and Trailing whitespace
5. ]$ perl -MHTML::TreeBuilder -MData::Dump -e '$b = HTML::TreeBuilder->new; $b->parse("text text "); dd $b->guts;'

Cases 1, 3, and 4 show omissions from the returned text, but adding trailing whitespace to them corrects the problem.

Unfortunately, my XS-fu is not up to snuff, so I can't provide a patch.

Distribution: HTML-Tree-5.03
Perl Version: v5.22.2
OS: Linux/CentOS 6, more specifically:
]$ uname -a
Linux pexdev002-dev3.grantstreet.com 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Thanks in advance!

Jon

-- 
Jon Rubin
Grant Street Group
Ph: (412) 391-5555, Ext. 1323
It's probably just buffered: HTML::Parser has no way of knowing that the input stops there. Try calling $b->eof before calling guts.
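
A minimal sketch of that suggestion, assuming HTML::TreeBuilder is installed. The helper name parse_then_guts is hypothetical and only wraps the calls so they are easy to try; parse, eof, guts, as_HTML and delete are the real methods. Calling eof tells the underlying parser that no more input is coming, so any text it was still holding gets flushed into the tree before guts is read.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: parse a fragment, signal end of input, and return
# the "guts" as plain strings (element nodes are serialized via as_HTML).
sub parse_then_guts {
    my ($html) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);   # trailing text may still sit in the parser's buffer
    $tree->eof;            # end of input: flush whatever is still buffered

    # In list context, guts returns the non-implicit nodes and text segments.
    my @guts = map { ref $_ ? $_->as_HTML : $_ } $tree->guts;

    $tree->delete;         # release the tree once the strings are extracted
    return @guts;
}

print join("|", parse_then_guts("text")), "\n";
```

Without the eof call, this is the same sequence as case 1 in the report.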
Subject: Re: [rt.cpan.org #119186] HTML::TreeBuilder parses text-only HTML improperly without trailing whitespace
Date: Mon, 12 Dec 2016 13:11:23 -0500
To: bug-HTML-Tree@rt.cpan.org
From: Jon Rubin <jon.rubin@grantstreet.com>
Ah, that fixes my problems. Is there a reason HTML::TreeBuilder lets me call guts at all when the tree is in an incomplete state? Is there a different accessor I should be calling instead of guts for that?

Thanks,

Jon

On Mon, Dec 12, 2016 at 4:36 AM, Jeff Fearn via RT <bug-HTML-Tree@rt.cpan.org> wrote:
<URL: https://rt.cpan.org/Ticket/Display.html?id=119186 >

It's probably just buffered as HTML::Parser won't expect the input to stop there, try calling $b->eof before calling guts.



The correct method is probably new_from_content, which will call eof for you. I'm not sure there is a way to detect this, as it's HTML::Parser's buffer that hasn't been flushed, not HTML::*'s.
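
A short sketch of the new_from_content route, again assuming HTML::TreeBuilder is installed. The helper name guts_from_content is hypothetical; new_from_content is the real constructor, and it parses the string and calls eof itself, so no explicit eof is needed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: build a finished tree in one step and return the
# "guts" as plain strings (element nodes are serialized via as_HTML).
sub guts_from_content {
    my ($html) = @_;

    # new_from_content parses the content and signals end of input for us.
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @guts = map { ref $_ ? $_->as_HTML : $_ } $tree->guts;

    $tree->delete;   # release the tree once the strings are extracted
    return @guts;
}

print join("|", guts_from_content("text")), "\n";
```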

