This queue is for tickets about the HTML-Scrubber CPAN distribution.

Report information
The Basics
Id: 25477
Status: open
Priority: 0/
Queue: HTML-Scrubber

People
Owner: Nobody in particular
Requestors: nab83 [...] yahoo.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: self closing tags
Date: Thu, 15 Mar 2007 17:43:36 -0700 (PDT)
To: bug-HTML-Scrubber [...] rt.cpan.org
From: nabeel mohammed <nab83 [...] yahoo.com>
Hi,

I am trying to use HTML::Scrubber to clean some script tags and get the rest of the html. Here is an html fragment I am using:

<script src="www.google.com/script.js" />

<b> this is a line of bold </b>

<script type="text/javascript">
alert("hello")
</script>

<h> this is a line of bold </h>

And here is the perl code I am running:

my $scrubber = new HTML::Scrubber;
$scrubber->default(1);
my $scrubbed = $scrubber->scrub( $text );

print "$scrubbed";

All I see printed is

<h> this is a line of bold </h>

Now I might be missing something really obvious, but I can't figure it out.

Thanks
Nabeel
From: trendele [...] imtek.de
This is because HTML::Parser ignores self-closing tags by default, and HTML::Scrubber does not set empty_element_tags(). I suggest adding this to HTML::Scrubber.

In the meantime, you can set it manually as a workaround:

my $scrubber = HTML::Scrubber->new;
$scrubber->{_p}->empty_element_tags(1);

Now your example should work again.

On Thu Mar 15 22:30:41 2007, nab83@yahoo.com wrote:
> Hi,
> I am trying to use HTML::Scrubber to clean some script tags and get
> the rest of the html. Here is an html fragment I am using:
>
> <script src="www.google.com/script.js" />
>
> <b> this is a line of bold </b>
>
> <script type="text/javascript">
> alert("hello")
> </script>
>
> <h> this is a line of bold </h>
>
> And here is the perl code I am running:
>
> my $scrubber = new HTML::Scrubber;
> $scrubber->default(1);
> my $scrubbed = $scrubber->scrub( $text );
>
> print "$scrubbed";
>
> All I see printed is
>
> <h> this is a line of bold </h>
>
> Now I might be missing something really obvious, but I can't
> figure it out.
> Thanks
> Nabeel
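The explanation in the reply above can be checked without touching HTML::Scrubber's internals. The following is a minimal sketch that feeds the reporter's fragment to HTML::Parser directly and lists the tag events it reports with empty_element_tags off and on. The helper name tag_events is hypothetical and only used for illustration; it is not part of either module.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the start/end tag events HTML::Parser reports
# for a fragment, with empty_element_tags switched off (0) or on (1).
sub tag_events {
    my ($html, $empty_element_tags) = @_;
    my @events;
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { push @events, "start:$_[0]" }, 'tagname' ],
        end_h       => [ sub { push @events, "end:$_[0]" },   'tagname' ],
    );
    $parser->empty_element_tags($empty_element_tags);
    $parser->parse($html);
    $parser->eof;
    return @events;
}

# The fragment from the original report.
my $fragment = join "\n",
    '<script src="www.google.com/script.js" />',
    '<b> this is a line of bold </b>',
    '<script type="text/javascript"> alert("hello") </script>',
    '<h> this is a line of bold </h>';

for my $flag (0, 1) {
    print "empty_element_tags($flag): ",
        join(' ', tag_events($fragment, $flag)), "\n";
}
```

With the flag off, the first script tag is treated as an ordinary start tag, so everything up to the next </script> is consumed as script content and no event for <b> is reported. With the flag on, the <b> element should show up as its own start and end events.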
On Sun Jun 22 08:03:11 2008, trendele@imtek.de wrote:
> This is because HTML::Parser ignores self-closing tags by default, and
> HTML::Scrubber does not set empty_element_tags().
> I suggest adding this to HTML::Scrubber. In the meantime, you can set
> it manually as a workaround:
>
> my $scrubber = HTML::Scrubber->new;
> $scrubber->{_p}->empty_element_tags(1);

This proposed patch would cause another test to fail in t/07_booleans. In particular, after parsing, this:

<br />

would become:

<br></br>

That result is with HTML::Parser 3.56. Maybe newer versions of HTML::Parser are smarter.

Mark
On Wed Apr 22 17:43:34 2009, MARKSTOS wrote:
> On Sun Jun 22 08:03:11 2008, trendele@imtek.de wrote:
> > This is because HTML::Parser ignores self-closing tags by default, and
> > HTML::Scrubber does not set empty_element_tags().
> > I suggest adding this to HTML::Scrubber. In the meantime, you can set
> > it manually as a workaround:
> >
> > my $scrubber = HTML::Scrubber->new;
> > $scrubber->{_p}->empty_element_tags(1);
>
> This proposed patch would cause another test to fail in t/07_booleans.
> In particular, after parsing, this:
>
> <br />
> would become:
> <br></br>

On further review, I think this is acceptable behavior. When viewed under an XHTML transitional or 'strict' doctype, this renders as a single line break:

<br></br>

In quirks mode, it would count as two line breaks.

I think then this behavior is "good enough" and the resolution can be to update the tests to reflect this behavior.
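
For reference, a minimal sketch of where the extra closing tag comes from, assuming only documented HTML::Parser behavior: when empty_element_tags is enabled, an empty element tag produces an artificial end event in addition to the start event. The variable names below are illustrative and not taken from HTML::Scrubber.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @events;

# One shared handler that records the event name and the tag name.
my $record = sub {
    my ($event, $tagname) = @_;
    push @events, "$event:$tagname";
};

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ $record, 'event,tagname' ],
    end_h       => [ $record, 'event,tagname' ],
);
$parser->empty_element_tags(1);    # the setting proposed in this ticket
$parser->parse('<br />');
$parser->eof;

print "@events\n";
```

A serializer that writes one tag per event would turn that start/end pair into <br></br>, which is consistent with the t/07_booleans result reported above. Whether a given HTML::Scrubber release does exactly that is not verified here.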

