This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 72975
Status: stalled
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: stas [...] sysd.org
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: (no value)
Fixed in: 4.2



Subject: newline separators between block elements in as_text()
Consider the following HTML sample:

    <p> <span>AAA</span> BBB </p> <h2>CCC</h2> DDD

The HTML::Element::as_text() method stringifies it as "AAABBBCCCDDD". Although correct, this is far from how a "real" browser actually renders it. links(1), lynx(1) & w3m(1) break lines this way:

    AAA​BBB
    CCC
    DDD​​

The attached patch tries to implement the same behavior in the as_text() method. The value of $/ is inserted in place of line breaks, and "\x{200b}" (Unicode zero-width space) separates text from adjacent inline elements. y/\x{200b}//d could be used to collapse the text for good, or even y/\x{200b}/\n/ when one is sure that CSS makes a <span> tag act as a block.

I'm not sure whether as_text() returning strings with "\n" would break stuff; at the very least, 'building.t' had to be patched. I would be glad to hear your opinions.
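The two post-processing steps mentioned above (y/\x{200b}//d and y/\x{200b}/\n/) can be sketched as follows. The input string here is a hand-written, simplified stand-in for what the patched as_text() is described as returning; it is an assumption for illustration, not captured output from the patch.

```perl
use strict;
use warnings;

# Hypothetical string in the shape described in the ticket: a zero-width
# space (ZWSP) after an inline element, a newline after a block element.
my $text = "AAA\x{200b}BBB\nCCC\nDDD";

# Option 1: delete every ZWSP so text around inline elements collapses.
( my $collapsed = $text ) =~ tr/\x{200b}//d;

# Option 2: turn every ZWSP into a newline, treating inline boundaries
# as line breaks (e.g. when CSS is known to make <span> a block).
( my $broken = $text ) =~ tr/\x{200b}/\n/;
```

Both variants work on copies, so the original string with its ZWSP markers stays available if a caller wants to decide later.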
Subject: as_text.patch
diff -adNru HTML-Tree-4.2.orig/lib/HTML/Element.pm HTML-Tree-4.2/lib/HTML/Element.pm
--- HTML-Tree-4.2.orig/lib/HTML/Element.pm 2011-04-06 05:37:54.000000000 -0300
+++ HTML-Tree-4.2/lib/HTML/Element.pm 2011-12-05 14:07:36.560782121 -0200
@@ -166,6 +166,26 @@
 
 my $nillio = [];
 
+# http://en.wikipedia.org/wiki/HTML_element#Block_elements
+my $block_tags = {
+    map { $_ => 1 } qw(
+        p
+        h1 h2 h3 h4 h5 h6
+        dl dt dd
+        ol ul li
+        dir
+        address
+        blockquote
+        center
+        del
+        div
+        hr
+        ins
+        noscript script
+        pre
+    )
+};
+
 *HTML::Element::emptyElement = \%HTML::Tagset::emptyElement; # legacy
 *HTML::Element::optionalEndTag = \%HTML::Tagset::optionalEndTag; # legacy
 *HTML::Element::linkElements = \%HTML::Tagset::linkElements; # legacy
@@ -1773,10 +1793,24 @@
             $text .= shift @pile;
         }
         else { # it's a ref -- traverse under it
-            unshift @pile, @{ $this->{'_content'} || $nillio }
-                unless ( $tag = ( $this = shift @pile )->{'_tag'} ) eq 'style'
-                    or $tag eq 'script'
-                    or ( $skip_dels and $tag eq 'del' );
+            $this = shift @pile;
+            $tag = $this->{'_tag'};
+            my @rest = @{ $this->{'_content'} || $nillio };
+
+            if ( exists $block_tags->{$tag} ) {
+                push @rest, $/;
+            }
+            elsif ( $tag eq 'br' ) {
+                push @rest, $/;
+            }
+            else {
+                push @rest, "\x{200b}"; # zero-width space (ZWSP)
+            }
+
+            unshift @pile, @rest
+                unless $tag eq 'style'
+                    or $tag eq 'script'
+                    or ( $skip_dels and $tag eq 'del' );
         }
     }
     return $text;
diff -adNru HTML-Tree-4.2.orig/t/building.t HTML-Tree-4.2/t/building.t
--- HTML-Tree-4.2.orig/t/building.t 2011-04-06 05:37:54.000000000 -0300
+++ HTML-Tree-4.2/t/building.t 2011-12-05 14:09:55.985039039 -0200
@@ -52,7 +52,10 @@
     isa_ok( $div, 'HTML::Element' );
 
     ### tests of various output formats
-    is( $div->as_text(), " 1 2 3 ", "Dump element in text format" );
+    {
+        local $/ = '';
+        is( $div->as_text(), " 1 2 3 ", "Dump element in text format" );
+    };
     is( $div->as_trimmed_text(), "1 2 3",
         "Dump element in trimmed text format" );
     is( $div->as_text_trimmed(), "1 2 3",
@@ -72,7 +75,10 @@
     isa_ok( $div2, 'HTML::Element' );
 
     ### test for RT #26436 user controlled white space
-    is( $div2->as_text(), " 1 &nbsp; 2 \xA0 3 ", "Dump element in text format" );
+    {
+        local $/ = '';
+        is( $div2->as_text(), " 1 &nbsp; 2 \xA0 3 ", "Dump element in text format" );
+    };
     is( $div2->as_trimmed_text(), "1 &nbsp; 2 \xA0 3",
         "Dump element in trimmed text format" );
     is( $div2->as_trimmed_text( extra_chars => '&nbsp;\xA0' ),
I'm not going to merge this as-is. It would break too much code that expects the current behavior.

The $block_tags hash really ought to be added to HTML::Tagset instead. One problem, though, is that <ins> and <del> are not necessarily block-level tags. They're either block-level or inline, depending on context.

as_text is never going to be a proper formatter like lynx. We already have the format method and HTML::FormatText for that.

I would consider a patch that added an option (like skip_dels) to add newlines after specified tags, as long as it wasn't too complex.

Pull requests on https://github.com/madsen/HTML-Tree are the preferred way to send patches.
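For discussion, the opt-in variant suggested above (an option naming the tags that get a newline after them) might look roughly like the sketch below. It is not a patch against HTML::Element: the function name text_with_breaks and the option name newline_after are hypothetical, and the tree is a simplified stand-in built from plain hash refs with '_tag' and '_content' keys, mirroring the fields the patch touches.

```perl
use strict;
use warnings;

# Sketch only: walks a simplified tree (hash refs with '_tag' and
# '_content'; plain strings are text nodes). 'newline_after' is a
# hypothetical option, not an existing as_text() parameter.
sub text_with_breaks {
    my ( $node, %opt ) = @_;
    my %break_after = map { $_ => 1 } @{ $opt{newline_after} || [] };

    my @pile = ($node);
    my $text = '';
    while (@pile) {
        my $this = shift @pile;
        if ( !ref $this ) {    # text node: append as-is
            $text .= $this;
            next;
        }
        my $tag = $this->{_tag};
        next if $tag eq 'style' or $tag eq 'script';    # skipped, as in as_text()

        my @rest = @{ $this->{_content} || [] };
        push @rest, "\n" if $break_after{$tag};    # only for requested tags
        unshift @pile, @rest;
    }
    return $text;
}

# Hand-built tree for: <div><p><span>AAA</span>BBB</p><h2>CCC</h2>DDD</div>
my $tree = {
    _tag     => 'div',
    _content => [
        { _tag => 'p', _content => [ { _tag => 'span', _content => ['AAA'] }, 'BBB' ] },
        { _tag => 'h2', _content => ['CCC'] },
        'DDD',
    ],
};

my $default = text_with_breaks($tree);
my $opted   = text_with_breaks( $tree, newline_after => [qw(p h2)] );
```

Without the option the result matches the current concatenating behavior, so existing callers would be unaffected; leaving the tag list to the caller also sidesteps the question of whether <ins> and <del> count as block-level.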

