This queue is for tickets about the HTML-Tree CPAN distribution.

Report information
The Basics
Id: 19074
Status: resolved
Priority: 0/
Queue: HTML-Tree

People
Owner: Nobody in particular
Requestors: sburke [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: [Fwd: &nbsp; and \S (\s) regexp in HTML::TreeBuilder]
From: "Sean M. Burke" <sburke [...] cpan.org>
To: Andy Lester <andy [...] petdance.com>
Date: Thu, 04 May 2006 01:40:00 -0800
Resent-From: Andy Lester <andy [...] petdance.com>
Resent-To: bug-html-tree [...] rt.cpan.org
Resent-CC: Tatsuhiko Miyagawa <miyagawa [...] gmail.com>
Resent-Date: Thu, 4 May 2006 09:38:59 -0500

I've found an interesting (maybe corner-case) behavior in how HTML::TreeBuilder handles &nbsp; in HTML snippets.

Short Version: &nbsp; is decoded to U+00A0 in Unicode strings and matches /\s/, and is thus sometimes broken by HTML::TreeBuilder's tighten/delete_ignorable_whitespaces stuff.

Long Version: HTML::TreeBuilder has options called ignore_ignorable_whitespace and no_space_compacting. Here's an interesting script that behaves weirdly:

    use Test::More tests => 1;
    use HTML::TreeBuilder;

    my $body = "<p>&nbsp;&nbsp;</p><p>\x{34df}</p>";
    my $t = HTML::TreeBuilder->new;
    # Uncomment these two lines and test is now fine
    #$t->no_space_compacting(1);
    #$t->ignore_ignorable_whitespace(0);
    $t->parse($body);
    $t->eof;
    like $t->guts->as_XML, qr/&#160;/;

So, when you pass a Unicode-flagged string to HTML::TreeBuilder's parse() (which I think is the right thing to do to avoid bad HTML element expansion), &nbsp; is decoded to Unicode U+00A0 (which is \xc2\xa0 in UTF-8). U+00A0 actually matches the regular expression class \s, while a plain \xa0 (the latin-1 representation) doesn't. So both the no_space_compacting and ignore_ignorable_whitespace options are affected, since they use a /\S/ regular expression match.

I want HTML::TreeBuilder's default parameters to stay the same (i.e. no_space_compacting is OFF, ignore_ignorable_whitespace is ON), but I want &nbsp; (or &#160;) to be kept in the HTML, because it is meaningful in some cases.
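
The difference between the two string representations can be reproduced without HTML::TreeBuilder at all. The following is only a minimal sketch, assuming a perl where the `unicode_strings` feature is not enabled and no /u or /a regex modifier is used. The helper names `is_blank_default` and `is_blank_ascii` are made up for illustration, and the ASCII-only character class is just one possible way to keep U+00A0 out of the "ignorable" set, not necessarily what any patch to HTML-Tree does.

```perl
use strict;
use warnings;

# Default check: a string is "blank" if \S finds nothing in it.
# Whether \xa0 counts as whitespace here depends on how the string
# is stored internally (8-bit vs. UTF-8 flagged).
sub is_blank_default {
    my ($text) = @_;
    return $text !~ /\S/ ? 1 : 0;
}

# Hypothetical alternative: only ASCII whitespace is ignorable,
# so U+00A0 always counts as content, whatever the representation.
sub is_blank_ascii {
    my ($text) = @_;
    return $text !~ /[^ \t\n\r\f]/ ? 1 : 0;
}

my $bytes = "\xa0\xa0";    # 8-bit string, as latin-1 input would give
my $chars = "\xa0\xa0";
utf8::upgrade($chars);     # same characters, now stored as UTF-8

printf "default, 8-bit string:  %d\n", is_blank_default($bytes);  # 0
printf "default, UTF-8 string:  %d\n", is_blank_default($chars);  # 1
printf "ascii,   8-bit string:  %d\n", is_blank_ascii($bytes);    # 0
printf "ascii,   UTF-8 string:  %d\n", is_blank_ascii($chars);    # 0
```

The two strings compare equal with `eq`, yet the default check gives different answers for them, which is why the test script above only fails once the input contains a character such as \x{34df} that forces the whole string into the UTF-8 representation.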

From: cjm [...] pobox.com
On Thu May 04 10:39:32 2006, SBURKE wrote:
> Short Version: &nbsp; is decoded to U+00A0 in Unicode strings and
> matches with /\s/, and thus sometimes broken by HTML::TreeBuilder's
> tighten/delete_ignorable_whitespaces stuff.
I guess you missed that I had already submitted a patch for this (including a new test to make sure it works). It just hasn't been applied yet. See http://rt.cpan.org/Public/Bug/Display.html?id=17481
Applied Chris Madsen's patch from RT 17481, which fixes this corner case, to svn. This will be resolved in the next release of HTML-Tree.

