This queue is for tickets about the XML-Twig CPAN distribution.

Report information
The Basics
Id: 86633
Status: resolved
Priority: 0/
Queue: XML-Twig

People
Owner: Nobody in particular
Requestors: melmothx [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: 3.48
Fixed in: (no value)



From melmothx [...] gmail.com Tue Jul 2 08:47:39 2013
Subject: safe_parse_html fails to parse valid HTML with entities (in some cases)
Date: Tue, 02 Jul 2013 14:46:23 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: Marco Pessotto <melmothx [...] gmail.com>

Hello there!

It looks like entities (at least the very common '&amp;') are mangled if they are followed by a letter. The test script below, which contains perfectly valid HTML snippets, illustrates the problem.

While testing, I found that adding this option to the method "_html2xml":

    $tree->no_expand_entities(1);

seems to fix the problem, but I'm not at all sure that it won't trigger other problems or undesired behaviour.

It's also possible that the bug resides in HTML::TreeBuilder, but this I leave to you to decide.

Best wishes

Versions used:

    XML::Twig is up to date. (3.44)
    HTML::TreeBuilder is up to date. (5.03)

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use XML::Twig;
    use Test::More;
    plan tests => 2;

    my $parser = new XML::Twig ();
    my $value =<< 'EOF';
    <h1>Here&amp;there</h1>
    EOF

    my $html = $parser->safe_parse_html($value);
    print $@ if $@;
    ok($html);

    $value =<< 'EOF';
    <h1>Here &amp; there</h1>
    EOF

    $html = $parser->safe_parse_html($value);
    print $@ if $@;
    ok($html);

    __END__

--
Marco

From xmltwig [...] gmail.com Tue Jul 2 09:21:30 2013
Subject: Re: [rt.cpan.org #86633] safe_parse_html fails to parse valid HTML with entities (in some cases)
Date: Tue, 02 Jul 2013 13:21:11 +0000
To: bug-XML-Twig [...] rt.cpan.org
From: mirod <xmltwig [...] gmail.com>

It does look like an HTML::Element bug, because if you use HTML::Tidy as the HTML-to-XML converter the bug disappears:

    my $parser = new XML::Twig ( use_tidy => 1);

and voilà!

I will look into this though. I have never been really happy with HTML::Element's XML conversion, but since HTML::Tidy is a bit more of a pain to install (or was, at least), HTML::TreeBuilder is still the default option. Plus, changing this would likely cause back-compatibility problems.

--
michel

From melmothx [...] gmail.com Tue Jul 2 09:59:16 2013
Subject: Re: [rt.cpan.org #86633] safe_parse_html fails to parse valid HTML with entities (in some cases)
Date: Tue, 02 Jul 2013 15:57:47 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: Marco Pessotto <melmothx [...] gmail.com>

Using HTML::Tidy as the converter (use_tidy => 1) does indeed seem to fix the issue. Thanks a lot for the amazingly fast reply!

Best wishes

--
Marco

From melmothx [...] gmail.com Wed Jul 3 14:55:06 2013
Subject: Re: [rt.cpan.org #86633] safe_parse_html fails to parse valid HTML with entities (in some cases)
Date: Wed, 03 Jul 2013 20:53:35 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: Marco Pessotto <melmothx [...] gmail.com>

Actually, tidy has other issues and seems to eat (at least) the style attributes. I've updated the test script to look like this:

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use XML::Twig;
    use Test::More;
    plan tests => 6;

    my $tidy_parser = new XML::Twig ( use_tidy => 1);
    my $default_parser = new XML::Twig;

    my $value =<< 'EOF';
    <h1>Here&amp;there</h1>
    EOF

    my $html = $tidy_parser->safe_parse_html($value);
    print $@ if $@;
    ok($html, "tidy ok");

    $html = $default_parser->safe_parse_html($value);
    print $@ if $@;
    ok($html, "default ok");

    $value =<< 'EOF';
    <h1 style="display:none">Here &amp; there</h1>
    EOF

    $html = $default_parser->safe_parse_html($value);
    print $@ if $@;
    ok($html);

    $html = $tidy_parser->safe_parse_html($value);
    print $@ if $@;
    ok($html);

    $html = $tidy_parser->safe_parse_html($value);
    my @elts = $html->root()->get_xpath("//body");
    is($elts[0]->first_child->{att}->{style}, "display:none", "style found with tidy converter");

    $html = $default_parser->safe_parse_html($value);
    @elts = $html->root()->get_xpath("//body");
    is($elts[0]->first_child->{att}->{style}, "display:none", "style found with default converter");

    __END__

Best wishes

--
Marco

From melmothx [...] gmail.com Thu Jul 4 04:45:30 2013
Subject: Re: [rt.cpan.org #86633] safe_parse_html fails to parse valid HTML with entities (in some cases)
Date: Thu, 04 Jul 2013 10:43:59 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: Marco Pessotto <melmothx [...] gmail.com>

More info. This seems to happen only with the latest versions of Twig. Using the version provided by Debian stable and running the provided test script, I get:

    1..6
    ok 1 - tidy ok
    ok 2 - default ok
    ok 3
    ok 4
    not ok 5 - style found with tidy converter
    #   Failed test 'style found with tidy converter'
    #   at t/twig.t line 39.
    #          got: undef
    #     expected: 'display:none'
    ok 6 - style found with default converter

Then, after upgrading HTML::Tree from 5.02 to 5.03, the test still works. After upgrading Twig from 3.39 (the Debian version) to 3.44, it fails.

I've tried to diff the two versions, but I guess you're WAY more qualified than me to spot the bug.

Best wishes.

--
Marco

From melmothx [...] gmail.com Mon Jul 22 12:18:35 2013
Subject: Re: [rt.cpan.org #86633] safe_parse_html fails to parse valid HTML with entities (in some cases)
Date: Mon, 22 Jul 2013 18:18:15 +0200
To: bug-XML-Twig [...] rt.cpan.org
From: Marco Pessotto <melmothx [...] gmail.com>

Further investigation led to the routine _xml_escape, which is called by the local fork of HTML::Element::as_XML. _xml_escape doesn't escape entities that look "already escaped". But the parser is set to expand the entities, so we are working with just plain text here.

    sub _xml_escape
      { my( $html)= @_;
        $html =~ s{&(?!                     # An ampersand that isn't followed by...
                      ( \#[0-9]+; |         #   A hash mark, digits and semicolon, or
                        \#x[0-9a-fA-F]+; |  #   A hash mark, "x", hex digits and semicolon, or
                        [\w]+               #   A valid unicode entity name and semicolon
                      )
                    )
                  }
                  {&amp;}gx;                # Needs to be escaped to amp

In the first place, the regexp seems wrong on its 4th line (the [\w]+ alternative), because the semicolon seems to be missing. But if the parser returns us the string "&amp;", how can we know whether the source was "&amp;amp;" or just "&amp;" or "&"? Also, there is no guarantee that the unicode entity name is valid.

Then, if the parser returns the string "by Marco&company", which originated from the following legal string: "Marco&amp;company", we get an invalid entity left unescaped, on which the XML parser will crash badly.

To me, _xml_escape should look like this (IMVHO):

    sub _xml_escape {
        my( $html)= @_;
        # entities are already expanded in the treebuilder, so just escape them
        # simple character escapes
        # warn "escaping $html";
        $html =~ s/&/&amp;/g;
        $html =~ s/</&lt;/g;
        $html =~ s/>/&gt;/g;
        $html =~ s/"/&quot;/g;
        $html =~ s/'/&apos;/g;
        # warn "returning $html";
        return $html;
    }

Of course, this could break something else I'm not aware of (I'm unsure about the CDATA sections).

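To make the difference between the two approaches concrete, here is a minimal standalone sketch in plain Perl. It does not call XML::Twig or HTML::TreeBuilder at all. The function names escape_with_lookahead and escape_unconditionally are made up for this illustration, and the first regexp is only modelled on the excerpt quoted above, so it may not match the shipped code exactly.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper, modelled on the regexp quoted above: an ampersand is
# left alone when what follows looks like an entity reference. Note that the
# \w+ alternative does not require a trailing semicolon.
sub escape_with_lookahead {
    my ($text) = @_;
    $text =~ s{&(?!(?:\#[0-9]+;|\#x[0-9a-fA-F]+;|\w+))}{&amp;}g;
    return $text;
}

# Hypothetical helper following the proposed replacement: the text is assumed
# to be already entity-expanded, so every special character is escaped.
sub escape_unconditionally {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/'/&apos;/g;
    return $text;
}

# Text as it would look once "&amp;" has been expanded to "&".
for my $text ('Here&there', 'Here & there') {
    print "input:         $text\n";
    print "lookahead:     ", escape_with_lookahead($text), "\n";
    print "unconditional: ", escape_unconditionally($text), "\n\n";
}
```

With the lookahead version, "Here&there" comes back unchanged, so "&there" would reach the XML parser as a reference to an undefined entity, while "Here & there" is escaped correctly. The unconditional version escapes the ampersand in both inputs.
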
Test script:

    use strict;
    use warnings;
    use XML::Twig;
    use Test::More;
    use HTML::TreeBuilder;

    plan tests => 2;

    # emulate the tree builder with the same options
    my $tree= HTML::TreeBuilder->new;
    $tree->ignore_ignorable_whitespace( 0);
    $tree->ignore_unknown( 0);
    $tree->no_space_compacting( 1);
    $tree->store_comments( 1);
    $tree->store_pis(1);
    $tree->parse("<h1>Marco&amp;company</h1>");
    $tree->eof;

    my $tb = $tree->as_XML;

    is ($tb, "<html><head></head><body><h1>Marco&amp;company</h1></body></html>\n");

    diag "Expecting: $tb;";

    my $parser = XML::Twig->new();
    my $html = $parser->safe_parse_html("<h1>Marco&amp;company</h1>");
    diag $@ if $@;
    is($html->sprint . "\n", $tb,
       "treebuilder and twig yield the same (with trailing linefeed)");

    __END__

I hope this helps.

Best wishes

--
Marco Pessotto

