
This queue is for tickets about the HTML-Parser CPAN distribution.

Report information
The Basics
Id: 83570
Status: open
Priority: 0/
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: ceklof [...] thanxmedia.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



From ceklof [...] thanxmedia.com Sat Feb 23 12:43:30 2013
MIME-Version: 1.0
X-Spam-Status: No, score=-2.402 tagged_above=-99.9 required=10 tests=[BAYES_50=0.8, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_HI=-5, SPF_SOFTFAIL=0.665, T_HTML_ATTACH=0.01, URI_HEX=1.122] autolearn=ham
X-Spam-Flag: NO
Content-Language: en-US
Content-Type: multipart/mixed; boundary="_005_20A53DFAC3F98343A91E7915018A9B2101B11BORD2MBX04Gmex05ml_"
Message-ID: <20A53DFAC3F98343A91E7915018A9B2101B11B [...] ORD2MBX04G.mex05.mlsrvr.com>
X-Virus-Scanned: Debian amavisd-new at bestpractical.com
X-Virus-Scanned: OK
X-MS-Tnef-Correlator:
X-Spam-Score: -2.402
Received: from localhost (localhost [127.0.0.1]) by hipster.bestpractical.com (Postfix) with ESMTP id 27CBF240656 for <cpan-bug+HTML-Parser [...] hipster.bestpractical.com>; Sat, 23 Feb 2013 12:43:30 -0500 (EST)
Received: from hipster.bestpractical.com ([127.0.0.1]) by localhost (hipster.bestpractical.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id iFssx-3lrNtM for <cpan-bug+HTML-Parser [...] hipster.bestpractical.com>; Sat, 23 Feb 2013 12:43:02 -0500 (EST)
Received: from la.mx.develooper.com (x1.develooper.com [207.171.7.70]) by hipster.bestpractical.com (Postfix) with SMTP id 390622404C0 for <bug-HTML-Parser [...] rt.cpan.org>; Sat, 23 Feb 2013 12:43:00 -0500 (EST)
Received: (qmail 24480 invoked by uid 103); 23 Feb 2013 17:43:00 -0000
Received: from x16.dev (10.0.100.26) by x1.dev with QMQP; 23 Feb 2013 17:43:00 -0000
Received: from smtp129.ord.emailsrvr.com (HELO smtp129.ord.emailsrvr.com) (173.203.6.129) by 16.mx.develooper.com (qpsmtpd/0.84/v0.84-167-g4ed6cab) with ESMTP; Sat, 23 Feb 2013 09:42:37 -0800
Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp21.relay.ord1a.emailsrvr.com (SMTP Server) with ESMTP id 23A2D300273 for <bug-HTML-Parser [...] rt.cpan.org>; Sat, 23 Feb 2013 12:42:34 -0500 (EST)
Received: from smtp192.mex05.mlsrvr.com (unknown [184.106.31.85]) by smtp21.relay.ord1a.emailsrvr.com (SMTP Server) with ESMTPS id D9705300161 for <bug-HTML-Parser [...] rt.cpan.org>; Sat, 23 Feb 2013 12:42:33 -0500 (EST)
Received: from ORD2MBX04G.mex05.mlsrvr.com ([fe80::b01b:20ff:fe52:4153]) by ORD2HUB06.mex05.mlsrvr.com ([fe80::20b1:196c:7e23:928%20]) with mapi id 14.02.0328.009; Sat, 23 Feb 2013 11:42:31 -0600
Delivered-To: cpan-bug+HTML-Parser [...] hipster.bestpractical.com
Subject: Incorrect tokenization in HTML::Parser
Return-Path: <ceklof [...] thanxmedia.com>
X-RT-Mail-Extension: html-parser
X-Original-To: cpan-bug+HTML-Parser [...] hipster.bestpractical.com
X-Spam-Check-BY: 16.mx.develooper.com
Thread-Index: Ac4R6q9yeOdqMGkwQGWvJMUmruBqlg==
Date: Sat, 23 Feb 2013 17:42:30 +0000
X-Spam-Level:
X-MS-Has-Attach: yes
Thread-Topic: Incorrect tokenization in HTML::Parser
X-Originating-Ip: [24.215.232.125]
Accept-Language: en-US
To: "bug-HTML-Parser [...] rt.cpan.org" <bug-HTML-Parser [...] rt.cpan.org>
From: Carl Eklof <ceklof [...] thanxmedia.com>
Content-Length: 0
Content-Type: multipart/alternative; boundary="_000_20A53DFAC3F98343A91E7915018A9B2101B11BORD2MBX04Gmex05ml_"
Content-Length: 0
content-type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
X-RT-Original-Encoding: us-ascii
Content-Length: 2589
Hi Gisle,

First, thank you for all of your huge contributions to Perl over the years!

I've discovered a site (http://www.scotts.com/) that has HTML that HTML-Parser does not tokenize correctly.

Envs (tried on two machines, same results):
* HTML::Parser (3.65 and 3.69)
* Perl 5.14.2, and 5.10.1
* 'full_uname' => 'Linux 449876-app3.blosm.com 2.6.18-238.37.1.el5 #1 SMP Fri Apr 6 13:47:10 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux',
* 'os_distro' => 'Red Hat Enterprise Linux Server release 5.9 (Tikanga) Kernel \\r on an \\m',
* 'full_uname' => 'Linux idx02 2.6.43.5-2.fc15.x86_64 #1 SMP Tue May 8 11:09:22 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux',
* 'os_distro' => 'Fedora release 15',

I'm attaching a representative page. The page came from: http://www.scotts.com/smg/templates/index.jsp?pageUrl=orthoLanding

The problem seems to occur around the HTML:

    <noscript>
    <iframe height="0" width="0" style="display:none; visibility:hidden;" src="//www.googletagmanager.com/ns.html?id=GTM-PVLS" />
    </noscript>
    <script>

I've added some debugging to the HTML::TokeParser::get_tag sub so it looks like:

    use Data::Dumper;

    sub get_tag
    {
        my $self = shift;
        my $token;
        while (1) {
            $token = $self->get_token || return undef;
            # Added for debugging: dump every token before it is filtered.
            warn "Checking token: [".Dumper($token)."]";
            my $type = shift @$token;
            next unless $type eq "S" || $type eq "E";
            substr($token->[0], 0, 0) = "/" if $type eq "E";
            return $token unless @_;
            for (@_) {
                return $token if $token->[0] eq $_;
            }
        }
    }

I've tried both version 3.65 and 3.69 of HTML::Parser, which both produce the same results. They produce the output in the "output" attachment. You can see on line 290 of the output that it is tokenizing almost the entire page after the iframe as one big text blob.

Thanks again,
-Carl

Carl Eklof
CTO @ Blosm Inc.
blosm.com <http://blosm.com/>
424.888.4BEE

Confidentiality Note: This e-mail message and any attachments to it are intended only for the named recipients and may contain confidential information. If you are not one of the intended recipients, please do not duplicate or forward this e-mail message and immediately delete it from your computer. By accepting and opening this email, recipient agrees to keep all information confidential and is not allowed to distribute to anyone outside their organization.
content-type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
X-RT-Original-Encoding: us-ascii
Content-Length: 18863

Message body is not shown because it is too large.

Content-Description: scottslandingpage.html
content-type: text/html; name="scottslandingpage.html"
content-disposition: attachment; creation-date="Sat, 23 Feb 2013 17:25:14 GMT"; filename="scottslandingpage.html"; modification-date="Sat, 23 Feb 2013 17:19:08 GMT"; size="73660"
Content-Transfer-Encoding: base64
X-RT-Original-Encoding: utf-8
Content-Length: 73660

Message body is not shown because sender requested not to inline it.

Content-Description: scottslandingpage_output.txt
content-type: text/plain; charset="utf-8"; name="scottslandingpage_output.txt"
content-disposition: attachment; creation-date="Sat, 23 Feb 2013 17:31:54 GMT"; filename="scottslandingpage_output.txt"; modification-date="Sat, 23 Feb 2013 17:31:46 GMT"; size="208618"
Content-Transfer-Encoding: base64
X-RT-Original-Encoding: ascii
Content-Length: 208618

Message body is not shown because sender requested not to inline it.

MIME-Version: 1.0
In-Reply-To: <20A53DFAC3F98343A91E7915018A9B2101B11B [...] ORD2MBX04G.mex05.mlsrvr.com>
X-Mailer: MIME-tools 5.504 (Entity 5.504)
Content-Disposition: inline
X-RT-Interface: Web
References: <20A53DFAC3F98343A91E7915018A9B2101B11B [...] ORD2MBX04G.mex05.mlsrvr.com>
Content-Type: text/plain; charset="utf-8"
Message-ID: <rt-4.0.18-11018-1420334617-0.83570-0-0 [...] rt.cpan.org>
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
X-RT-Encrypt: 0
X-RT-Sign: 0
Content-Length: 322
I've been seeing this with some code I'm working on too. To summarize this very simply, it seems like HTML::TokeParser does something weird when a tag contains a self-closing slash. If the tag is written as "<hr/>" then the parser thinks the tag is "hr/". If it's written as "<hr />" then we end up with a "/" attribute.
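
A minimal sketch for reproducing the two cases described above, assuming HTML::TokeParser with its default settings (no parser options changed). The helper name first_tag is made up for this example and is not part of the module.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return [tag name, sorted attribute names] for the
# first tag found in $html, using the parser's default settings.
sub first_tag {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html) or die "cannot create parser";
    my $tag = $parser->get_tag or return;
    my ($name, $attr) = @$tag;
    return [ $name, [ sort keys %$attr ] ];
}

# Print what the parser reports for each spelling of the tag.
for my $html ('<hr/>', '<hr />') {
    my $tag = first_tag($html);
    printf "%-8s tag=%s attrs=[%s]\n",
        $html, $tag->[0], join(',', @{ $tag->[1] });
}
```

Per the description above, the first spelling should be reported with the tag name "hr/" and the second as "hr" carrying a "/" attribute.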
MIME-Version: 1.0
In-Reply-To: <20A53DFAC3F98343A91E7915018A9B2101B11B [...] ORD2MBX04G.mex05.mlsrvr.com>
X-Mailer: MIME-tools 5.504 (Entity 5.504)
Content-Disposition: inline
X-RT-Interface: Web
References: <20A53DFAC3F98343A91E7915018A9B2101B11B [...] ORD2MBX04G.mex05.mlsrvr.com>
Content-Type: text/plain; charset="utf-8"
Message-ID: <rt-4.0.18-21224-1420387215-615.83570-0-0 [...] rt.cpan.org>
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
X-RT-Encrypt: 0
X-RT-Sign: 0
Content-Length: 244
I cloned the repo with the intention of fixing this, but when I looked through the test cases I realized that this behavior is actually tested for. Gisle, what's up with this? It's not documented, AFAICT, and it really doesn't make much sense.
MIME-Version: 1.0
In-Reply-To: <rt-4.0.18-21224-1420387215-615.83570-0-0 [...] rt.cpan.org>
X-Mailer: MIME-tools 5.504 (Entity 5.504)
Content-Disposition: inline
X-RT-Interface: Web
References: <20A53DFAC3F98343A91E7915018A9B2101B11B [...] ORD2MBX04G.mex05.mlsrvr.com> <rt-4.0.18-21224-1420387215-615.83570-0-0 [...] rt.cpan.org>
Content-Type: text/html; charset="utf-8"
Message-ID: <rt-4.0.18-2100-1453162360-487.83570-0-0 [...] rt.cpan.org>
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
X-RT-Encrypt: 0
X-RT-Sign: 0
Content-Length: 1702
On Sun Jan 04 11:00:15 2015, DROLSKY wrote:
> I cloned the repo with the intention of fixing this, but when I looked
> through the test cases I realized that this behavior is actually
> tested for.
>
> Gisle, what's up with this? It's not documented, AFAICT, and it really
> doesn't make much sense.

Perhaps it was just my understanding at the time of what status this syntax had, based on this advice from the XHTML spec:

C.2. Empty Elements

Include a space before the trailing / and > of empty elements, e.g. <br />, <hr /> and <img src="karen.jpg" alt="Karen" />. Also, use the minimized tag syntax for empty elements, e.g. <br />, as the alternative syntax <br></br> allowed by XML gives uncertain results in many existing user agents.

MIME-Version: 1.0
In-Reply-To: <rt-4.0.18-2100-1453162360-487.83570-0-0 [...] rt.cpan.org>
X-Mailer: MIME-tools 5.504 (Entity 5.504)
Content-Disposition: inline
X-RT-Interface: Web
References: <20A53DFAC3F98343A91E7915018A9B2101B11B [...] ORD2MBX04G.mex05.mlsrvr.com> <rt-4.0.18-21224-1420387215-615.83570-0-0 [...] rt.cpan.org> <rt-4.0.18-2100-1453162360-487.83570-0-0 [...] rt.cpan.org>
Content-Type: text/html; charset="utf-8"
Message-ID: <rt-4.0.18-12559-1453162894-1880.83570-0-0 [...] rt.cpan.org>
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
X-RT-Encrypt: 0
X-RT-Sign: 0
Content-Length: 116
http://www.w3.org/TR/html5/syntax.html#tag-name-state seems clear on allowing this, so feel free to change the tests
MIME-Version: 1.0
In-Reply-To: <20A53DFAC3F98343A91E7915018A9B2101B11B [...] ORD2MBX04G.mex05.mlsrvr.com>
X-Mailer: MIME-tools 5.504 (Entity 5.504)
Content-Disposition: inline
X-RT-Interface: Web
References: <20A53DFAC3F98343A91E7915018A9B2101B11B [...] ORD2MBX04G.mex05.mlsrvr.com>
Content-Type: text/html; charset="utf-8"
Message-ID: <rt-4.0.18-10656-1453224859-1699.83570-0-0 [...] rt.cpan.org>
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
X-RT-Encrypt: 0
X-RT-Sign: 0
Content-Length: 180
Just turning on the "empty_element_tags" option might make the parser behave the way you expect.  It might be that we should just switch the default for this option.
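
A minimal sketch of turning the option on, assuming it is set through the empty_element_tags method inherited from HTML::Parser before the first token is pulled. The helper name tag_names and the sample markup are made up for this example.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: collect the tag names seen in $html with
# empty_element_tags enabled.
sub tag_names {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html) or die "cannot create parser";
    # Enable recognition of "/>" before any token is requested.
    $parser->empty_element_tags(1);
    my @names;
    while (my $tag = $parser->get_tag) {
        push @names, $tag->[0];    # end tags are reported with a leading "/"
    }
    return @names;
}

print join(' ', tag_names('<p>a<hr/>b<hr />c</p>')), "\n";
```

With the option enabled, each empty element tag should be reported as a start tag followed by an artificial end tag, so both spellings of the hr tag come back as "hr" and "/hr" rather than "hr/" or a "/" attribute.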

