Report information
Id: 18965
Status: resolved
Left: 0 min
Priority: 0/0
Queue: HTML-Parser

Owner: Nobody
Requestors: code [...] yaakovnet.net
Cc:
AdminCc:

Severity: Normal
Broken in: 3.52
Fixed in: (no value)




History
#   Fri Apr 28 03:24:21 2006 guest - Ticket created  
Subject: <script/> leads to ignoring <script> events
[text/plain 3.8k]
First of all, thank you very much for integrating bug 18936 so quickly
into release 3.53.

This bug affects both the 3.52 and 3.53 releases (the input form simply
does not offer the new version number yet).

*** Problem ***

After declaring
$p->empty_element_tags(1); $p->ignore_elements("script","x");

the tag <x/> works correctly like <x></x>.
However, the tag <script/> confuses the parser: a following <script> tag
is ignored and left in the text event!

The attached test script runs a few sample strings through the parser
with the above settings and prints the text, tag and event values.
The first example demonstrates the bug. The following examples
demonstrate that the <x/> and <y/> tags work correctly according to the
documentation:


*** Tests with version 3.53 ***

www[...]kranich:~/111$ perl test.pl

================ Parse: <script/>A<script>B</script>C ================
'' start_document
'A<script>B' text
'C' text
'' end_document

================ Parse: <x/>A<x>B</x>C ================
'' start_document
'A' text
'C' text
'' end_document

================ Parse: <y/>A<y>B</y>C ================
'' start_document
'<y/>' <y> start
'' </y> end
'A' text
'<y>' <y> start
'B' text
'</y>' </y> end
'C' text
'' end_document

================ Parse: </x>A ================
'' start_document
'A' text
'' end_document


For reference, I also ran the same script with version 3.52. The two
bugs turn out to be unrelated: the output shows the effects of both
this bug and bug 18936:

www[...]kranich:~/111$ perl -Mblib=HTML-Parser-3.52/lib/ test.pl

================ Parse: <script/>A<script>B</script>C ================
'' start_document
'A<script>B' text
'' end_document

================ Parse: <x/>A<x>B</x>C ================
'' start_document
'A' text
'C' text
'' end_document

================ Parse: <y/>A<y>B</y>C ================
'' start_document
'<y/>' <y> start
'' </y> end
'A' text
'<y>' <y> start
'B' text
'</y>' </y> end
'C' text
'' end_document

================ Parse: </x>A ================
'' start_document
'' end_document

This time, I don't have a fix.

Best regards,

Yaakov Belch
Subject: test.pl

[text/x-perl 482b]
#!/usr/bin/perl -w
use HTML::Parser ();

my $p = HTML::Parser->new(api_version => 3);
$p->empty_element_tags(1);
$p->ignore_elements("script", "x");

# Print the text, the tag and the event type for every parser event.
$p->handler("default" => sub {
    my ($event, $text, $tag) = @_;
    $tag = $tag ? "<$tag>" : "";
    print "'$text'\t$tag\t$event\n";
}, "event,text,tag");

for my $text (
    '<script/>A<script>B</script>C',
    '<x/>A<x>B</x>C',
    '<y/>A<y>B</y>C',
    '</x>A'
) {
    print "\n================ Parse: $text ================\n";
    $p->parse($text)->eof;
}





#   Fri Apr 28 03:50:03 2006 GAAS - Correspondence added  
[text/plain 149b]
Good catch! The empty_element_tags feature interacts badly with literal
mode, but the fix was easy. See the attached patch. I'll upload 3.54
today :)

[text/x-patch 656b]
Index: hparser.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/hparser.c,v
retrieving revision 2.129
diff -u -p -r2.129 hparser.c
--- hparser.c 27 Apr 2006 11:44:00 -0000 2.129
+++ hparser.c 28 Apr 2006 07:47:37 -0000
@@ -1383,8 +1383,7 @@ parse_start(PSTATE* p_state, char *beg,
     report_event(p_state, E_START, beg, s, utf8, tokens, num_tokens, self);
     if (empty_tag)
         report_event(p_state, E_END, s, s, utf8, tokens, 1, self);
-
-    if (!p_state->xml_mode) {
+    else if (!p_state->xml_mode) {
         /* find out if this start tag should put us into literal_mode
          */
         int i;
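
The control-flow change in the patch can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the code in hparser.c: `struct toy_state`, `is_literal_element`, `start_tag_before` and `start_tag_after` are hypothetical names, and the list of literal elements is reduced to two entries for illustration. It shows that before the patch an empty element tag such as <script/> still went through the literal-mode check, while after the patch the check is skipped for empty tags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the parser state; not the real PSTATE. */
struct toy_state {
    bool xml_mode;
    const char *literal_mode; /* element we are waiting to close, or NULL */
};

/* Toy list of elements whose content is treated as literal text
 * (an assumption for this sketch; the real list lives in hparser.c). */
static bool is_literal_element(const char *tag)
{
    return strcmp(tag, "script") == 0 || strcmp(tag, "style") == 0;
}

/* Model of the flow before the patch: the literal-mode check runs
 * for every start tag, including an empty element tag like <script/>. */
static void start_tag_before(struct toy_state *st, const char *tag,
                             bool empty_tag)
{
    assert(st != NULL && tag != NULL);
    /* The start event (and, for an empty tag, the end event) would be
     * reported here; the flag does not influence the check below. */
    (void)empty_tag;
    if (!st->xml_mode && is_literal_element(tag))
        st->literal_mode = tag;
}

/* Model of the flow after the patch: "else if" skips the literal-mode
 * check when the element has already been closed by the empty tag. */
static void start_tag_after(struct toy_state *st, const char *tag,
                            bool empty_tag)
{
    assert(st != NULL && tag != NULL);
    if (empty_tag) {
        /* End event reported immediately; nothing left to treat as literal. */
    } else if (!st->xml_mode && is_literal_element(tag)) {
        st->literal_mode = tag;
    }
}
```

In this model, calling `start_tag_before(&st, "script", true)` on a zero-initialised state leaves `st.literal_mode` pointing at "script", so everything up to the next </script> would be treated as literal text. That is consistent with the 'A<script>B' text event in the report. The same call through `start_tag_after` leaves `st.literal_mode` as NULL.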

#   Fri Apr 28 03:50:04 2006 RT_System - Status changed from 'new' to 'open'  
#   Mon May 01 05:31:03 2006 GAAS - Fixed in 3.53 added  
#   Mon May 01 05:31:27 2006 GAAS - Fixed in 3.53 deleted  
#   Mon May 01 05:33:08 2006 GAAS - Status changed from 'open' to 'resolved'