This queue is for tickets about the XML-Atom CPAN distribution.

Report information

The Basics
Id: 43212
Status: new
Priority: Low/Low
Queue: XML-Atom

People
Owner: Nobody in particular
Requestors: vargok [...] yahoo.com
Cc:
AdminCc:

BugTracker
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)


Subject: XML vs. [X]HTML parsing
Date: Wed, 11 Feb 2009 09:35:05 -0800 (PST)
To: bug-XML-Atom@rt.cpan.org
From: Kevin Vargo <vargok@yahoo.com>
Hi,

We're using v0.33 of XML::Atom, and noticed that XHTML fragments sometimes get downgraded to escaped <content type="text">. This appears to be the result of LibXML failing to parse the content because of &nbsp;, which is valid in XHTML but not in XML.

I note that LibXML has a parse_html_string mode that appears to do The Right Thing here, but I have not verified it in the code.

The relevant code seems to be in Content.pm, around where the eval { ... } and the check for LIBXML occur; $node comes back empty from the parse attempt. Replacing &nbsp; with &#160; lets the content parse as valid XHTML.

Basically, if $node comes back empty from the eval, I run the parse again via the HTML method, and the content then comes through as XHTML, apparently properly. Something along the lines of the following should work, once proper error handling has been added:

--- /usr/lib/perl5/site_perl/5.8.8/XML/Atom/Content.pm 2009-02-11 12:32:36.000000000 -0500
+++ /home/vargo/tmp/Content.pm-vargo 2010-02-11 12:32:58.000000000 -0500
@@ -63,6 +63,13 @@
                     if $xp;
             }
         };
+
+        if (! $node) {
+            my $parser = XML::LibXML->new;
+            my $tree = $parser->parse_html_string($copy);
+            $node = $tree->getDocumentElement;
+        }
+
         if (!$@ && $node) {
             $elem->appendChild($node);
             if ($content->version == 0.3) {

Thanks,
Kevin
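
The entity substitution mentioned in the report (swapping &nbsp; for &#160; before the content reaches the XML parser) could be sketched roughly as below. This is only an illustration of that workaround, not code from XML::Atom: the helper name numeric_nbsp and the sample fragment are hypothetical, and it uses core Perl only.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Atom): rewrite the HTML-only
# named entity &nbsp; as the numeric character reference &#160;, which
# an XML parser accepts without needing the XHTML DTD.
sub numeric_nbsp {
    my ($markup) = @_;
    # Work on a copy so the caller's string is left untouched.
    (my $copy = $markup) =~ s/&nbsp;/&#160;/g;
    return $copy;
}

# Sample fragment, assumed for illustration only.
my $fragment = '<div xmlns="http://www.w3.org/1999/xhtml">a&nbsp;b</div>';
print numeric_nbsp($fragment), "\n";
```

This handles only &nbsp;. Other HTML named entities (&copy;, &eacute;, and so on) would presumably hit the same parse failure and need similar treatment, which is one reason the HTML-parser fallback in the patch above may be the more general fix.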

