|Subject:||Patch: Attributes with invalid name omitted from XML output|
|Date:||Thu, 20 Oct 2011 12:47:30 +0200|
|To:||bug-HTML-Tree [...] rt.cpan.org|
|From:||Zsbán Ambrus <ambrus [...] math.bme.hu>|
Dear maintainers of HTML-Tree,

In HTML-Tree 4.2, if you call the as_XML method of an HTML::Element and the HTML contains attributes with invalid names, the method dies. I attach a patch that changes this behavior: instead of dying, the method omits those attributes from the output (so you get well-formed XML). A test case is included in the patch.

Back story. The current behavior was introduced in response to bug report #23439. However, I think that instead of dying it's better to produce some valid XML output. How the invalid attributes are represented in this output I don't really care.

I met this issue when I was trying to load some malformed HTML with XML::Twig (which uses HTML::TreeBuilder as its backend). These invalid attributes (resulting from missing quotes around the value in the HTML source) actually occur in a different part of the HTML than the part I want to extract data from. I could just use the strict_names option of HTML::Parser in this case, but that's not an ideal solution in the long term, as it turns the entire element into text, which is not how browsers interpret invalid attributes like this. Thus, I am submitting this patch to be able to parse such documents.

I am using HTML-Tree version 4.2 (this patch is based on that), HTML-Parser version 3.69, and perl 5.14.2 vanilla for x86_64-linux.

Ambrus
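
For illustration only, the proposed behavior (skip attributes with invalid names rather than die) can be sketched in a few lines of standalone Perl. This is not the attached patch and does not use HTML::Element internals. The helper names `escape_attr` and `start_tag_as_xml` are hypothetical, and the name check is an ASCII-only approximation of the XML Name production.

```perl
use strict;
use warnings;

# ASSUMPTION: ASCII-only approximation of an XML name; the real XML Name
# production also allows many non-ASCII characters.
my $XML_NAME = qr/\A[A-Za-z_:][A-Za-z0-9_:.\-]*\z/;

# Hypothetical helper: escape a value for use inside a double-quoted
# XML attribute.
sub escape_attr {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;
    $value =~ s/</&lt;/g;
    $value =~ s/>/&gt;/g;
    $value =~ s/"/&quot;/g;
    return $value;
}

# Hypothetical helper: build a start tag, silently omitting any attribute
# whose name is not a valid XML name instead of dying.
sub start_tag_as_xml {
    my ($tag, %attr) = @_;
    my $out = "<$tag";
    for my $name (sort keys %attr) {
        next unless $name =~ $XML_NAME;    # omit, do not die
        $out .= sprintf ' %s="%s"', $name, escape_attr($attr{$name});
    }
    return "$out>";
}

# An attribute name containing a stray quote, as can result from missing
# quotes around a value in the HTML source.
print start_tag_as_xml('a', href => 'x.html', 'b"ar' => 'b"ar', title => 'T & C'), "\n";
# prints: <a href="x.html" title="T &amp; C">
```

The invalid attribute is dropped and the rest of the tag is still well-formed, which is the outcome the report asks for; how (or whether) the dropped attribute should be represented elsewhere is left open, as in the report.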