This queue is for tickets about the XML-LibXML CPAN distribution.

Report information
The Basics
Id: 38666
Status: resolved
Worked: 1.5 hours (90 min)
Priority: 0/
Queue: XML-LibXML

People
Owner: phish [...] cpan.org
Requestors: dwheeler [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



CC: Eric Glover <eric [...] searchme.com>
Subject: URI Option Does Not Work
Date: Fri, 22 Aug 2008 11:24:22 -0700
To: bug-xml-libxml [...] rt.cpan.org
From: "David E. Wheeler" <dwheeler [...] cpan.org>
Howdy,

This prints an undef:

    #!/usr/local/bin/perl -w

    use strict;
    use warnings;
    use feature ':5.10';
    use XML::LibXML;

    my $html = '<html><body><p>foo</p></body></html>';

    my $parser = XML::LibXML->new;
    my $doc = $parser->parse_html_string($html, { URI => 'http://foo.com/' });
    say $doc->baseURI;

Shouldn't baseURI return 'http://foo.com/'? Or am I mis-reading the docs?

Thanks,

David
Subject: Re: [rt.cpan.org #38666] URI Option Does Not Work
Date: Sat, 23 Aug 2008 10:03:55 +0200
To: bug-XML-LibXML [...] rt.cpan.org
From: Christian Glahn <christian.glahn [...] lo-f.at>
Hi David,

This appears to be a documentation bug.

The synopsis suggests a hash reference passed to parse_*string() functions. However, if you look at the actual documentation you find that the function expects a string as the optional second parameter.

In this case the synopsis is wrong and the function description is correct. I tested it with your code and it works nicely.

Another remark: if you know that your input is XHTML (rather than HTML strict) I suggest that you use the normal parse_string() function instead of its html sibling.

Cheers
Christian
-- Christian Glahn <christian.glahn@lo-f.at>
Subject: Re: [rt.cpan.org #38666] URI Option Does Not Work
Date: Sat, 23 Aug 2008 06:48:25 -0700
To: bug-XML-LibXML [...] rt.cpan.org
From: "David E. Wheeler" <dwheeler [...] cpan.org>
On Aug 23, 2008, at 01:04, Christian Glahn via RT wrote:

> This appears to be a documentation bug.
>
> The synopsis suggests a hash reference passed to parse_*string()
> functions. However, if you look at the actual documentation you find
> that the function expects a string as the optional second parameter.
>
> In this case the synopsis is wrong and the function description is
> correct. I tested it with your code and it works nicely.

I just did this:

    my $html = '<html><body><p>foo</p></body></html>';

    my $parser = XML::LibXML->new;
    my $doc = $parser->parse_html_string($html, 'http://foo.com/');
    say $doc->baseURI;

And it still printed an undef.

> Another remark: if you know that your input is XHTML (rather than HTML
> strict) I suggest that you use the normal parse_string() function
> instead of its html sibling.

This is why I'm passing a hash. I'm parsing arbitrary Web pages that will have god knows what kind of HTML in them. So my code actually looks like this:

    my $parser = XML::LibXML->new;
    my $doc = $parser->parse_html_string($html, {
        suppress_errors   => 1, # Suppress errors
        suppress_warnings => 1, # Suppress warnings
        no_network        => 1, # Don't make network requests.
        recover           => 1, # Relaxed parsing for bad HTML.
        URI               => 'http://foo.com/',
    });
    say $doc->baseURI;

Which also, BTW, outputs undef. And so does this:

    my $doc = $parser->parse_html_string($html, 'http://foo.com/', {
        suppress_errors   => 1, # Suppress errors
        suppress_warnings => 1, # Suppress warnings
        no_network        => 1, # Don't make network requests.
        recover           => 1, # Relaxed parsing for bad HTML.
    });
    say $doc->baseURI;

IOW, there is no way I can see to properly set baseURI.

David
Subject: Re: [rt.cpan.org #38666] URI Option Does Not Work
Date: Sun, 24 Aug 2008 19:38:09 +0200
To: bug-XML-LibXML [...] rt.cpan.org
From: Christian Glahn <christian.glahn [...] lo-f.at>
Hi David,

I dived into the code and found two issues, and one of them explains your problem.

You use the baseURI function. baseURI() uses libxml2's xmlGetNodeBase() function, which determines the base URL for HTML documents from the base tag in the document's header. Your document has no header and no base tag. Hence, the result is correctly undef.

But there is good news for you: on the document node of your DOM tree, and ONLY for this node, you can call the URI function, which returns the internal URL that has been set by the parse function.

Therefore, in line 5, instead of saying $doc->baseURI; you should say $doc->URI;.

Cheers and thanks for the report
Christian
-- Christian Glahn <christian.glahn@lo-f.at>
The problem was that baseURI() works slightly differently for XML and for HTML documents. To access the URI that was set at parse time in a consistent way, one should call the URI() function on the document root.
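The difference between the two accessors can be checked with a short script. This is only a minimal sketch: it assumes an XML::LibXML release in which parse_html_string() accepts an options hash with a URI key, and it reuses the sample markup and the http://foo.com/ placeholder from the original report. The variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Sample markup from the report: it has no <head>, hence no <base> element.
my $html = '<html><body><p>foo</p></body></html>';

my $parser = XML::LibXML->new;
my $doc    = $parser->parse_html_string($html, { URI => 'http://foo.com/' });

# URI() on the document node reports the URI recorded at parse time.
my $uri = $doc->URI;
print "URI:     ", (defined $uri ? $uri : 'undef'), "\n";

# baseURI() is resolved by libxml2. For an HTML document it depends on a
# <base> element, so it may be undef for this input.
my $base = $doc->baseURI;
print "baseURI: ", (defined $base ? $base : 'undef'), "\n";
```

Printing both values side by side makes it easy to see which accessor reflects the URI passed to the parser on a given installation.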
CC: Eric Glover <eric [...] searchme.com>
Subject: Re: [rt.cpan.org #38666] URI Option Does Not Work
Date: Mon, 25 Aug 2008 15:53:42 -0700
To: bug-XML-LibXML [...] rt.cpan.org
From: "David E. Wheeler" <dwheeler [...] cpan.org>
On Aug 24, 2008, at 10:38, Christian Glahn via RT wrote:

> I dived into the code and found two issues, and one of them explains
> your problem.

Thank you, Christian.

> You use the baseURI function. baseURI() uses libxml2's
> xmlGetNodeBase() function, which determines the base URL for HTML
> documents from the base tag in the document's header. Your document
> has no header and no base tag. Hence, the result is correctly undef.

Ah, okay, that makes sense.

> But there is good news for you: on the document node of your DOM tree,
> and ONLY for this node, you can call the URI function, which returns
> the internal URL that has been set by the parse function.
>
> Therefore, in line 5, instead of saying $doc->baseURI; you should say
> $doc->URI;.

Great, this works:

    my $parser = XML::LibXML->new;
    my $doc = $parser->parse_html_string($html, { URI => 'http://foo.com/' });
    say $doc->URI;

Good, that's exactly what I need. Any chance of the docs being updated to reflect this?

Note that this does not, however (not that I care, but since it's what the docs seem to indicate):

    my $parser = XML::LibXML->new;
    my $doc = $parser->parse_html_string($html, 'http://foo.com/');
    say $doc->URI;

Best,

David
I believe the current documentation does not indicate that parse_html_string($html,$uri) should do something useful (unlike parse_html_string($html,{URI=>$uri}), which works as expected).

I have added documentation of $doc->URI, added a $doc->setURI method, and added documentation of $node->baseURI and $node->setBaseURI. The changes are in the SVN and will appear in 1.67 (to be released soon).

With this, I'm closing this ticket. Please do not reopen it, unless you want to complain about the changes made in SVN.

-- Petr
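For readers on 1.67 or later, the setURI method mentioned above could be used roughly as follows. This is a hedged sketch, not taken from the module's documentation: it assumes setURI() takes the new URI as its single argument, and the markup and URI are the placeholders used earlier in this ticket.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new;

# Parse without passing a URI option.
my $doc = $parser->parse_html_string('<html><body><p>foo</p></body></html>');

# Assign the document URI after parsing, then read it back with URI().
$doc->setURI('http://foo.com/');
my $uri_after = $doc->URI;
print "URI after setURI: $uri_after\n";
```

This may be convenient when the URI is only known after the document has been parsed.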

