This queue is for tickets about the HTML-Parser CPAN distribution.

Report information

The Basics
Id: 3166
Status: resolved
Priority: Low/Low
Queue: HTML-Parser

People
Owner: Nobody in particular
Requestors: siegmann [...] tinbergen.nl
Cc:
AdminCc:

BugTracker
Severity: Wishlist
Broken in: (no value)
Fixed in: (no value)

Subject: Make get_text accept multiple tokens to read up to
As it is now, get_text accepts only one end tag, i.e., $p->get_text( [$endtag] ). But what if I want to get the text up to either an <a>, <img> or <frame> token, for example? This is extremely useful when retrieving text from a page together with its surrounding links or images; see WWW::Mechanize. Please consider giving get_text the same arguments as get_tag, namely ([$tag, ...]). I (still) have a three-line patch lying around if you are interested; please consider it!
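To make the request concrete, here is a minimal sketch of the behaviour being asked for. It does not use HTML::TokeParser itself: text_until is a hypothetical helper that walks a hand-written token list in the shape get_token returns and collects text up to the first of several tags. Tags that are not in the list are skipped over, and end tags are matched with a leading slash.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::TokeParser: collect text from a list
# of tokens until the first start or end tag named in @endtags is seen.
# Tokens use the shape HTML::TokeParser documents for get_token:
#   ["S", $tag, ...]  start tag,  ["E", $tag, ...]  end tag,  ["T", $text]  text
sub text_until {
    my ($tokens, @endtags) = @_;
    my @text;
    for my $token (@$tokens) {
        my ($type, $value) = @$token;
        if ($type eq 'T') {
            push @text, $value;
        }
        elsif ($type eq 'S' || $type eq 'E') {
            # End tags are matched with a leading slash, e.g. "/a".
            my $tag = $type eq 'E' ? "/$value" : $value;
            # With no end tags given, stop at the first tag of any kind.
            last if !@endtags || grep { $_ eq $tag } @endtags;
        }
    }
    return join '', @text;
}

# Illustrative token stream for: See the <b>manual</b> or the <a>FAQ</a>
my @tokens = (
    ['T', 'See the '],
    ['S', 'b'],
    ['T', 'manual'],
    ['E', 'b'],
    ['T', ' or the '],
    ['S', 'a'],
    ['T', 'FAQ'],
    ['E', 'a'],
);

# Stops at the <a> start tag; the <b> tags are skipped over.
print text_until(\@tokens, 'a', 'img', 'frame'), "\n";
```

With a single tag, or with none at all, the helper behaves like the existing one-argument form, so the extra tags are purely additive.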
My mailbox is a mess. Can you post the patch you suggest here?
Subject: patch
From: siegmann@tinbergen.nl
[GAAS - Fri Oct 3 08:50:30 2003]:
> My mailbox is a mess. Can you post the patch you suggest here?
Here is the email again, patch attached (arjen.diff).

cheers, arjen

I have thought about the two ways of extending the get_text sub in HTML::TokeParser:

1. let it take an array (list) of arguments
2. let it take a reference to an array

I think that the array reference (2) could be useful for future use, but it breaks backward compatibility, doesn't it? Option 1 is a one-line patch (excluding documentation changes, which I've also done), and existing calls to get_text remain valid.

It would be great if you could consider the attached patch, which accomplishes this. Please let me know what you think.

--- TokeParser.pm Tue Apr 10 19:44:04 2001
+++ TokeParser_new.pm Sat Mar 15 19:07:38 2003
@@ -88,7 +88,7 @@
             } else {
                 $tag = "/$tag";
             }
-            if (!defined($endat) || $endat eq $tag) {
+            if (!defined($endat) || grep { $_ eq $tag } ($endat,@_) ) {
                 $self->unget_token($token);
                 last;
             }
@@ -200,13 +200,15 @@
 
   ["/$tag", $text]
 
-=item $p->get_text( [$endtag] )
+=item $p->get_text( [$endtag, ...] )
 
 This method returns all text found at the current position. It will
-return a zero length string if the next token is not text. The
-optional $endtag argument specifies that any text occurring before the
-given tag is to be returned. Any entities will be converted to their
-corresponding character.
+return a zero length string if the next token is not text. If
+one or more arguments are given, then we return any text occurring before the first of the specified tags found. For example:
+
+    $p->get_text("p", "br");
+
+will return the text up to either a paragraph of linebreak element. Any entities will be converted to their corresponding character.
 
 The $p->{textify} attribute is a hash that defines how certain tags
 can be treated as text. If the name of a start tag matches a key in this
@@ -225,7 +227,7 @@
 This means that <IMG> and <APPLET> tags are treated as text, and that
 the text to substitute can be found in the ALT attribute.
 
-=item $p->get_trimmed_text( [$endtag] )
+=item $p->get_trimmed_text( [$endtag, ...] )
 
 Same as $p->get_text above, but will collapse any sequences of white
 space to a single space character. Leading and trailing white space is
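
On the backward-compatibility question raised in the message, one possible approach (an assumption for illustration, not what the patch or HTML::TokeParser does) is to flatten the arguments so that a single tag, a list of tags, and an array reference are all accepted. The helper names normalize_endtags and ends_text below are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: flatten get_text-style arguments so that a single tag,
# a list of tags, and an array reference of tags are all accepted.
sub normalize_endtags {
    my @args = @_;
    return map { ref($_) eq 'ARRAY' ? @$_ : $_ } grep { defined } @args;
}

# Returns true when $tag should end the text run, mirroring the condition in
# the patch: no end tag given at all, or $tag is one of the requested tags.
sub ends_text {
    my ($tag, @args) = @_;
    my @endtags = normalize_endtags(@args);
    return 1 if !@endtags;
    return scalar grep { $_ eq $tag } @endtags;
}

printf "%s\n", ends_text('br', 'p', 'br')   ? 'stop' : 'continue';   # list form
printf "%s\n", ends_text('br', ['p', 'br']) ? 'stop' : 'continue';   # array reference form
printf "%s\n", ends_text('br', 'p')         ? 'stop' : 'continue';   # existing single-tag call
```

Because a plain string is passed through unchanged, existing calls such as get_text("p") would keep their meaning under this scheme; whether the extra flexibility is worth the added code is a separate question.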

