Subject: Follow link based on surrounding text
This is a great module, but I run into the following issue every time. Websites with electronic versions of (academic) journal articles, e.g., sciencedirect.com, generally present the item you want, followed by links to generic actions. My wish is therefore to be able to follow a link based on the *preceding material* rather than on one of the properties of the link itself. This is comparable to the request someone made for scraping Google News: suppose I want to read all the stories about Iraq, then I won't get far by examining the URL text/href on news.google.com.

And now for a concrete example. Suppose we have two entries for journal articles on one page:

On the theory of reference-dependent preferences, Pages 407-428
Alistair Munro and Robert Sugden
Abstract | Full Text + Links | PDF (149 K)

Melioration learning in games with constant and frequency-dependent pay-offs, Pages 429-448
Thomas Brenner and Ulrich Witt
Abstract | Full Text + Links | PDF (114 K)

In this case, doing $agent->follow('PDF') (or using a url_regex matching '.pdf') is not useful: you do not want to follow just any PDF link, but the PDF link that comes right after the correct page numbers are mentioned.

This problem can occur in several other applications, I imagine. For example, when screen scraping your inbox from webmail, each subject line offers 'reply', 'read', 'delete', etc., but the links to those actions cannot be told apart by their name (or URL) across the different emails.

I think this problem would ultimately be solved by a function like "follow_context(R1, R2)", which matches R1 against the visible text and then matches R2 against the links that follow the match of R1. R2 could also be allowed to be an integer (possibly negative), giving the link number counted from the match of R1. I am using my own patched version of WWW::Mechanize that does this and it works great.
Therefore, I would love to send in a patch, but I need to think of a non-dirty way of getting the text nodes from a page. Specifically, get_text in HTML::TokeParser only accepts one parameter, while we need it to collect the text until it meets an <a>, <iframe>, or <frame> tag. Any ideas on this?
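
To make the idea concrete, here is a rough sketch of one possible direction. It is not my actual patch. Instead of calling get_text, it walks the token stream with get_token and collects the text itself, stopping at each <a>, <iframe>, or <frame> start tag. The function name find_link_in_context, the link hash layout, and the URLs in the sample HTML are all made up for illustration. Two further assumptions: a regex R2 is matched against the link text only, and an integer R2 counts 1 as the first link after the match and -1 as the last link before it.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper, not part of WWW::Mechanize. Returns the href/src of
# the link chosen by $link_spec relative to the first text run (the text
# between two links) that matches $text_re, or nothing if there is none.
sub find_link_in_context {
    my ($html, $text_re, $link_spec) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "cannot parse HTML";

    my @links;          # every link seen so far, in document order
    my $buffer = '';    # text collected since the previous link
    my $anchor;         # index in @links of the first link after the match
    my $open;           # <a> element whose text is still being collected

    while (my $tok = $p->get_token) {
        my $type = $tok->[0];
        if ($type eq 'T') {
            # Text inside a link belongs to that link, not to the context.
            # Entities are left undecoded to keep the sketch short.
            if ($open) { $open->{text} .= $tok->[1] }
            else       { $buffer       .= $tok->[1] }
        }
        elsif ($type eq 'S' && $tok->[1] =~ /^(?:a|iframe|frame)$/) {
            my $attr = $tok->[2];
            my $url  = $tok->[1] eq 'a' ? $attr->{href} : $attr->{src};
            next unless defined $url;
            if (!defined($anchor) && $buffer =~ $text_re) {
                $anchor = scalar @links;
            }
            $buffer = '';
            push @links, { url => $url, text => '' };
            $open = $links[-1] if $tok->[1] eq 'a';
        }
        elsif ($type eq 'E' && $tok->[1] eq 'a') {
            undef $open;
        }
    }
    return unless defined $anchor;

    if (ref($link_spec) eq 'Regexp') {
        for my $i ($anchor .. $#links) {
            return $links[$i]{url} if $links[$i]{text} =~ $link_spec;
        }
        return;
    }

    # Integer form: 1 is the first link after the match, -1 the last before.
    return if $link_spec == 0;
    my $i = $link_spec > 0 ? $anchor + $link_spec - 1 : $anchor + $link_spec;
    return ($i >= 0 && $i <= $#links) ? $links[$i]{url} : undef;
}

# Sample page modelled on the two journal entries above (URLs are invented).
my $html = <<'HTML';
<p>On the theory of reference-dependent preferences, Pages 407-428
<a href="/a1">Abstract</a> | <a href="/f1">Full Text + Links</a> | <a href="/p1.pdf">PDF (149 K)</a></p>
<p>Melioration learning in games with constant and frequency-dependent pay-offs, Pages 429-448
<a href="/a2">Abstract</a> | <a href="/f2">Full Text + Links</a> | <a href="/p2.pdf">PDF (114 K)</a></p>
HTML

print find_link_in_context($html, qr/Pages 429-448/, qr/PDF/), "\n";   # /p2.pdf
print find_link_in_context($html, qr/Pages 407-428/, 1), "\n";         # /a1
```

The returned URL could then be handed to $agent->get. A real follow_context would presumably work on $agent->content and reuse the module's own link objects rather than plain hashes, so this is only meant to show that the text collection can be done without get_text.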