This queue is for tickets about the AnyEvent-Feed CPAN distribution.

Report information
The Basics
Id:
69931
Status:
open
Priority:
Low/Low
Queue:

People
Owner:
Nobody in particular
Requestors:
MEETTYA [...] cpan.org
Cc:
AdminCc:

BugTracker
Severity:
Important
Broken in:
(no value)
Fixed in:
(no value)



Subject: Patch to _entry_to_hash to wiping HTML from content at digest calculation
Hi Robin! I try out AE-Feed and find its helpful and nice, but one thing must be improved. I talk about _entry_to_hash algo, it good, but "naive" for real world. For example, exist russian geek magazine - overclockers_ru, and RSS - http://www.overclockers.ru/rss/all.rss . So, this bad guy use HTML, embedded into item content, where contained counter or something like it, which generate different link. At the end I have some double for one item, because one item have different digest. I put one item dump, you may see it by yourself. This results gotten by triple _continuously_ re-loading feed. At the patch used HTML::Strip - fast&furious html wiper, writing with XS. It not so smart as HTML::Parser and have some bugs, but I suppose it acceptable for digest. Thanks at advance, Dmitry.
Subject: proof

Message body not shown because it is not plain text.

Subject: html_ignore.patch
--- Feed.pm 2011-07-31 23:38:14.000000000 +0400 +++ Feed.pm 2011-07-31 23:55:45.000000000 +0400 @@ -9,6 +9,7 @@ use AnyEvent::HTTP; use Digest::SHA1 qw/sha1_base64/; use Scalar::Util qw/weaken/; +use HTML::Strip; our $VERSION = '0.3'; @@ -124,6 +125,12 @@ which means that an entry hash is removed from the C<entry_ages> hash after it has not been seen in the feed for 2 fetches. +=item no_html_in_hash => $boolean + +This setting will enable "smart" mode of C<entry_ages> calculation. All html will +be ignored. May be useful if item contained dynamic html and it proceed doubles +at output. + =back =cut @@ -135,6 +142,11 @@ bless $self, $class; $self->{entry_ages} ||= {}; + + # add strip object if it needed + if (defined $self->{no_html_in_hash}){ + $self->{__hs} = HTML::Strip->new(); + } if (defined $self->{interval}) { unless (defined $self->{on_fetch}) { @@ -163,17 +175,30 @@ } -sub _entry_to_hash { - my ($entry) = @_; - my $x = sha1_base64 +sub _entry_to_hash{ + my ( $entry, $hs ) = ( @_ ); + # we are wipe all html if it need it, or return 'body' + my @data = map { + $_ && $_->body ? + ( $hs ? + do { + my $a = $hs->parse( $_->body ); + $hs->eof; + $a;} + : + $_->body ) + : '' + } ($entry->summary, $entry->content); + + my $x = sha1_base64 encode 'utf-8', - (my $a = join '/', + ( join '/', $entry->title, - ($entry->summary ? $entry->summary->body : ''), - ($entry->content ? $entry->content->body : ''), + @data, $entry->id, $entry->link); - $x + $x; + } sub _new_entries { @@ -189,7 +214,7 @@ $self->{entry_ages}->{$_}++ for keys %{$self->{entry_ages}}; for my $ent (@ents) { - my $hash = _entry_to_hash ($ent); + my $hash = _entry_to_hash ( $ent, $self->{__hs} ); unless (exists $self->{entry_ages}->{$hash}) { push @new, [$hash, $ent]; @@ -252,7 +277,7 @@ sub _get_headers { my ($self, %hdrs) = @_; - my %hdrs = %{$self->{headers} || {}}; + %hdrs = %{$self->{headers} || {}}; if (defined $self->{last_mod}) { $hdrs{'If-Modified-Since'} = $self->{last_mod};
Subject: Re: [rt.cpan.org #69931] Patch to _entry_to_hash to wiping HTML from content at digest calculation
Date: Thu, 4 Aug 2011 10:05:21 +0200
To: Карпич Дмитрий via RT <bug-AnyEvent-Feed@rt.cpan.org>
From: Robin Redeker <elmex@ta-sa.org>
Hi Dmitry! On Mon, Aug 01, 2011 at 05:03:05AM -0400, Карпич Дмитрий via RT wrote:
Show quoted text
> Mon Aug 01 05:03:04 2011: Request 69931 was acted upon. > Transaction: Ticket created by MEETTYA > Queue: AnyEvent-Feed > Subject: Patch to _entry_to_hash to wiping HTML from content at digest > calculation > Broken in: (no value) > Severity: Important > Owner: Nobody > Requestors: MEETTYA@cpan.org > Status: new > Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=69931 > > > > Hi Robin! > > I try out AE-Feed and find its helpful and nice, but one thing must be improved. > I talk about _entry_to_hash algo, it good, but "naive" for real world. > > For example, exist russian geek magazine - overclockers_ru, and RSS - > http://www.overclockers.ru/rss/all.rss . So, this bad guy use HTML, embedded into item > content, where contained counter or something like it, which generate different link. At the > end I have some double for one item, because one item have different digest.
Oh, that sounds indeed bad :)
Show quoted text
> At the patch used HTML::Strip - fast&furious html wiper, writing with XS. It not so smart as > HTML::Parser and have some bugs, but I suppose it acceptable for digest.
I'm a bit reluctant applying the patch right away. It's only that your patch comes in a completely different coding/formatting style (tabs instead of spaces, spaces around the parentheses different from other code). But it's great that it comes with a few lines of documentation! I'm a bit reluctant to use HTML::Strip, but the module is already using XML::Feed, which is a dependency beast on it's own too :) And HTML::Parser doesn't really feel like a good alternative too. So I shouldn't really care. The entry_to_hash method is quite bold relying so much on the contents of the post anyways. It would be awesome if you could fix the remaining issues and resubmit the patch! Greetings, Robin -- Robin Redeker | Deliantra, the free code+content MORPG elmex@ta-sa.org / r.redeker@gmail.com | http://www.deliantra.net http://www.ta-sa.org/ |
Hi, Robin! Very glad to see you answer so fast. On Thu Aug 04 04:03:02 2011, ELMEX wrote:
Show quoted text
> Hi Dmitry! > > On Mon, Aug 01, 2011 at 05:03:05AM -0400, Карпич Дмитрий via RT wrote:
> > Mon Aug 01 05:03:04 2011: Request 69931 was acted upon. > > Transaction: Ticket created by MEETTYA > > Queue: AnyEvent-Feed > > Subject: Patch to _entry_to_hash to wiping HTML from content at
> digest
> > calculation
<..>
Show quoted text
> It would be awesome if you could fix the remaining issues and resubmit > the patch! > >
Yap, I`m negligent in my life, and my code suffer from that too :) I fix all and check it twice, now it must be same like other part style.
Show quoted text
> Greetings, > Robin >
Greetings, Dmitry.
Subject: html_ignore2.patch
--- Feed.pm 2011-08-04 22:27:21.000000000 +0400 +++ Feed.pm 2011-08-04 22:40:36.000000000 +0400 @@ -9,6 +9,7 @@ use AnyEvent::HTTP; use Digest::SHA1 qw/sha1_base64/; use Scalar::Util qw/weaken/; +use HTML::Strip; our $VERSION = '0.3'; @@ -124,6 +125,12 @@ which means that an entry hash is removed from the C<entry_ages> hash after it has not been seen in the feed for 2 fetches. +=item no_html_in_hash => $boolean + +This setting will enable "smart" mode of C<entry_ages> calculation. All html will +be ignored. May be useful if item contained dynamic html and it proceed doubles +at output. + =back =cut @@ -136,6 +143,10 @@ $self->{entry_ages} ||= {}; + if (defined $self->{no_html_in_hash}){ + $self->{__hs} = HTML::Strip->new(); + } + if (defined $self->{interval}) { unless (defined $self->{on_fetch}) { croak "no 'on_fetch' callback given!"; @@ -164,13 +175,25 @@ sub _entry_to_hash { - my ($entry) = @_; + my ($entry, $hs) = @_; + # we are wipe all html if it need it, or return 'body' + my @data = map { + $_ && $_->body ? + ($hs ? + do { + my $a = $hs->parse($_->body); + $hs->eof; + $a;} + : + $_->body) + : '' + } ($entry->summary, $entry->content); + my $x = sha1_base64 encode 'utf-8', - (my $a = join '/', + (join '/', $entry->title, - ($entry->summary ? $entry->summary->body : ''), - ($entry->content ? $entry->content->body : ''), + @data, $entry->id, $entry->link); $x @@ -189,7 +212,7 @@ $self->{entry_ages}->{$_}++ for keys %{$self->{entry_ages}}; for my $ent (@ents) { - my $hash = _entry_to_hash ($ent); + my $hash = _entry_to_hash ($ent, $self->{__hs}); unless (exists $self->{entry_ages}->{$hash}) { push @new, [$hash, $ent]; @@ -252,7 +275,7 @@ sub _get_headers { my ($self, %hdrs) = @_; - my %hdrs = %{$self->{headers} || {}}; + %hdrs = %{$self->{headers} || {}}; if (defined $self->{last_mod}) { $hdrs{'If-Modified-Since'} = $self->{last_mod};


This service runs on Request Tracker, is sponsored by The Perl Foundation, and maintained by Best Practical Solutions.

Please report any issues with rt.cpan.org to rt-cpan-admin@bestpractical.com.