This queue is for tickets about the Parallel-ForkManager CPAN distribution.

Report information

The Basics
Id: 38724
Status: open
Priority: Low/Low

People
Owner: dlux [...] dlux.hu
Requestors: frederik [...] remote.org
Cc:
AdminCc:

BugTracker
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)

Subject: Parallel::ForkManager loops in wait_all_children
Date: Tue, 26 Aug 2008 11:03:38 +0200
To: bug-Parallel-ForkManager@rt.cpan.org
From: Frederik Ramm <frederik@remote.org>
Hi,

I have a problem under Linux where a rather complex script I wrote sometimes hangs (in a tight loop) when it runs wait_all_children. I cannot reproduce it with a test script; it only happens in production, and only sometimes!

I'm not doing anything strange, just instantiating a ForkManager, then every now and then doing a "start" and "finish". No callbacks, nothing.

Running strace on a hanging process reveals that it continuously calls "wait4", which returns an ECHILD error (no children to wait for).

I inspected the source and I believe it must somehow have missed a SIGCHLD, so that it thinks there are still child processes while in fact there aren't.

I will now try to fix it by changing wait_all_children thus:

    sub wait_all_children {
      my ($s)=@_;
      while (keys %{ $s->{processes} }) {
        $s->on_wait;
        $s->wait_one_child(defined $s->{on_wait_period} ? &WNOHANG : undef);
        if ($! == ECHILD) {
          delete $s->{processes};
          last;
        }
      };
    }

Of course this is a very brutal way to do it. It would be better not to miss the SIGCHLD in the first place, but at least I hope my program can continue this way.

Bye
Frederik
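The tight loop described above can be illustrated outside the module. What follows is only a minimal sketch under stated assumptions, not Parallel::ForkManager code: the helper name drain_children and the plain pid => identifier hash are hypothetical stand-ins for the module's internal tracking. It shows one way to stop spinning once waitpid reports ECHILD, emptying the hash in place rather than deleting it.

```perl
use strict;
use warnings;
use POSIX qw(ECHILD _exit);

# Hypothetical helper, not part of Parallel::ForkManager.
# $tracked is a hash reference of pid => identifier.
# Returns (number reaped, number abandoned).
sub drain_children {
    my ($tracked) = @_;
    my ($reaped, $abandoned) = (0, 0);
    while (keys %$tracked) {
        my $pid = waitpid(-1, 0);          # block until some child exits
        if ($pid > 0) {
            $reaped++ if exists $tracked->{$pid};
            delete $tracked->{$pid};
            next;
        }
        if ($pid == -1 && $! == ECHILD) {
            # The kernel says we have no children left, so the remaining
            # entries can never be reaped. Empty the hash and stop.
            $abandoned = keys %$tracked;
            %$tracked  = ();
            last;
        }
    }
    return ($reaped, $abandoned);
}

# Usage: two real children plus one entry that can never be reaped.
my %tracked;
for my $job (1, 2) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    _exit(0) if $pid == 0;                 # child: exit at once
    $tracked{$pid} = "job$job";
}
$tracked{999999999} = 'stale';             # assumed never to be a real child pid
my ($reaped, $abandoned) = drain_children(\%tracked);
print "reaped=$reaped abandoned=$abandoned\n";
```

Emptying the hash keeps the reference usable for later forks. The trade-off is that the abandoned entries get no exit status.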
Hi,

Thanks for letting me know. I currently don't really have time for that, but as soon as I have some, I'll check this...

Cheers,

Balázs
Hi,

I'm trying to find a solution which catches all signals, but the documentation has not made me any wiser.

Can you help me with this? I wonder whether the logic in wait_one_child is not perfect. I also wonder whether the NT waitpid implementation might work better on Linux, too.

Do you have time to test it?

Balázs

On Sun Aug 31 07:17:47 2008, DLUX wrote:
> Hi,
>
> Thanks for letting me know. I currently don't really have time for that,
> but as long as I'll have, I'll check this...
>
> Cheers,
>
> Balázs
Do you use the on_wait callback? It temporarily switches off the CHLD signal handling, which may be causing the problem. Could you test it?

Unfortunately I am not using this module any more, so I cannot really do that...

On Sat Nov 22 18:54:48 2008, DLUX wrote:
> On Sat Nov 22 18:46:57 2008, DLUX wrote:
> > Hi,
> >
> > I'm trying to find a solution which catches all signals, but I am not
> > smarter according to the documentation.
> >
> > Can you help me on this? I wonder maybe the logic in wait_one_child is
> > not perfect. I wonder maybe the NT waitpid implementation is better in
> > linux, too.
> >
> > Do you have time to test it?
> >
> > Balázs
On Tue Aug 26 05:05:51 2008, frederik@remote.org wrote:
> I have a problem under Linux where a rather complex script I did
> sometimes hangs (in a tight loop) when it runs wait_all_children.

I've just experienced the same, and after a bit of research I found another piece of code doing waitpid(2) calls, probably stealing some PIDs from Parallel::ForkManager.

I know that this configuration isn't supported by Parallel::ForkManager, but it would be nice if Parallel::ForkManager were more robust when this happens.

Frederik's solution would be one step. Wrapping _waitpid to scan for "missed" processes would be another step.

Either way, if you don't use the module anymore and don't have time to maintain it, I could offer to take over its maintenance.
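The idea of scanning for "missed" processes can be sketched as follows. This is an illustration under stated assumptions, not the module's _waitpid: the helper name scan_tracked and the pid => identifier hash are hypothetical. It probes each tracked pid individually with waitpid and WNOHANG, so a pid that some other code already reaped shows up as ECHILD for that pid specifically.

```perl
use strict;
use warnings;
use POSIX qw(WNOHANG ECHILD _exit);

# Hypothetical helper, not part of Parallel::ForkManager.
# Probes every tracked pid without blocking and returns a hash reference of
# pid => exit code for children reaped just now, or pid => undef for pids
# that are no longer our children (reaped elsewhere).
sub scan_tracked {
    my ($tracked) = @_;
    my %seen;
    for my $pid (keys %$tracked) {
        my $got = waitpid($pid, WNOHANG);
        if ($got == $pid) {
            $seen{$pid} = $? >> 8;         # we reaped it ourselves
        }
        elsif ($got == -1 && $! == ECHILD) {
            $seen{$pid} = undef;           # already reaped by other code
        }
        # $got == 0: still running, keep tracking it
    }
    delete @{$tracked}{ keys %seen };
    return \%seen;
}

# Usage: one child is "stolen" by a direct waitpid, one is still running.
my $stolen = fork();
die "fork failed: $!" unless defined $stolen;
_exit(3) if $stolen == 0;
my $running = fork();
die "fork failed: $!" unless defined $running;
if ($running == 0) { sleep 30; _exit(0) }

waitpid($stolen, 0);                       # the "foreign" reaper
my %tracked = ($stolen => 'a', $running => 'b');
my $seen = scan_tracked(\%tracked);
printf "lost: %d, still tracked: %d\n",
    scalar(grep { !defined $_ } values %$seen), scalar(keys %tracked);

kill 'TERM', $running;                     # clean up the sleeping child
waitpid($running, 0);
```

Because the probe asks about one pid at a time, it does not depend on SIGCHLD delivery. The exit status of a stolen child is still unrecoverable.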
Subject: Re: [rt.cpan.org #38724] Parallel::ForkManager loops in wait_all_children
Date: Thu, 18 Jun 2009 22:43:20 +0200
To: bug-Parallel-ForkManager@rt.cpan.org
From: Balázs Szabó <dlux@dlux.hu>
Hi Peter,

Good ideas!

I'm glad to hear that you are volunteering to maintain the module!

Please drop me a private email so that we can discuss the details of it!

Balázs

On Wed, Jun 17, 2009 at 4:47 PM, Peter Makholm via RT <bug-Parallel-ForkManager@rt.cpan.org> wrote:
> Either way, if you don't use the module anymore and don't have time to
> maintain it I could offer to take over maintenance of it.

--
Balázs Szabó (dLux)
www.dlux.hu

You are very curious
From: fbicknel@nc.rr.com
I haven't seen much activity here of late, but I think I've stumbled into the same situation: I can't figure out why, but sometimes pm will get in a situation where the child processes it should be tracking are gone, but it continues to think they are still there.

I fixed this in my own brute-force way by adding this to wait_one_child. I chose to put it here, as that seems to be the go-to method for waiting.

Anyway, my addition appears below (line 342 in the sample of code below). If I can find out what is causing the 'dropped' deletes, maybe I could attack the source of the problem rather than just fix it in this brute force way. I'll let you know if I can.

I also realize this may not work on other platforms; sorry I can't test it anywhere but Unix.

   332 sub wait_one_child { my ($s,$par)=@_;
   333   my $kid;
   334   while (1) {
   335     $kid = $s->_waitpid(-1,$par||=0);
   336     last if $kid == 0 || $kid == -1; # AS 5.6/Win32 returns negative PIDs
   337     redo if !exists $s->{processes}->{$kid};
   338     my $id = delete $s->{processes}->{$kid};
   339     $s->on_finish( $kid, $? >> 8 , $id, $? & 0x7f, $? & 0x80 ? 1 : 0);
   340     last;
   341   }
   342     # Make sure there are not 'package zombies', i.e. processes
   343     # that have exited, but are somehow still in the tracking hash
   344     for my $kid (keys %{$s->{'processes'}}) {
   345         unless (kill (0, $kid)) {
   346             delete $s->{'processes'}{$kid};
   347         }
   348     }
   349   $kid;
   350 };
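The kill(0, $kid) probe used above is worth looking at in isolation. The following is a small sketch, assuming a Unix-like system; the helper name pid_gone is hypothetical. It treats a pid as gone only when the kernel answers ESRCH, and it shows that a child which has exited but has not yet been reaped still counts as existing, so the probe only helps once something has already waited for that child.

```perl
use strict;
use warnings;
use POSIX qw(_exit);

# Hypothetical helper: returns 1 only when no process with this pid exists
# (errno ESRCH). A failure with EPERM means a process exists but belongs to
# another user, so it is not reported as gone.
sub pid_gone {
    my ($pid) = @_;
    return 0 if kill(0, $pid);
    return $!{ESRCH} ? 1 : 0;
}

my $pid = fork();
die "fork failed: $!" unless defined $pid;
_exit(0) if $pid == 0;                     # child: exit at once

# Running or zombie, the child still has a process-table entry here.
my $before = pid_gone($pid);

waitpid($pid, 0);                          # reap it

# After reaping, the pid no longer exists (assuming no immediate pid reuse).
my $after = pid_gone($pid);

print "before=$before after=$after\n";
```

One caveat with this approach is pid reuse: on a long-running system an unrelated process may later receive the same pid, in which case the probe reports the stale entry as alive.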
Subject: Re: [rt.cpan.org #38724] Parallel::ForkManager loops in wait_all_children
Date: Tue, 16 Feb 2010 02:06:11 +0000
To: bug-Parallel-ForkManager@rt.cpan.org
From: Balázs Szabó <dlux@dlux.hu>
Hi Frank,

Thanks for the investigation!

I accept patches if you have a good solution!

Balázs

On Mon, Feb 15, 2010 at 21:43, Frank Bicknell via RT <bug-Parallel-ForkManager@rt.cpan.org> wrote:
> If I can find out what is causing the 'dropped' deletes, maybe
> I could attack the source of the problem rather than just fix it in this
> brute force way.  I'll let you know if I can.

--
Balázs Szabó (dLux)
www.dlux.hu

You are very curious
Hi all,

Is it happening to you? I wonder what could cause this.

Frank's solution should work. I have only one thing to worry about: the return value of the child process. We have to call the on_finish callback with some return value.

Balázs

On Mon Feb 15 21:07:33 2010, DLUX wrote:
> Hi Frank,
>
> Thanks for the investigation!
>
> I accept patches if you have a good solution!
>
> Balázs
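One possible answer to the return-value concern is to invoke the callback with an undefined exit code, so the caller can tell "unknown" apart from a clean exit of 0. The sketch below is only an illustration of that option, not the module's behaviour: the helper name finish_lost and the sample pids and identifiers are made up, and the five-argument order simply mirrors the on_finish call quoted earlier in this ticket.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Parallel::ForkManager.
# Reports every remaining tracked pid as lost, passing undef for the exit
# code, signal and core-dump flag because none of them can be known.
sub finish_lost {
    my ($tracked, $on_finish) = @_;
    my @lost = sort { $a <=> $b } keys %$tracked;
    for my $pid (@lost) {
        my $ident = delete $tracked->{$pid};
        $on_finish->($pid, undef, $ident, undef, undef);
    }
    return scalar @lost;
}

# Usage with made-up pids and identifiers.
my %tracked = (4242 => 'resize-image', 4243 => 'send-mail');
my @log;
finish_lost(\%tracked, sub {
    my ($pid, $exit_code, $ident) = @_;
    push @log, sprintf '%s (pid %d): exit code %s',
        $ident, $pid, defined $exit_code ? $exit_code : 'unknown';
});
print "$_\n" for @log;
```

Existing callbacks that assume a numeric exit code would need a defined check, which is the cost of not inventing a status.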
This happens to me under a Starman/Plack/Dancer system on Linux. I couldn't figure out why, but after some time (or accesses) waitpid() starts returning -1 for all the children. The children themselves run fine, as I can check the output each child leaves under /tmp for data retrieval.

Attached is a patch based on Frank Bicknell's suggestion that calls "on_finish", so you can retrieve the data produced by each "lost" child.
Subject: Parallel-ForkManager.diff
--- ForkManager.pm.orig 2013-07-03 15:47:46.870631541 +0100
+++ ForkManager.pm 2013-07-03 15:47:14.138469231 +0100
@@ -557,6 +557,38 @@
     $s->on_finish( $kid, $? >> 8 , $id, $? & 0x7f, $? & 0x80 ? 1 : 0, $retrieved);
     last;
   }
+
+  # https://rt.cpan.org/Public/Bug/Display.html?id=38724
+  if ( $kid == -1 ) {
+
+    # Make sure there are not 'package zombies', i.e. processes
+    # that have exited, but are somehow still in the tracking hash
+
+    for my $kid (keys %{$s->{'processes'}}) {
+      unless (kill (0, $kid)) {
+
+        # retrieve child data structure, if any
+        my $retrieved = undef;
+        my $storable_tempfile = File::Spec->catfile($s->{tempdir}, 'Parallel-ForkManager-' . $$ . '-' . $kid . '.txt');
+        if (-e $storable_tempfile) {  # child has option of not storing anything, so we need to see if it did or not
+          $retrieved = eval { return &retrieve($storable_tempfile); };
+
+          # handle Storables errors
+          if (not $retrieved or $@) {
+            warn(qq|The storable module was unable to retrieve the child's data structure from the temporary file "$storable_tempfile": | . join(', ', $@));
+          }
+
+          # clean up after ourselves
+          unlink $storable_tempfile;
+        }
+
+        my $id = delete $s->{processes}->{$kid};
+
+        $s->on_finish( $kid, $? >> 8 , $id, $? & 0x7f, $? & 0x80 ? 1 : 0, $retrieved);
+      }
+    }
+  }
+
   $kid;
 };

