This queue is for tickets about the HTML-TableExtract CPAN distribution.

Report information
The Basics
Id: 27372
Status: open
Priority: 0/
Queue: HTML-TableExtract

People
Owner: Nobody in particular
Requestors: Marcin.Kasperski [...] mekk.waw.pl
Cc:
AdminCc:

Bug Information
Severity: Wishlist
Broken in: 1.10
Fixed in: (no value)



Subject: Access to row/cell attributes
It would be nice if TableExtract (when asked by some parameter) allowed one to access information present in the attributes of <tr> and <td> tags. Motivation? I am parsing a table from which I need to extract a URL from a construct similar to <tr onclick="window.location.href='<valuable url here>'">. The solution of my dreams? If I could define (alongside the normal columns) a pseudocolumn 'tr/onclick' and get there whatever is in the onclick attribute of the tr. A similar problem sometimes happens with <td>; there, too, I have faced cases where a valuable URL must be dug out of an attribute.
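To make the use case concrete: once the onclick value is available as a string, pulling the URL out of it could look roughly like the sketch below. The helper name url_from_onclick and the example URL are hypothetical, and the pattern assumes the single-quoted location.href form shown in the request.

```perl
use strict;
use warnings;

# Hypothetical attribute value, shaped like the onclick in the request.
my $onclick = "window.location.href='http://example.com/item?id=42'";

# Return whatever sits between the single quotes after location.href=,
# or undef when the value does not have that shape.
sub url_from_onclick {
    my ($value) = @_;
    return undef unless defined $value;
    return $value =~ /location\.href\s*=\s*'([^']*)'/ ? $1 : undef;
}

my $url = url_from_onclick($onclick);
print defined $url ? "$url\n" : "no URL found\n";
```

This only covers the string handling; getting at the attribute value in the first place is what the request is about.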
From: davidrw [...] cpan.org
On Fri Jun 01 13:35:06 2007, Mekk wrote:
> It would be nice if TableExtract (when asked by some parameter) allowed
> one to access information present in the attributes of <tr> and <td> tags.
Attached are a patch (including a POD update) and a test file. The patch is against HTML::TableExtract-2.10, and the test suite passes both before and after applying it (v5.6.1; Linux 2.4.21-32.0.1.EL i686 unknown).
Download cell_attribs.t
text/x-perl 2.1k
#!/usr/bin/perl

use strict;
use warnings;
use Test::More tests => 52;

use HTML::TableExtract;

my $te = HTML::TableExtract->new( );
my $html = do{ local $/ = undef; <DATA> };
ok($te->parse($html), "parse_file");
my @t = $te->tables;
is(@t, 2, "extract count");

{
  my $ts = $t[1];
  ok($ts, "===outer table===");
  is(join(',',$ts->coords),'0,0','coords');
  my @rows = $ts->rows;
  my $R = scalar @rows;
  is($R,5,'rows');
  my $C = scalar @{$rows[0]};
  is($C,3,'cols');
  foreach my $r ( 0 .. 3 ){
    is( $ts->cell_attr($r)->{foo}, "row$r", "($r) attribs" );
    foreach my $c ( 0 .. 2 ){
      is( $ts->cell($r,$c), "cell$r-$c", "($r,$c) contents" );
      is( $ts->cell_attr($r,$c)->{foo}, "cell$r,$c", "($r,$c) attribs" );
    }
  }
}

{
  my $ts = $t[0];
  ok($ts, "===inner table===");
  is(join(',',$ts->coords),'1,0','coords');
  my @rows = $ts->rows;
  my $R = scalar @rows;
  is($R,2,'rows');
  my $C = scalar @{$rows[0]};
  is($C,3,'cols');
  foreach my $r ( 0 .. 1 ){
    is( $ts->cell_attr($r)->{foo}, "t2row$r", "t2($r) attribs" );
    foreach my $c ( 0 .. 2 ){
      is( $ts->cell($r,$c), "t2cell$r-$c", "t2($r,$c) contents" );
      is( $ts->cell_attr($r,$c)->{foo}, "t2cell$r,$c", "t2($r,$c) attribs" );
    }
  }
}

__DATA__
<html>
<head><title>TableExtract Test HTML</title></head>
<body>
<table>
  <tr foo="row0">
    <th foo="cell0,0">cell0-0</th>
    <th foo="cell0,1">cell0-1</th>
    <th foo="cell0,2">cell0-2</th>
  </tr>
  <tr foo="row1">
    <td foo="cell1,0">cell1-0</td>
    <td foo="cell1,1">cell1-1</td>
    <td foo="cell1,2">cell1-2</td>
  </tr>
  <tr foo="row2">
    <td foo="cell2,0">cell2-0</td>
    <td foo="cell2,1">cell2-1</td>
    <td foo="cell2,2">cell2-2</td>
  </tr>
  <tr foo="row3">
    <td foo="cell3,0">cell3-0</td>
    <td foo="cell3,1">cell3-1</td>
    <td foo="cell3,2">cell3-2</td>
  </tr>
  <tr foo="row4">
    <td foo="cell4,0" colspan=3>
      <table>
        <tr foo="t2row0">
          <th foo="t2cell0,0">t2cell0-0</th>
          <th foo="t2cell0,1">t2cell0-1</th>
          <th foo="t2cell0,2">t2cell0-2</th>
        </tr>
        <tr foo="t2row1">
          <td foo="t2cell1,0">t2cell1-0</td>
          <td foo="t2cell1,1">t2cell1-1</td>
          <td foo="t2cell1,2">t2cell1-2</td>
        </tr>
      </table>
    </td>
  </tr>
</table>
</body>
</html>
Download cell_attribs.patch
text/x-diff 2.7k
*** ../HTML-TableExtract-2.10/lib/HTML/TableExtract.pm	Sat Jul 15 19:52:34 2006
--- lib/HTML/TableExtract.pm	Sat Jan 12 19:05:33 2008
***************
*** 125,135 ****
      my $skiptag = 0;
      if ($_[0] eq 'tr') {
        $ts->_enter_row;
        ++$skiptag;
      }
      elsif ($_[0] eq 'td' || $_[0] eq 'th') {
        $ts->_enter_cell(@_);
!       my %attrs = ref $_[1] ? %{$_[1]} : {};
        my $rspan = $attrs{rowspan} || 1;
        my $cspan = $attrs{colspan} || 1;
        $ts->_rasterizer->($ts->row_count, $rspan, $cspan);
--- 125,138 ----
      my $skiptag = 0;
      if ($_[0] eq 'tr') {
        $ts->_enter_row;
+       my %attrs = ref $_[1] ? %{$_[1]} : ();
+       $ts->{cell_attribs}->{ $ts->{rc} }->{tr} = \%attrs if scalar keys %attrs;
        ++$skiptag;
      }
      elsif ($_[0] eq 'td' || $_[0] eq 'th') {
        $ts->_enter_cell(@_);
!       my %attrs = ref $_[1] ? %{$_[1]} : ();
!       $ts->{cell_attribs}->{ $ts->{rc} }->{ $ts->{cc} } = \%attrs if scalar keys %attrs;
        my $rspan = $attrs{rowspan} || 1;
        my $cspan = $attrs{colspan} || 1;
        $ts->_rasterizer->($ts->row_count, $rspan, $cspan);
***************
*** 454,459 ****
--- 457,463 ----
      children => [],
      captured => 0,
      debug => 0,
+     cell_attribs => {},
    };
  
    $self->{_rastamon} = HTML::TableExtract::Rasterize->make_rasterizer();
***************
*** 740,746 ****
    }
    ++$self->{cc};
    ++$self->{in_cell};
!   my %attrs = ref $_[1] ? %{$_[1]} : {};
    my $rspan = $attrs{rowspan} || 1;
    my $cspan = $attrs{colspan} || 1;
  }
--- 744,750 ----
    }
    ++$self->{cc};
    ++$self->{in_cell};
!   my %attrs = ref $_[1] ? %{$_[1]} : ();
    my $rspan = $attrs{rowspan} || 1;
    my $cspan = $attrs{colspan} || 1;
  }
***************
*** 911,916 ****
--- 915,929 ----
    $self->_cell_to_content($row->[$c]);
  }
  
+ sub cell_attr {
+   my $self = shift;
+   my($r, $c) = @_;
+   $c = 'tr' unless defined $c;
+   return unless exists $self->{cell_attribs}->{$r};
+   return unless exists $self->{cell_attribs}->{$r}->{$c};
+   return $self->{cell_attribs}->{$r}->{$c};
+ }
+ 
  sub _cell_to_content {
    my $self = shift;
    @_ or croak "cell item required\n";
***************
*** 1691,1696 ****
--- 1704,1719 ----
  covered due to rowspan or colspan issues, in which case the content of the
  covering cell is returned rather than undef.
  
+ =item cell_attr($row,$col)
+ 
+ Return a hashref of HTML attributes for the TD/TH element.
+ Returns undef if no attributes.
+ 
+ =item cell_attr($row)
+ 
+ Return a hashref of HTML attributes for the TR element.
+ Returns undef if no attributes.
+ 
  =item depth()
  
  Return the depth at which this table was found.
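The storage layout and lookup rule the patch introduces can be tried without the module itself. The following is a minimal sketch, not the patched code: %cell_attribs stands in for $ts->{cell_attribs}, and lookup_attr is a hypothetical free function mirroring the body of cell_attr from the diff.

```perl
use strict;
use warnings;

# Stand-in for $ts->{cell_attribs} as the patch populates it:
# row index => { tr => \%row_attrs, column index => \%cell_attrs }.
# Elements without any attributes get no entry at all.
my %cell_attribs = (
    0 => {
        tr => { foo => 'row0' },
        0  => { foo => 'cell0,0' },
        1  => { foo => 'cell0,1' },
    },
);

# Same lookup rule as cell_attr in the patch: with no column given,
# fall back to the 'tr' key and return the row's attributes.
sub lookup_attr {
    my ($store, $r, $c) = @_;
    $c = 'tr' unless defined $c;
    return unless exists $store->{$r};
    return unless exists $store->{$r}{$c};
    return $store->{$r}{$c};
}

print lookup_attr(\%cell_attribs, 0)->{foo}, "\n";     # row attributes
print lookup_attr(\%cell_attribs, 0, 1)->{foo}, "\n";  # cell attributes

# A bare return yields undef in scalar context, so check before dereferencing.
my $missing = lookup_attr(\%cell_attribs, 5, 0);
print defined $missing ? "found\n" : "no attributes recorded\n";
```

Because the lookup uses a bare return, it gives undef in scalar context but an empty list in list context, so it is safer to assign the result to a scalar and test it with defined before dereferencing.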

