
Preferred bug tracker

Please visit the preferred bug tracker to report your issue.

This queue is for tickets about the Spreadsheet-ParseExcel CPAN distribution.

Maintainer(s)' notes

If you are reporting a bug in Spreadsheet::ParseExcel, here are some pointers:

1) State the issues as clearly and as concisely as possible. A simple program or Excel test file (see below) will often explain the issue better than a lot of text.

2) Provide information on your system, version of perl and module versions. The following program will generate everything that is required. Put this information in your bug report.

    #!/usr/bin/perl -w

    print "\n    Perl version   : $]";
    print "\n    OS name        : $^O";
    print "\n    Module versions: (not all are required)\n";

    my @modules = qw(
                      Spreadsheet::ParseExcel
                      Scalar::Util
                      Unicode::Map
                      Spreadsheet::WriteExcel
                      Parse::RecDescent
                      File::Temp
                      OLE::Storage_Lite
                      IO::Stringy
                    );

    for my $module (@modules) {
        my $version;
        eval "require $module";

        if (not $@) {
            $version = $module->VERSION;
            $version = '(unknown)' if not defined $version;
        }
        else {
            $version = '(not installed)';
        }

        printf "%21s%-24s\t%s\n", "", $module, $version;
    }

    __END__

3) Upgrade to the latest version of Spreadsheet::ParseExcel (or at least test on a system with an upgraded version). The issue you are reporting may already have been fixed.

4) Create a small example program that demonstrates your problem. The program should be as small as possible. A few lines of code are worth tens of lines of text when trying to describe a bug.

5) Supply an Excel file that demonstrates the problem. This is very important. If the file is big, or contains confidential information, try to reduce it down to the smallest Excel file that represents the issue. If you don't wish to post a file here then send it to me directly: jmcnamara@cpan.org

6) Say if the test file was created by Excel, OpenOffice, Gnumeric or something else. Say which version of that application you used.

7) If you are submitting a patch you should check with the maintainer whether the issue has already been patched or if a fix is in the works. Patches should be accompanied by test cases.

Asking a question

If you would like to ask a more general question there is the Spreadsheet::ParseExcel Google Group.

Report information
The Basics
Id: 81737
Status: open
Priority: 0/
Queue: Spreadsheet-ParseExcel

People
Owner: Nobody in particular
Requestors: ovid [...] cpan.org
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: (no value)
Fixed in: (no value)



Subject: $cell->unformatted() does not handle UTF-8 correctly
Problem: $cell->value() correctly handles UTF-8 data but $cell->unformatted() does not.

Steps to reproduce:

1. Create a spreadsheet and in cell A1 enter the following text: "мой первый медиаплана" (without the quotes). Save it as utf8.xls

2. Read this spreadsheet with the following program:

    use 5.10.0;
    use warnings;
    binmode STDOUT, ':encoding(UTF-8)';    # or use utf8::all
    use Spreadsheet::ParseExcel;

    my $workbook   = Spreadsheet::ParseExcel->new->parse('utf8.xls');
    my @worksheets = $workbook->worksheets;
    my $cell       = $worksheets[0]->get_cell( 0, 0 );
    say "Value = ",       $cell->value();
    say "Unformatted = ", $cell->unformatted();

The output on my machine is as follows:

    Value = мой первый медиаплана
    Unformatted = <>9 ?5@2K9 <5480?;0=0

Extra information:

I have a workaround for this, but I've attached a test script and an Excel file which demonstrates the problem. The Excel file was created with LibreOffice Calc, but I've observed this behavior with spreadsheets created with Microsoft Excel.

Also:

    Perl version   : 5.012002
    OS name        : linux
    Module versions:
                         Spreadsheet::ParseExcel 0.59
                         Scalar::Util            1.23
                         Unicode::Map            0.112
                         Spreadsheet::WriteExcel 2.37
                         Parse::RecDescent       1.967006
                         File::Temp              0.22
                         OLE::Storage_Lite       0.19
                         IO::Stringy             2.110

Cheers,
Ovid
Subject: xls.pl
    use 5.10.0;
    use warnings;
    binmode STDOUT, ':encoding(UTF-8)';    # or use utf8::all
    use Spreadsheet::ParseExcel;

    my $workbook   = Spreadsheet::ParseExcel->new->parse('utf8.xls');
    my @worksheets = $workbook->worksheets;
    my $cell       = $worksheets[0]->get_cell( 0, 0 );
    say "Value = ",       $cell->value();
    say "Unformatted = ", $cell->unformatted();

    say "Perl version   : $]";
    say "OS name        : $^O";
    say "Module versions: (not all are required)\n";

    my @modules = qw(
                      Spreadsheet::ParseExcel
                      Scalar::Util
                      Unicode::Map
                      Spreadsheet::WriteExcel
                      Parse::RecDescent
                      File::Temp
                      OLE::Storage_Lite
                      IO::Stringy
                    );

    for my $module (@modules) {
        my $version;
        eval "require $module";

        if ( not $@ ) {
            $version = $module->VERSION;
            $version = '(unknown)' if not defined $version;
        }
        else {
            $version = '(not installed)';
        }

        printf "%21s%-24s\t%s\n", "", $module, $version;
    }
Subject: utf8.xls

Message body not shown because it is not plain text.

On Thu Dec 06 04:19:36 2012, OVID wrote:
> $cell->value() correctly handles UTF-8 data but $cell->unformatted()
> does not.

Hi Ovid,

Thanks for the detailed bug report.

This is expected behaviour (although clearly you didn't expect it). The unformatted function returns the raw data stored in Excel. It is used 99% of the time to get unformatted numeric data, but for strings it returns the raw byte stream. In your case that is most likely UTF-16LE, but there are also some other, rarer, far-east encodings that the original author was interested in.

I should probably update the docs on the unformatted method to explain the behaviour with strings.

If I've missed the issue here or if you have any other issues, let me know.

Regards,
John.
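
The garbled output in the report is consistent with this explanation. The sketch below is not from the thread; it assumes the raw string is UTF-16 and uses only the core Encode module. The helper names `visible_bytes` and `decode_utf16be` are hypothetical. Big-endian byte order is assumed for the decode step; the printable bytes come out the same for either byte order.

```perl
use strict;
use warnings;
use utf8;                          # the source below contains Cyrillic literals
use Encode qw(encode decode);

# Encode a character string as UTF-16BE and drop the control bytes,
# roughly what a terminal displays when the raw bytes are printed.
sub visible_bytes {
    my ($text) = @_;
    my $raw = encode( 'UTF-16BE', $text );
    ( my $visible = $raw ) =~ s/[\x00-\x1F]//g;
    return $visible;
}

# Turn raw UTF-16BE bytes back into a Perl character string.
# Assumption: the bytes really are UTF-16BE; adjust for other encodings.
sub decode_utf16be {
    my ($raw) = @_;
    return decode( 'UTF-16BE', $raw );
}

binmode STDOUT, ':encoding(UTF-8)';

my $text = 'мой первый медиаплана';
print "Visible bytes: ", visible_bytes($text), "\n";
print "Decoded again: ", decode_utf16be( encode( 'UTF-16BE', $text ) ), "\n";
```

Each Cyrillic letter occupies two bytes in UTF-16, one of which is the control byte 0x04, so only the other byte of each pair shows up on screen.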
This isn't really expected behaviour from the documentation, which says (in Spreadsheet::ParseExcel::Cell)

    In general Spreadsheet::ParseExcel will return all character strings in UTF-8 regardless of the encoding used by Excel.

Then the documentation for unformatted() says only that it "returns the cell value without a numeric format".

If it is really intended that unformatted() should return raw bytes, it would be better to call it unformatted_bytes() or something like that. It would also be useful to have an unformatted_chars() method which does what the documentation currently says: return the value of the cell without numeric formatting applied, as a character string in UTF-8.
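
As a rough sketch of what such a helper might look like outside the module: the function `unformatted_chars` below is hypothetical and not part of Spreadsheet::ParseExcel. It assumes that the cell's encoding() method returns 2 for UTF-16BE strings, as the Spreadsheet::ParseExcel::Cell documentation describes, and it leaves every other value untouched.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, not part of Spreadsheet::ParseExcel.
# Returns the unformatted cell value as a Perl character string.
sub unformatted_chars {
    my ($cell) = @_;

    my $raw = $cell->unformatted();
    return $raw if not defined $raw;

    # Assumption: encoding() == 2 means the raw bytes are UTF-16BE.
    my $code = $cell->encoding() || 0;
    return decode( 'UTF-16BE', $raw ) if $code == 2;

    # Numbers and single-byte strings are passed through unchanged.
    return $raw;
}
```

It would be called as `unformatted_chars( $worksheet->get_cell( 0, 0 ) )`. Printing the result still needs an output layer such as `binmode STDOUT, ':encoding(UTF-8)'`.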
On Mon Feb 11 06:33:03 2013, EDAVIS wrote:
> This isn't really expected behaviour from the documentation, which says
> (in Spreadsheet::ParseExcel::Cell)
>
> In general Spreadsheet::ParseExcel will return all character strings
> in UTF-8 regardless of the encoding used by Excel.
>
> Then the documentation for unformatted() says only that it "returns the
> cell value without a numeric format".
>
> If it is really intended that unformatted() should return raw bytes, it
> would be better to call it unformatted_bytes() or something like that.
> It would also be useful to have an unformatted_chars() method which
> does what the documentation currently says: return the value of the cell
> without numeric formatting applied, as a character string in UTF-8.

My $0.02 is that yes, unformatted() should have been called something else, perhaps unencoded() or raw() (and maybe I'll make an alias to that effect), but the original author probably thought of encoding as part of formatting (after all, the routine that does the conversion from raw bytes to encoded characters is called TextFmt, and it didn't even handle Unicode correctly until recently).

I, like many others, use this module just to scrape data, so I understand the hassle of having to go to value() for the text (although I often go to unformatted() for everything since I don't get much Unicode), to unformatted() for the numbers, and to ExcelFmt() for numbers that are dates (or depending on the unpredictable format you get from value()). It would be nice to have one method that gives you the encoded text, the unformatted number, and a date in a standard date format (e.g. YYYY-MM-DD HH:MM:SS.FFF, and maybe just YYYY-MM-DD for numbers without a decimal part). Since distinguishing between a number and a date is somewhat of a guess, we can expect to get that wrong in some corner case, but I think it should be okay most of the time.

I propose a new cell method data() for this, leaving everything else as is but improving the documentation as to what value() and unformatted() mean.
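
The date part of that proposal can be sketched in a few lines. This is an illustration only, not the proposed data() method: the function name `excel_serial_to_iso` is hypothetical, it assumes the workbook uses Excel's 1900 date system, and it prints whole seconds rather than the fractional .FFF part.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: convert an Excel serial date to an ISO-style string.
# Assumes the 1900 date system, in which serial 25569 is 1970-01-01.
sub excel_serial_to_iso {
    my ($serial) = @_;

    # Convert days since the Excel epoch to seconds since the Unix epoch,
    # rounded to the nearest second.
    my $seconds = sprintf '%.0f', ( $serial - 25569 ) * 86400;

    # Whole numbers are treated as dates, anything else as date and time.
    my $format = $serial == int($serial) ? '%Y-%m-%d' : '%Y-%m-%d %H:%M:%S';

    return strftime( $format, gmtime($seconds) );
}

print excel_serial_to_iso(41249),   "\n";    # 2012-12-06
print excel_serial_to_iso(41249.5), "\n";    # 2012-12-06 12:00:00
```

Deciding whether a numeric cell holds a date at all is the guesswork mentioned above; this sketch only covers the conversion once that decision has been made.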

