Subject: | two problems in treating shared string table |
Date: | Fri, 13 Feb 2009 11:24:28 +0900 |
To: | bug-Spreadsheet-XLSX@rt.cpan.org |
From: | okina@is.s.u-tokyo.ac.jp |
Hi,
While trying to load an Excel 2007 file, I encountered the two problems described below.
I am sending a patch that addresses both.
My environment is:
* Spreadsheet-XLSX-0.09
* Linux version 2.6.9-023stab040.1-enterprise (root@rhel4-32)
(gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)) #1 SMP
Mon Jan 15 22:56:55 MSK 2007
* perl, v5.8.5 built for i386-linux-thread-multi.
1)
The loaded content contains character entity references literally; they are not decoded.
2)
Because Japanese Excel files contain 'Phonetic Properties' items,
Spreadsheet::XLSX misaligns the indices of items in the shared string table.
Phonetic items represent pronunciation hints for some East Asian languages.
In the file 'xl/sharedStrings.xml', the phonetic properties appear as follows:
  <si>
    <t>(a Japanese text in KANJI)</t>
    <rPh sb="0" eb="1">
      <t>(its pronunciation in KATAKANA)</t>
    </rPh>
  </si>
The routine in Spreadsheet::XLSX::new(),
  foreach my $t ($mstr =~ /<t.*?>(.*?)<\/t/gsm)
then wrongly extracts the phonetic items as normal string items,
because it searches only for '<t>' tags.
This is not a special case. It can occur in many XLSX files
created by the Japanese version of Excel, because phonetic properties
are inserted automatically by Excel (and the IME).
* For details of the OOXML file formats, see:
http://www.ecma-international.org/publications/standards/Ecma-376.htm
The section '1st edition Part 4' contains the markup language reference.
According to that reference, only '<rPh>' tags can cause this problem.
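The misalignment described above can be reproduced in isolation. The following is a minimal sketch, not code from Spreadsheet::XLSX itself: the sample XML and its placeholder strings are assumptions modelled on the '<si>' structure shown earlier, and only the extraction regex is taken from XLSX.pm.

```perl
use strict;
use warnings;

# Hypothetical sample modelled on the <si> structure shown above;
# the strings are placeholders, not data from a real workbook.
my $mstr = join '',
    '<sst>',
    '<si><t>KANJI-1</t><rPh sb="0" eb="1"><t>KATAKANA-1</t></rPh></si>',
    '<si><t>plain</t></si>',
    '</sst>';

# Extraction as quoted from Spreadsheet::XLSX::new(): every <t> is collected,
# including the one nested inside <rPh>.
my @naive = ($mstr =~ /<t.*?>(.*?)<\/t/gsm);

# Same extraction after dropping the <rPh>...</rPh> runs first.
(my $stripped = $mstr) =~ s{<rPh.*?>.*?</rPh>}{}gsm;
my @fixed = ($stripped =~ /<t.*?>(.*?)<\/t/gsm);

print "naive: @naive\n";   # naive: KANJI-1 KATAKANA-1 plain
print "fixed: @fixed\n";   # fixed: KANJI-1 plain
```

In the naive result, index 1 holds the phonetic text instead of 'plain', so every later cell that refers to the shared string table by index receives the wrong string.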
I therefore wrote a simple patch that fixes both bugs.
Note that I think it is acceptable for your simple implementation
to ignore such phonetic items.
===============
--- XLSX.pm.orig 2009-01-26 16:02:19.000000000 +0900
+++ XLSX.pm 2009-02-13 01:52:19.000000000 +0900
@@ -12,6 +12,7 @@
use Spreadsheet::XLSX::Fmt2007;
use Data::Dumper;
use Spreadsheet::ParseExcel;
+use CGI;
################################################################################
@@ -31,9 +32,11 @@
my $mstr = $member_shared_strings->contents;
$mstr =~ s/<t\/>/<t><\/t>/gsm; # this handles an empty t tag in the xml <t/>
+ $mstr =~ s%<rPh.*?>(.*?)</rPh>%%gsm; # ignores phonetic properties
#foreach my $t ($member_shared_strings -> contents =~ /t\>([^\<]*)\<\/t/gsm) {
foreach my $t ($mstr =~ /<t.*?>(.*?)<\/t/gsm) {
+ $t = CGI::unescapeHTML($t);
$t = $converter -> convert ($t) if $converter;
push @shared_strings, $t;
===============
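To check the combined effect of both changes without installing anything, the patched logic can be sketched as a standalone script. This is an illustration under stated assumptions: 'decode_xml_text' is a hypothetical helper that stands in for CGI::unescapeHTML (it handles only the five predefined XML entities and numeric character references), and the sample XML is invented.

```perl
use strict;
use warnings;

# The five entities predefined by XML.
my %XML_ENTITY = (lt => '<', gt => '>', amp => '&', quot => '"', apos => "'");

# Hypothetical helper, not part of Spreadsheet::XLSX or the patch.
# It decodes in a single pass, so '&amp;lt;' becomes '&lt;' and not '<'.
sub decode_xml_text {
    my ($s) = @_;
    $s =~ s{&(?:#x([0-9A-Fa-f]+)|#([0-9]+)|(lt|gt|amp|quot|apos));}{
        defined($1) ? chr(hex $1)
      : defined($2) ? chr($2)
      :               $XML_ENTITY{$3}
    }ge;
    return $s;
}

# Invented sample: entity references plus one phonetic run.
my $mstr = '<si><t>R&amp;D &lt;2009&gt;</t><rPh sb="0" eb="1"><t>x</t></rPh></si>'
         . '<si><t>A&#38;B</t></si>';

# Step 1: ignore phonetic properties, as in the patch.
(my $clean = $mstr) =~ s{<rPh.*?>.*?</rPh>}{}gsm;

# Step 2: extract each <t> and decode its entity references.
my @shared_strings = map { decode_xml_text($_) } ($clean =~ /<t.*?>(.*?)<\/t/gsm);

print join('|', @shared_strings), "\n";   # R&D <2009>|A&B
```

A hand-rolled decoder avoids the extra 'use CGI' dependency, at the cost of leaving any other named entity untouched; whether that trade-off suits the module is a judgment for the maintainer.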
Regards,
//----
Kazumasa Kotani
e-mail: okina@is.s.u-tokyo.ac.jp