This queue is for tickets about the Spreadsheet-XLSX CPAN distribution.

Report information
The Basics
Id:
43247
Status:
new
Priority:
Low/Low

People
Owner:
Nobody in particular
Requestors:
okina [...] is.s.u-tokyo.ac.jp
Cc:
AdminCc:

BugTracker
Severity:
(no value)
Broken in:
(no value)
Fixed in:
(no value)



Subject: two problems in treating shared string table
Date: Fri, 13 Feb 2009 11:24:28 +0900
To: bug-Spreadsheet-XLSX@rt.cpan.org
From: okina@is.s.u-tokyo.ac.jp
Hi,

While trying to load an Excel 2007 file, I encountered the two problems below, so I am sending you a patch.

My environment is:
* Spreadsheet-XLSX-0.09
* Linux version 2.6.9-023stab040.1-enterprise (root@rhel4-32) (gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)) #1 SMP Mon Jan 15 22:56:55 MSK 2007
* perl, v5.8.5 built for i386-linux-thread-multi

1) The loaded content includes character entity references literally.

2) Because Japanese Excel files contain 'Phonetic Properties' items, Spreadsheet::XLSX misaligns the indices of items in the shared string table.

Phonetic items represent pronunciation hints for some East Asian languages. In the file 'xl/sharedStrings.xml', the phonetic properties appear like:

    <si>
      <t>(a Japanese text in KANJI)</t>
      <rPh sb="0" eb="1">
        <t>(its pronunciation in KATAKANA)</t>
      </rPh>
    </si>

Then the routine in Spreadsheet::XLSX::new(),

    foreach my $t ($mstr =~ /<t.*?>(.*?)<\/t/gsm)

wrongly extracts the phonetic items as normal string items, because it searches only for '<t>' tags.

This problem is not a special case; it may occur in many XLSX files created by the Japanese version of Excel, because phonetic properties are inserted automatically by Excel (and the IME).

* For details of the OOXML file formats, see:
  http://www.ecma-international.org/publications/standards/Ecma-376.htm
  The section '1st edition Part 4' contains the markup language reference. According to the reference, this problem can be caused only by '<rPh>' tags.

Therefore, I wrote a simple patch to fix these bugs. Note that I think it is acceptable to ignore such phonetic items in your simple implementation.
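
The two problems described above can be reproduced without an XLSX file. The following is only a minimal sketch: the XML fragment is a hypothetical stand-in for 'xl/sharedStrings.xml' (with ASCII placeholders instead of real Japanese text), and the regular expression is the one quoted from Spreadsheet::XLSX::new().

```perl
use strict;
use warnings;

# Hypothetical sharedStrings.xml fragment: two <si> items, the first one
# carrying a phonetic run (<rPh>), the second one an entity reference.
my $mstr = join '',
    '<sst>',
    '<si><t>KANJI</t><rPh sb="0" eb="1"><t>KATAKANA</t></rPh></si>',
    '<si><t>A &amp; B</t></si>',
    '</sst>';

# Same pattern as the loop in Spreadsheet::XLSX::new() quoted above.
my @shared_strings = ($mstr =~ /<t.*?>(.*?)<\/t/gsm);

# Two <si> items yield three strings: the phonetic text takes index 1 and
# shifts every later item by one. The entity reference is left undecoded.
print scalar(@shared_strings), " strings extracted from 2 <si> items\n";
print "index $_: $shared_strings[$_]\n" for 0 .. $#shared_strings;
```

With this input, a cell that refers to shared string index 1 would receive the phonetic text instead of the second item.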
===============
--- XLSX.pm.orig	2009-01-26 16:02:19.000000000 +0900
+++ XLSX.pm	2009-02-13 01:52:19.000000000 +0900
@@ -12,6 +12,7 @@
 use Spreadsheet::XLSX::Fmt2007;
 use Data::Dumper;
 use Spreadsheet::ParseExcel;
+use CGI;
 
 ################################################################################
 
@@ -31,9 +32,11 @@
 
 		my $mstr = $member_shared_strings->contents;
 		$mstr =~ s/<t\/>/<t><\/t>/gsm; # this handles an empty t tag in the xml <t/>
+		$mstr =~ s%<rPh.*?>(.*?)</rPh>%%gsm; # ignores phonetic properties
 		#foreach my $t ($member_shared_strings -> contents =~ /t\>([^\<]*)\<\/t/gsm) {
 		foreach my $t ($mstr =~ /<t.*?>(.*?)<\/t/gsm) {
 
+			$t = CGI::unescapeHTML($t);
 			$t = $converter -> convert ($t) if $converter;
 
 			push @shared_strings, $t;
===============

Regards,

//----
Kazumasa Kotani
e-mail: okina@is.s.u-tokyo.ac.jp
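
The effect of the patch can be tried in isolation with a short script. This is a sketch under stated assumptions, not the module's actual code: extract_shared_strings and decode_xml_entities are hypothetical names, and decode_xml_entities is a small stand-in for CGI::unescapeHTML (it handles only the five predefined XML entities and numeric character references) so that the script runs without the CGI module. The two substitutions are the ones from the original code and from the patch.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Spreadsheet::XLSX): decodes the five
# predefined XML entities and numeric character references in one pass,
# so that "&amp;lt;" becomes "&lt;" rather than "<".
sub decode_xml_entities {
    my ($s) = @_;
    my %named = (lt => '<', gt => '>', quot => '"', apos => "'", amp => '&');
    $s =~ s/&(?:#x([0-9A-Fa-f]+)|#([0-9]+)|(lt|gt|quot|apos|amp));/
        defined $1 ? chr(hex $1) : defined $2 ? chr($2) : $named{$3}/ge;
    return $s;
}

# Hypothetical wrapper around the patched extraction logic.
sub extract_shared_strings {
    my ($mstr) = @_;
    $mstr =~ s/<t\/>/<t><\/t>/gsm;        # empty <t/> tag, as in the original code
    $mstr =~ s%<rPh.*?>(.*?)</rPh>%%gsm;  # drop phonetic runs, as in the patch
    return map { decode_xml_entities($_) } ($mstr =~ /<t.*?>(.*?)<\/t/gsm);
}

# Hypothetical input with ASCII placeholders instead of real Japanese text.
my @strings = extract_shared_strings(
    '<si><t>KANJI</t><rPh sb="0" eb="1"><t>KATAKANA</t></rPh></si>'
  . '<si><t>A &amp; B</t></si>'
);
print "index $_: $strings[$_]\n" for 0 .. $#strings;
```

Here the two <si> items produce two strings, so the indices line up with the shared string table again, and the second string is decoded to "A & B".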

