This queue is for tickets about the Spreadsheet-XLSX CPAN distribution.

Report information

The Basics
Id: 108901
Status: new
Priority: Low/Low

People
Owner: Nobody in particular
Requestors: richard [...] rjlewis.me.uk
Cc:
AdminCc:

BugTracker
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: Handling namespace prefixes in OpenDocument XML
Date: Fri, 13 Nov 2015 17:13:52 +0000
To: bug-Spreadsheet-XLSX@rt.cpan.org
From: Richard Lewis <richard@rjlewis.me.uk>

Hi there,

I've been trying to import some XLSX spreadsheets and found that Spreadsheet::XLSX (v0.15) couldn't find the worksheets in the file, and then later that it couldn't retrieve any of the cell values.

I ran my script in the debugger and stepped through the Spreadsheet::XLSX->_load_workbook subroutine, looking especially at the loop which begins:

    foreach ($member_workbook -> contents =~ /\<(.*?)\/?\>/g) {

The first line of this loop is a pattern match:

    /^(\w+)\s+/;

which matches the first word inside the tag, followed by everything else. Now, for tags such as:

    <sheet name="Sheet 3" sheetId="3" r:id="rId3" />

this works OK, because "sheet" will match. And then the:

    $tag eq 'sheet' or next;

test will pass.

However, in my XLSX file I found that the xl/workbook.xml member file was encoded with namespace prefixes for all the tags; I had, for example:

    <x:sheet name="Sheet 3" sheetId="3" r:id="rId3" />

where the x namespace was defined in the root node like this:

    <x:workbook [...] xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">

Consequently, the /^(\w+)\s+/ pattern did not match "x:sheet", and so none of the sheets in the workbook were found. Simply changing the pattern to:

    /^x:(\w+)\s+/

would fix the problem for my particular spreadsheet. But it's not a correct solution, as there's no requirement that the workbook.xml file use namespace prefixes, and there's also no requirement that any prefix must be called "x".

After experimenting with this, I went on to find that some (but not all!) of the xl/worksheets/sheet*.xml files used namespace prefixes, and so I had problems retrieving the cell values. I started trying some fix-ups in the region of:

    my $parsing_v_tag = 0;
    my $s = 0;
    my $s2 = 0;
    my $sty = 0;
    foreach ($member_sheet->contents =~ /(\<.*?\/?\>|.*?(?=\<))/g) {
        if (/^\<c\s*.*?\s*r=\"([A-Z])([A-Z]?)(\d+)\"/) {
            ($row, $col) = __decode_cell_name($1, $2, $3);

but eventually got too confused!
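
A minimal sketch of a prefix-tolerant tag match, offered only as an illustration of the regex re-working discussed here and not as the module's actual code. The helper name local_tag_name and the sample XML are hypothetical. It treats any "prefix:" before the tag name as optional, and it does not check that the prefix is actually bound to the SpreadsheetML namespace.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Spreadsheet::XLSX): return the local
# name of a tag body, ignoring an optional namespace prefix such as "x:".
# Returns undef for closing tags and anything else that does not start
# with a name.
sub local_tag_name {
    my ($tag_body) = @_;
    # Optional "prefix:", then the local name, which must be followed by
    # whitespace, a self-closing slash, or the end of the string.
    return $tag_body =~ m{^(?:[\w.-]+:)?([\w.-]+)(?=\s|/|\z)} ? $1 : undef;
}

# Made-up sample mixing prefixed and unprefixed sheet elements.
my $xml = <<'XML';
<x:workbook xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<x:sheets>
<x:sheet name="Sheet 3" sheetId="3" r:id="rId3" />
<sheet name="Sheet 4" sheetId="4" r:id="rId4" />
</x:sheets>
</x:workbook>
XML

my @sheets;
# Same tag-splitting regex as the loop quoted from _load_workbook.
foreach my $tag_body ($xml =~ /\<(.*?)\/?\>/g) {
    my $tag = local_tag_name($tag_body);
    next unless defined $tag && $tag eq 'sheet';
    my ($name) = $tag_body =~ /\bname="([^"]*)"/;
    push @sheets, $name;
}

print join(', ', @sheets), "\n";    # prints "Sheet 3, Sheet 4"
```

The same optional-prefix group would presumably be needed in the worksheet loop as well (for example on the pattern that matches the c element), since those files can mix prefixed and unprefixed tags.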

Of course, this comes about as a result of processing XML using text techniques (i.e. regexes). While I definitely see the performance advantages of this (over using a library to build a DOM, for example), we do have these drawbacks of having to account for all the possibilities of XML serialisation in the wild.

Any thoughts on how we might be able to get this fixed? I guess some careful re-working of the regexes might be sufficient. Or possibly re-writing to use an XML parser; maybe one with a SAX API?

Richard
-- 
http://web.rjlewis.me.uk/

