
This queue is for tickets about the MARC-Record CPAN distribution.

Report information
The Basics
Id: 70169
Status: open
Priority: 0/
Queue: MARC-Record

People
Owner: Nobody in particular
Requestors: m.e.phillips [...] durham.ac.uk
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



From: "PHILLIPS M.E." <m.e.phillips [...] durham.ac.uk>
To: <bug-MARC-Record [...] rt.cpan.org>
Date: Tue, 9 Aug 2011 15:12:24 +0100
Subject: MARC::File::USMARC gets tripped up if fields contain 0x1D
Message-ID: <1F5DB00D61AF1A479A6F8572FAC9ED80029121E6 [...] DURMAIL4.mds.ad.dur.ac.uk>

I have been using the MARC::Record Perl module to process some MARC records exported from Millennium. For some reason, a few records actually have the character 0x1D as part of field values, not just as an end of record marker. These can occur because Millennium extends the multi-byte character encoding of CJK to allow arbitrary 16-bit Unicode characters to appear. We mainly see this with directional quotes pasted into our records by cataloguers.

Anyhow, MARC::File::USMARC gets tripped up by this because in "sub next" the record is read by setting $/ to 0x1D and reading a "line" from the file:

    local $/ = END_OF_RECORD;
    my $usmarc = <$fh>;

I found that by replacing those two lines with the following I was able to overcome the problem:

    my $length;
    read($fh, $length, 5) || return;
    return unless $length>=5;
    my $record;
    read($fh, $record, $length-5) || return;
    my $usmarc = $length.$record;

This works by reading the first five bytes of the record, which signify the record length, and then reading the remaining number of bytes as stipulated by the record length.

Perhaps you might consider incorporating this change into the next version of MARC::File::USMARC?

Other than this minor niggle, I find the MARC::Record module to be a really powerful tool: great stuff!

Matthew

--
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941
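
A minimal sketch of the length-based read described in this message, for anyone who wants to try it outside the module. It assumes Leader/00-04 is trustworthy, and it additionally rejects short reads and non-numeric lengths, which the patch as posted does not do. The name `read_record_by_length` is hypothetical and not part of MARC::File::USMARC, and the two "records" are toy data rather than valid MARC.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MARC::File::USMARC): read one record
# using the five-digit record length held in Leader/00-04.
sub read_record_by_length {
    my ($fh) = @_;

    # First five bytes: the total record length as ASCII digits.
    my $got = read($fh, my $length, 5);
    return unless defined $got && $got == 5;
    return unless $length =~ /^\d{5}$/ && $length >= 5;

    # Remaining bytes: everything after the length, including any 0x1D
    # that happens to sit inside field data.
    my $want = $length - 5;
    $got = read($fh, my $rest, $want);
    return unless defined $got && $got == $want;

    return $length . $rest;
}

# Toy data only: the first "record" carries a 0x1D inside its data.
my $rec1 = "00012AB\x1DCDE\x1D";
my $rec2 = "00009XYZ\x1D";
my $data = $rec1 . $rec2;

open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
my @records;
while (defined(my $r = read_record_by_length($fh))) {
    push @records, $r;
}
close $fh;

printf "read %d records\n", scalar @records;
```

Reading with `$/` set to 0x1D would cut the first toy record in two at the embedded byte, which is the failure reported here.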

Message-ID: <rt-3.8.HEAD-22510-1312902051-633.70169-0-0 [...] rt.cpan.org>
In-Reply-To: <1F5DB00D61AF1A479A6F8572FAC9ED80029121E6 [...] DURMAIL4.mds.ad.dur.ac.uk>

Hi,

On Tue Aug 09 10:12:50 2011, m.e.phillips@durham.ac.uk wrote:
> I have been using the MARC::Record Perl module to process some MARC
> records exported from Millennium. For some reason, a few records
> actually have the character 0x1D as part of field values, not just as an
> end of record marker. These can occur because Millennium extends the
> multi-byte character encoding of CJK to allow arbitrary 16-bit Unicode
> characters to appear. We mainly see this with directional quotes pasted
> into our records by cataloguers.

Could you attach such a record for use as a test case? I also maintain MARC::Charset, so I'm also interested in the III character encoding extensions in general.

> This works by reading the first five bytes of the record, which signify
> the record length, and then reading the remaining number of bytes as
> stipulated by the record length.
>
> Perhaps you might consider incorporating this change into the next
> version of MARC::File::USMARC?

Yes, though there will need to be a switch controlling how MARC::File::USMARC slurps records, since unfortunately there are plenty of MARC records in the wild whose Leader/00-04 is not trustworthy but where splitting on \x1D and (loosely) parsing the record can be made to work.

> Other than this minor niggle, I find the MARC::Record module to be a
> really powerful tool: great stuff!

Thanks!
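
The kind of switch mentioned in this reply could look roughly like the sketch below. It is only an illustration: `next_raw_record`, `read_all` and the mode names `length` and `terminator` are made-up names, not existing MARC::File::USMARC options, and the sample data is toy data rather than valid MARC.

```perl
use strict;
use warnings;

use constant END_OF_RECORD => "\x1D";

# Hypothetical reader with a mode switch (names are assumptions):
#   'length'     - trust Leader/00-04 and read exactly that many bytes
#   'terminator' - ignore the length and split on 0x1D
sub next_raw_record {
    my ($fh, $mode) = @_;

    if ($mode eq 'length') {
        my $got = read($fh, my $length, 5);
        return unless defined $got && $got == 5
                   && $length =~ /^\d{5}$/ && $length >= 5;
        my $want = $length - 5;
        $got = read($fh, my $rest, $want);
        return unless defined $got && $got == $want;
        return $length . $rest;
    }

    # Terminator mode: read up to and including the next 0x1D.
    local $/ = END_OF_RECORD;
    my $chunk = <$fh>;
    return $chunk;
}

# Read every record from an in-memory string using the given mode.
sub read_all {
    my ($data, $mode) = @_;
    open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
    my @out;
    while (defined(my $r = next_raw_record($fh, $mode))) {
        push @out, $r;
    }
    close $fh;
    return @out;
}

my $embedded   = "00012AB\x1DCDE\x1D";   # correct length, 0x1D inside the data
my $bad_leader = "99999XYZ\x1D";         # wrong length, clean terminator

my @by_length     = read_all($embedded,   'length');      # kept whole
my @by_terminator = read_all($embedded,   'terminator');  # cut at the embedded 0x1D
my @loose         = read_all($bad_leader, 'terminator');  # still readable
my @strict        = read_all($bad_leader, 'length');      # nothing usable
```

Neither mode handles both kinds of bad record, which is why the choice would have to be left to the caller.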

From: "PHILLIPS M.E." <m.e.phillips [...] durham.ac.uk>
To: <bug-MARC-Record [...] rt.cpan.org>
Date: Wed, 10 Aug 2011 16:52:16 +0100
Subject: RE: [rt.cpan.org #70169] MARC::File::USMARC gets tripped up if fields contain 0x1D
Message-ID: <1F5DB00D61AF1A479A6F8572FAC9ED800291225A [...] DURMAIL4.mds.ad.dur.ac.uk>
In-Reply-To: <rt-3.8.HEAD-22510-1312902052-1514.70169-6-0 [...] rt.cpan.org>

> Could you attach such a record for use as a test case? I also maintain
> MARC::Charset, so I'm also interested in the III character encoding
> extensions in general.

I've attached a zip file, clergy.zip, which contains clergy.out, a file with a single unblocked MARC record output from Millennium. The record can be seen on our OPAC at http://library.dur.ac.uk/record=b2660297~S1

Rather than hunt for a record containing a 0x1d in the field data I have cheated by doctoring this record. The 0x1d appears as part of the closing double quotes round the words "Online Journal" in the 520 note field. Here is an excerpt using hexdump -C:

000008a0  28 42 4f 6e 6c 69 6e 65  20 4a 6f 75 72 6e 61 6c  |(BOnline Journal|
000008b0  1b 24 31 7f 20 1d 1b 28  42 20 63 6f 6e 74 61 69  |.$1. ..(B contai|

It appears that Millennium subverts the CJK character set in order to put 16-bit Unicode characters into the records. The sequence 1b 24 31 7f 20 1d 1b 28 42 equates to:

1b 24 31 = set G0 to CJK character set
7f 20 1d = invalid CJK code, made up of 7f followed by 20 1d (big-endian UTF-16 code)
1b 28 42 = set G0 to Basic Latin (ASCII)

> Yes, though there will need to be a switch controlling how
> MARC::File::USMARC slurps records, since unfortunately there are plenty
> of MARC records in the wild whose Leader/00-04 is not trustworthy but
> where splitting on \x1D and (loosely) parsing the record can be made to
> work.

Yes, I'd forgotten that problem, which I have met before! Another approach would be to check the record length by examining the directory, which has to be pretty accurate in order to parse the fields at all.

Incidentally, could I contact you via e-mail to ask one or two questions about MARC::Charset as I am a bit puzzled by the implementation in one or two places. Is your gmail address as shown on CPAN the best way?

Matthew

Attachment: clergy.zip (application/x-zip-compressed, 1.3k)
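
The byte breakdown in this message can be checked with a short sketch. It relies only on the interpretation given in the ticket (a three-byte group starting with 7f carries a big-endian UTF-16 code unit in its last two bytes), which is an assumption here and not a documented III specification. `decode_iii_triplets` is a hypothetical name and is not a MARC::Charset function.

```perl
use strict;
use warnings;

# Hypothetical helper: pull out the code points carried by 7F-prefixed
# three-byte groups inside ESC $ 1 ... ESC ( B runs. The 7F convention
# is taken from the description in this ticket, not from a spec.
sub decode_iii_triplets {
    my ($bytes) = @_;
    my @code_points;

    # Each run starts with ESC $ 1 (G0 = CJK) and ends with ESC ( B (G0 = ASCII).
    while ($bytes =~ /\x1B\$1(.*?)\x1B\(B/gs) {
        my $run = $1;

        # Characters in the CJK set occupy three bytes each.
        for (my $i = 0; $i + 3 <= length $run; $i += 3) {
            my ($lead, $hi, $lo) = unpack 'C3', substr($run, $i, 3);
            push @code_points, ($hi << 8) | $lo if $lead == 0x7F;
        }
    }
    return @code_points;
}

# Bytes from the hexdump excerpt around the words "Online Journal".
my $sample = "Online Journal\x1B\$1\x7F\x20\x1D\x1B(B contai";

my @cps = decode_iii_triplets($sample);
printf "U+%04X\n", $_ for @cps;   # prints U+201D
```

U+201D is RIGHT DOUBLE QUOTATION MARK, which matches the closing quote described above; its low byte is the 0x1D that the reader mistakes for an end-of-record marker.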

From: Galen Charlton <gmcharlt [...] gmail.com>
To: bug-MARC-Record [...] rt.cpan.org
Date: Wed, 10 Aug 2011 12:45:57 -0400
Subject: Re: [rt.cpan.org #70169] MARC::File::USMARC gets tripped up if fields contain 0x1D
Message-ID: <CALXNrD-nVhAThMXpt0WrJvOcJLh9yjV3mBWk6NKGey4kcDgzGw [...] mail.gmail.com>
In-Reply-To: <rt-3.8.HEAD-22516-1312991569-390.70169-5-0 [...] rt.cpan.org>

Hi,

On Wed, Aug 10, 2011 at 11:52 AM, PHILLIPS M.E. via RT <bug-MARC-Record@rt.cpan.org> wrote:
>       Queue: MARC-Record
>  Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=70169 >
>
> I've attached a zip file, clergy.zip, which contains clergy.out, a file with a single unblocked MARC record output from Millennium.  The record can be seen on our OPAC at http://library.dur.ac.uk/record=b2660297~S1

Thanks for supplying the example and the additional information regarding III's hack of MARC-8.

> Yes, I'd forgotten that problem, which I have met before!  Another approach would be to check the record length by examining the directory, which has to be pretty accurate in order to parse the fields at all.

You'd be surprised. I've run into cases where the length and offset values in the directory were completely wrong, but as long as the number of directory entries corresponds to the number of field terminator characters, I've been able to successfully parse such records. Might be worth adding a parsing mode to MARC::File::USMARC to support that, not that encouraging such sloppy MARC records is a good idea. :)

> Incidentally, could I contact you via e-mail to ask one or two questions about MARC::Charset as I am a bit puzzled by the implementation in one or two places.  Is your gmail address as shown on CPAN the best way?

Yes, it is.

Regards,

Galen
--
Galen Charlton
gmcharlt@gmail.com
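
The count-based check described in this reply might be sketched as follows. This is not how MARC::File::USMARC parses records; `directory_matches_fields` is a hypothetical name, the record is toy data with deliberately bogus lengths and offsets, and the check assumes no 0x1E bytes occur inside field data.

```perl
use strict;
use warnings;

use constant {
    LEADER_LEN          => 24,
    DIRECTORY_ENTRY_LEN => 12,
    END_OF_FIELD        => "\x1E",
    END_OF_RECORD       => "\x1D",
};

# Hypothetical check: ignore the lengths and offsets stored in the
# directory and only compare the number of directory entries with the
# number of field terminators that follow the directory.
sub directory_matches_fields {
    my ($raw) = @_;
    return 0 if length($raw) <= LEADER_LEN;

    # The directory runs from the end of the leader to the first 0x1E.
    my $dir_end = index($raw, END_OF_FIELD, LEADER_LEN);
    return 0 if $dir_end < 0;

    my $dir_len = $dir_end - LEADER_LEN;
    return 0 if $dir_len % DIRECTORY_ENTRY_LEN;
    my $entries = $dir_len / DIRECTORY_ENTRY_LEN;

    # Count the field terminators after the directory's own terminator.
    my $body   = substr($raw, $dir_end + 1);
    my $fields = () = $body =~ /\x1E/g;

    return $entries == $fields ? 1 : 0;
}

# Toy record: placeholder leader, two directory entries with bogus
# length/offset digits, two fields.
my $leader    = '0' x LEADER_LEN;
my $directory = '245' . '9999' . '99999'
              . '520' . '9999' . '99999' . END_OF_FIELD;
my $field_245 = "10\x1FaA title" . END_OF_FIELD;
my $field_520 = "  \x1FaA note"  . END_OF_FIELD;

my $good  = $leader . $directory . $field_245 . $field_520 . END_OF_RECORD;
my $short = $leader . $directory . $field_245 . END_OF_RECORD;

my $good_ok  = directory_matches_fields($good);    # counts agree
my $short_ok = directory_matches_fields($short);   # one field missing
```

When the counts agree, fields can be paired with directory tags in order; a stray 0x1E inside field data would defeat this check in the same way the stray 0x1D defeats record splitting.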

