
This queue is for tickets about the Text-Corpus-VoiceOfAmerica CPAN distribution.

Report information
The Basics
Id: 89625
Status: new
Priority: 0/
Queue: Text-Corpus-VoiceOfAmerica

Owner: Nobody in particular
Requestors: billniebel [...]

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)

CC: jeff.kubina [...]
Subject: doesn't expect what VOA now sends: a sitemap *index* (and with individual sitemaps .xml.gz)
Date: Sat, 19 Oct 2013 13:29:36 -0400
To: bug-Text-Corpus-VoiceOfAmerica [...]
From: William Niebel <billniebel [...]>
Jeff-

I recognized that the Voice of America transcripts would help me do text analysis. I've used Perl for many years, so I was happy to find your Text::Corpus::VoiceOfAmerica, and I installed it this morning. It looks like it will save me lots of time. Thanks.

I didn't find a bug per se, but the Perl module seems to no longer work because of a change made at VOA. It looks like it expects a simple sitemap file from '' and in fact VOA still responds 200 with content, but now returns a sitemap *index* file instead. Simple test script output includes "no urls found via XML parsing, 14 found using regular expression." because the sitemap index uses the loc tag, but not within the sitemap nesting '/x:urlset/x:url/x:loc'.

Another complication: the several individual sitemap files, referenced by the VOA-returned sitemap index, are now all .xml.gz, not simply .xml.

I'll tarry a bit in case you jump on this, and can fashion a workaround if not.

Again, many thanks for your module. I'm looking forward to using it.

-Bill
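
For illustration only, the two complications described in this report (a sitemap *index* instead of a plain sitemap, and gzip-compressed child sitemaps) could be handled roughly as in the sketch below. It is written in Python rather than Perl, it is not the module's code, and it is not a proposed patch. The function names `maybe_gunzip` and `parse_sitemap` are hypothetical, and the sketch assumes the documents use the standard sitemaps.org 0.9 namespace, which the report does not confirm. Fetching over HTTP is deliberately left out.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace from the sitemaps.org protocol (assumed to be what VOA serves).
SITEMAP_NS = {"x": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def maybe_gunzip(data: bytes) -> bytes:
    """Decompress data if it starts with the gzip magic bytes, else return it unchanged."""
    if data[:2] == b"\x1f\x8b":
        return gzip.decompress(data)
    return data


def parse_sitemap(data: bytes):
    """Return (kind, locs) where kind is 'index' or 'urlset' (hypothetical helper).

    An index nests its locations as sitemapindex/sitemap/loc, while a plain
    sitemap nests them as urlset/url/loc, so the lookup path depends on the
    root element.
    """
    root = ET.fromstring(maybe_gunzip(data))
    # ElementTree reports namespaced tags as '{namespace}localname'.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "sitemapindex":
        kind, path = "index", "x:sitemap/x:loc"
    else:
        kind, path = "urlset", "x:url/x:loc"
    locs = [el.text.strip() for el in root.findall(path, SITEMAP_NS) if el.text]
    return kind, locs
```

A caller would fetch the top-level document, call `parse_sitemap` on the response body, and, when the result is an index, fetch each listed location and call `parse_sitemap` again on those bytes. Checking the gzip magic bytes rather than the `.xml.gz` suffix is a design choice made here so the same function works whether or not the HTTP client has already decompressed the body.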
