This queue is for tickets about the Template-Generate CPAN distribution.

Report information
The Basics
Id: 129481
Status: new
Priority: 0/
Queue: Template-Generate

People
Owner: Nobody in particular
Requestors: jacklangsdorf [...] gmail.com
Cc:
AdminCc:

Bug Information
Severity: (no value)
Broken in: (no value)
Fixed in: (no value)



Subject: code to handle [% ... %] in Template::Generate
Date: Tue, 7 May 2019 12:08:11 -0400
To: autrijus [...] autrijus.org, bug-Template-Generate [...] rt.cpan.org
From: Jack Langsdorf <jacklangsdorf [...] gmail.com>
Hi!

I wrote some code that gives simple but nontrivial generation of [% ... %] in Template::Generate.

My concept is that every fixed string of length > 1 in the template is potentially replaced with the combination of a prefix, a [% ... %], and a suffix. The prefix and suffix match for all cases.

The diff is attached.

Handling [% ... %] makes Template::Generate much more powerful when it is being used to build a web scraping template: you no longer need to work out all of the pieces of data that were used to generate the original page. Given a web page with a list of items with Template-style formatting, if you identify the data you want to grab from two of them, the script can find the common template (ignoring other junk in each listing), and then you can push that template back into Template::Extract to extract the data from the entire list. See the attached example file. (You do have to supply the strings that are associated with FOREACH and END manually.)

ALSO, I noticed that Generate sometimes seems to miss cases if one of the data items appears multiple times in the text, but the desired template needs to ignore one occurrence of the data item. In the attached generate_and_extract.pl, if you search for google rather than slashdot, it fails to find the template needed. (All of the suggested templates have [% rate %] before [% url %], because it picks up the unrelated A+ given to slashdot.) The case that we need is like the 0th case, but deleting everything before the first ".

0 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% comment %].[% ... %] [% rate %]'
1 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% rate %] [% ... %] [% comment %]'
2 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% comment %]'
3 '[% ... %] [% rate %] [% ... %]"[% url %]"[% ... %]>[% title %]<[% ... %] [% rate %] [% ... %] [% comment %].[% ... %] [% rate %]'

I will see if I can find a fix for that bug.

Also, I notice that it is pretty slow when I run on large documents, like 1000 lines of HTML code. I will poke around and see if maybe there is a faster way to implement it, perhaps using the index function rather than regex. So I may send you another note at some point.

- Jack Langsdorf
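Editorial illustration: the prefix / [% ... %] / suffix idea described in the message can be sketched roughly as follows. This is a language-neutral sketch in Python, not the attached Perl diff, and it rests on one reading of the description: the fixed text occupying the same slot in each example is reduced to its longest common prefix and longest common suffix, with a wildcard in between. The names `collapse`, `common_suffix`, and `WILDCARD` are hypothetical.

```python
import os

WILDCARD = "[% ... %]"  # the Template "match anything" placeholder


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_strings = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_strings)[::-1]


def collapse(segments):
    """Merge the fixed text found in the same slot of several examples.

    If every example has identical text, keep it as-is. Otherwise keep the
    shared prefix and shared suffix and put a wildcard where they differ.
    """
    segments = list(segments)
    if len(set(segments)) == 1:
        return segments[0]
    prefix = os.path.commonprefix(segments)
    # Strip the prefix before looking for a suffix so the two cannot overlap.
    remainders = [s[len(prefix):] for s in segments]
    suffix = common_suffix(remainders)
    return prefix + WILDCARD + suffix


# Two listings with different junk between the link and the rating.
fragment = collapse(["</a> (junk one) rated ", "</a> (other) rated "])
print(fragment)
```

The differing middle of the two segments is replaced by a single wildcard, so the resulting fragment matches both listings without requiring every piece of page data to be named.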
Attachments:
generate.diffs.txt (text/plain, 3.2k)
generate_and_extract.pl (text/x-perl-script)

Subject: Re: [rt.cpan.org #129481] AutoReply: code to handle [% ... %] in Template::Generate
Date: Tue, 14 May 2019 15:10:21 -0400
To: bug-Template-Generate [...] rt.cpan.org
From: Jack Langsdorf <jacklangsdorf [...] gmail.com>
Hello -

I rewrote the entire module, which I submit for your review. This version uses index/substr instead of regex. It handles [% ... %] correctly, and also correctly handles cases where one of the data items appears more than once.

Compared to my previous submission on this bug, which added the [% ... %] handling, this version is 400x faster on my large testcase (from 2.5 minutes to 0.33 seconds).

In the testcase, I supply data for two items in a list found on a webpage (xkcd_blag.htm), use Template::Generate to generate the template for those items, and then use Template::Extract to recover the full list. (During testing I renamed the module to Generate2.pm and Generate3.pm for comparison.)

- Jack Langsdorf
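Editorial illustration: the message above mentions switching from regex to index/substr and handling data items that occur more than once. The attached Generate.pm is not shown inline, so the following is only a sketch of the general technique, written in Python with `str.find` standing in for Perl's `index`. The helper names `find_all` and `candidate_positions`, and the sample page, are hypothetical.

```python
def find_all(text, needle):
    """Return every start offset of needle in text using plain substring search."""
    if not needle:
        raise ValueError("needle must be non-empty")
    positions = []
    start = text.find(needle)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are also found.
        start = text.find(needle, start + 1)
    return positions


def candidate_positions(text, record):
    """Map each field name to all offsets where its value occurs in text."""
    return {field: find_all(text, value) for field, value in record.items()}


# The rating "A+" appears twice: once for an unrelated entry, once for ours.
page = 'slashdot: A+ | <a href="http://google.com">Google</a> A+ search'
record = {"url": "http://google.com", "title": "Google", "rate": "A+"}
hits = candidate_positions(page, record)
for field, offsets in hits.items():
    print(field, offsets)
```

A field with several offsets, like `rate` here, is the situation described in the ticket: a generator has to consider each occurrence as a candidate, including the option of ignoring the first one, rather than always binding the field to the earliest match.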
Attachments:
xkcd_blag.htm (text/html, 85.4k)
Generate.pm (text/x-perl, 7.6k)
template_generate_big_example.pl (text/x-perl)


