User talk:Ruakh

Dumps:

 
: Re: "I am trying to generate a list of all transclusions of {{temp|recons}}": In Perl, I would probably write something like <tt>perl -nwe 'BEGIN { $/ = "</page>\n" } next unless m/{{\s*recons\s*\|/; die unless m{<title>([^<]+)</title>}; my $title = $1; die unless m{<text xml:space="preserve">([^<]+)</text>} or m{<text xml:space="preserve" />()}; $_ = $1; print "$title\t$1\n" while m/({{\s*recons\s*\|.*)/g' &lt; enwiktionary-20121021-pages-articles.xml &gt; uses-of-recons.txt</tt> so I had a small working set, and then do whatever further analysis I wanted. —[[User:Ruakh|Ruakh]] 15:01, 31 October 2012 (UTC)
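
The one-liner above reads the dump as a stream of records ending in "</page>\n", skips any record with no {{recons| call, pulls out the page title and text, and prints the title next to each line that invokes the template. Below is a rough Python rendering of those same steps; it is a sketch written for this page, not code from the discussion, and it assumes the dump layout quoted in the command (one <text xml:space="preserve"> element per page).

<pre>
# Sketch: read the dump in "</page>\n"-delimited chunks, like the Perl
# one-liner's $/ = "</page>\n", and print every {{recons|...}} use per page.
import re

RECONS = re.compile(r'({{\s*recons\s*\|.*)')             # a {{recons|...}} call, up to end of line
TITLE = re.compile(r'<title>([^<]+)</title>')
TEXT = re.compile(r'<text xml:space="preserve">(.*?)</text>', re.S)

def pages(f, sep='</page>\n'):
    """Yield one <page>...</page> record at a time without loading the whole dump."""
    buf = ''
    for block in iter(lambda: f.read(1 << 20), ''):      # read 1 MiB at a time
        buf += block
        while sep in buf:
            record, buf = buf.split(sep, 1)
            yield record + sep
    if buf.strip():
        yield buf

with open('enwiktionary-20121021-pages-articles.xml', encoding='utf-8') as dump, \
     open('uses-of-recons.txt', 'w', encoding='utf-8') as out:
    for record in pages(dump):
        if not RECONS.search(record):                    # cheap pre-filter, like `next unless`
            continue
        title = TITLE.search(record)
        text = TEXT.search(record)
        if not title or not text:                        # the one-liner dies here instead
            continue
        for m in RECONS.finditer(text.group(1)):
            out.write('%s\t%s\n' % (title.group(1), m.group(1)))
</pre>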
 
:: I'm using the XmlReader.py script that is part of the PyWikipediabot package I use for all my bot work. It uses a streaming parser, so it can parse entries in the dump as the script requests them; Python has a feature called "yield" for this, which lets a function produce the elements of a list lazily as they are iterated over. It returns whole pages with the metadata already parsed, which is convenient. However, iterating through every page in the dump takes several minutes. I believe it uncompresses the file on the fly, so I could try uncompressing it myself first to see what happens. {{User:CodeCat/signature}} 15:16, 31 October 2012 (UTC)
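
The streaming idea described here can be illustrated with the standard library alone. The sketch below is not XmlReader.py itself; it uses xml.etree.ElementTree.iterparse inside a generator so that pages are produced one at a time, and the export namespace in it is a guess that should be checked against the dump's <mediawiki> tag.

<pre>
# Illustration of the yield-based streaming idea (not XmlReader.py itself):
# iterparse walks the dump incrementally and the generator yields one page
# at a time, so the whole file is never held in memory.
import bz2
import xml.etree.ElementTree as ET

NS = '{http://www.mediawiki.org/xml/export-0.7/}'        # namespace guess; check the dump's <mediawiki> tag

def stream_pages(path):
    """Yield (title, text) pairs lazily from a MediaWiki XML dump (.xml or .xml.bz2)."""
    opener = bz2.open if path.endswith('.bz2') else open
    with opener(path, 'rb') as f:
        for event, elem in ET.iterparse(f, events=('end',)):
            if elem.tag == NS + 'page':
                title = elem.findtext(NS + 'title')
                text = elem.findtext('%srevision/%stext' % (NS, NS)) or ''
                yield title, text
                elem.clear()                             # free the finished subtree

# Usage: the loop only pulls pages out of the dump as it asks for them.
# for title, text in stream_pages('enwiktionary-20121021-pages-articles.xml'):
#     ...
</pre>
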
::: I tried it and compared the times. Iterating through the uncompressed dump without any processing whatsoever takes 273 seconds; with the compressed dump it takes 480 seconds. So the difference is significant, but both are nowhere near the 30 seconds your script achieved, probably because XmlReader does a lot of extra parsing to include the metadata. Unfortunately I can't read Perl, so could you explain the steps your code above performs so that I can recreate them? (Note that I am not just looking for pages that transclude {{temp|recons}}; I also want to extract the parameters of each invocation, so that I can build a list of which reconstructed terms are being linked to, and from which pages.) {{User:CodeCat/signature}} 15:39, 31 October 2012 (UTC)
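
The extra step described in the note above, extracting the parameters of each invocation and recording which pages link to which reconstructed terms, might look something like the sketch below. The template parsing is deliberately naive (no nested templates, no <nowiki> handling), and treating the first positional parameter as the reconstructed term is an assumption, not something stated in the discussion.

<pre>
# Sketch: pull the parameters out of each {{recons|...}} call and build a map
# from reconstructed term to the set of pages that link to it.  Naive parsing:
# splits on '|' and stops at the first '}}', so nested templates will confuse it.
import re
from collections import defaultdict

INVOCATION = re.compile(r'{{\s*recons\s*\|(.*?)}}', re.S)

def recons_params(text):
    """Yield (positional, named) parameters of every {{recons|...}} call in a page's wikitext."""
    for m in INVOCATION.finditer(text):
        parts = [p.strip() for p in m.group(1).split('|')]
        positional = [p for p in parts if '=' not in p]
        named = dict(p.split('=', 1) for p in parts if '=' in p)
        yield positional, named

# Build term -> pages, reusing stream_pages() from the earlier sketch
# (the first positional parameter is assumed to be the reconstructed term).
links = defaultdict(set)
# for title, text in stream_pages('enwiktionary-20121021-pages-articles.xml'):
#     for positional, named in recons_params(text):
#         if positional:
#             links[positional[0]].add(title)
</pre>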
