: Re: "I am trying to generate a list of all transclusions of {{temp|recons}}": In Perl, I would probably write something like <tt>perl -nwe 'BEGIN { $/ = "</page>\n" } next unless m/{{\s*recons\s*\|/; die unless m{<title>([^<]+)</title>}; my $title = $1; die unless m{<text xml:space="preserve">([^<]+)</text>} or m{<text xml:space="preserve" />()}; $_ = $1; print "$title\t$1\n" while m/({{\s*recons\s*\|.*)/g' < enwiktionary-20121021-pages-articles.xml > uses-of-recons.txt</tt> so I had a small working set, and then do whatever further analysis I wanted. —[[User:Ruakh|Ruakh]] 15:01, 31 October 2012 (UTC)
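For those who read Python rather than Perl, the same page-at-a-time approach (read one <tt><page></tt> chunk at a time, skip chunks that never mention the template, then print the page title next to every line that invokes it) might be sketched roughly as follows. The <tt>pages()</tt> helper and the regular expressions are purely illustrative assumptions, not part of PyWikipediabot or of the one-liner above:
<pre>
# Rough sketch of the same page-at-a-time filtering; the regexes are
# simplified and this is not a real XML parser.
import io
import re

RECONS = re.compile(r'\{\{\s*recons\s*\|.*')
TITLE = re.compile(r'<title>([^<]+)</title>')
TEXT = re.compile(r'<text xml:space="preserve"\s*(?:/>|>([^<]*)</text>)')

def pages(path, sep='</page>\n'):
    """Yield one <page>...</page> chunk of the uncompressed dump at a time."""
    buf = ''
    with io.open(path, encoding='utf-8') as f:  # io.open behaves the same on Python 2 and 3
        for block in iter(lambda: f.read(1 << 20), ''):
            buf += block
            while sep in buf:
                chunk, buf = buf.split(sep, 1)
                yield chunk

with io.open('uses-of-recons.txt', 'w', encoding='utf-8') as out:
    for page in pages('enwiktionary-20121021-pages-articles.xml'):
        if not RECONS.search(page):           # cheap pre-filter, like the Perl "next unless"
            continue
        title = TITLE.search(page).group(1)   # fails loudly if <title> is missing, like the Perl "die"
        text = TEXT.search(page).group(1) or ''
        for m in RECONS.finditer(text):
            out.write('%s\t%s\n' % (title, m.group(0)))
</pre>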
:: I'm using the XmlReader.py script that is part of the PyWikipediabot package that I use for all bot work. It uses a streaming parser, so it can parse entries in the dump as they are requested by the script; Python has a feature called "yield" for this, which allows a function to generate new elements of a list as they are iterated over. It returns them to me as whole pages with metadata already parsed, which is convenient. However, for it to iterate through every page in the dump takes several minutes. I believe that it uncompresses the file on the fly, so I could try uncompressing it myself first to see what happens. {{User:CodeCat/signature}} 15:16, 31 October 2012 (UTC)
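As an illustration of that "yield" pattern, a simplified stand-in for such a streaming reader (not the actual XmlReader.py code; the on-the-fly decompression here is only an assumption) could look like this:
<pre>
# Simplified stand-in for a streaming dump reader: pages are yielded one
# at a time as the caller iterates, so the whole dump is never held in memory.
import bz2
from xml.etree import ElementTree

def _local(tag):
    """Strip the XML namespace, e.g. '{...}title' -> 'title'."""
    return tag.rsplit('}', 1)[-1]

def stream_pages(dump_path):
    """Yield (title, wikitext) pairs, decompressing .bz2 dumps on the fly."""
    opener = bz2.BZ2File if dump_path.endswith('.bz2') else open
    with opener(dump_path, 'rb') as f:
        title, text = None, ''
        for event, elem in ElementTree.iterparse(f):
            tag = _local(elem.tag)
            if tag == 'title':
                title = elem.text
            elif tag == 'text':
                text = elem.text or ''
            elif tag == 'page':
                yield title, text
                elem.clear()   # discard the finished <page> element

# The loop body runs as each page is parsed, not after the whole file
# has been read:
#   for title, text in stream_pages('enwiktionary-20121021-pages-articles.xml.bz2'):
#       ...
</pre>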
::: I tried it and compared the times. Uncompressed, iterating through the dump without any processing whatsoever takes 273 seconds, and when the dump is compressed it takes 480 seconds. So there is a significant difference, but still nowhere near the 30 seconds that your script achieved, probably because it does a lot of extra parsing to include the metadata. Unfortunately I can't read Perl code, so can you explain what steps your code above does so that I can recreate it? (Note that I am not just looking for pages that transclude {{temp|recons}}; I also want to extract the parameters of each invocation, so that I can build a list of which reconstructed terms are being linked to and from which pages.) {{User:CodeCat/signature}} 15:39, 31 October 2012 (UTC)
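One possible way to recreate those steps in Python while also pulling out the parameters is sketched below; <tt>recons_invocations()</tt> is a made-up name rather than existing bot code, and the handling of pipes inside nested templates and links is only approximate:
<pre>
# Sketch (not actual bot code) of extracting the parameters of every
# {{recons|...}} invocation from a page's wikitext.
import re

START = re.compile(r'\{\{\s*recons\s*\|')

def recons_invocations(wikitext):
    """Yield one list of raw parameter strings per {{recons|...}} call."""
    for m in START.finditer(wikitext):
        # Step 1: walk forward from the opening braces until they are
        # balanced again, so nested templates inside parameters are kept.
        i, depth = m.start(), 0
        while i < len(wikitext):
            if wikitext[i:i + 2] == '{{':
                depth += 1
                i += 2
            elif wikitext[i:i + 2] == '}}':
                depth -= 1
                i += 2
                if depth == 0:
                    break
            else:
                i += 1
        inside = wikitext[m.end():i - 2]   # text between the first '|' and the closing '}}'
        # Step 2: split on '|' only at the top level, so pipes inside
        # nested {{templates}} or [[piped links]] are left alone.
        params, depth, start = [], 0, 0
        for j in range(len(inside)):
            if inside[j:j + 2] in ('{{', '[['):
                depth += 1
            elif inside[j:j + 2] in ('}}', ']]'):
                depth -= 1
            elif inside[j] == '|' and depth == 0:
                params.append(inside[start:j])
                start = j + 1
        params.append(inside[start:])
        yield [p.strip() for p in params]
</pre>
Running each page's text through this and recording the page title next to each parameter list would give the kind of page-to-term listing described above.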