Comparing big lists
Ben Rubinstein
benr at cogapp.com
Mon Apr 29 04:03:01 EDT 2002
on 27/4/02 9:10 PM, Gregory Lypny at gregory.lypny at videotron.ca wrote:
> Thanks for the suggestion, Scott. I'll give it a shot. I've also tried
> looping over the lines of bigList (i.e., a nested repeat), simply using
> the 'in' operator: if x is in y, then... It takes about 6 minutes on a
> modest (300 MHz) iBook running OS X, but I'm hoping for an improvement.
>
I wrote a general version of exactly this recently (take two tab- and
return-delimited files; from each, select one column to match on and the
columns to output; then choose to output the merged file, and/or the
discards from one or the other file). My first version used a 'clever'
function based on lineOffset to locate all the matching lines. With inputs
of 50,000 and 10,000 lines, it took about 3 minutes on a 400 MHz TiBook
(OS 9).
Then I rewrote it to put one file into an array indexed on the requested
column, e.g.
set the itemDelimiter to tab
put empty into srcDataBarray
-- index every line of file B by its key column
repeat for each line r in srcDataB
  put r into srcDataBarray[item LinkColB of r]
end repeat
and loop through the other file (using repeat for each, of course, not
repeat with a line number), testing each key against the array. That
brought it down to a few seconds. Then I took out the progress feedback;
now it consistently runs in a second or less. It's so fast that the
difference between indexing the 'small' file and the 'large' one is
undetectable, and it does all the extra work (building all three lists:
the merged set and the two discard sets) every time, only checking at the
end which of the three I actually asked it to save.
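The lookup pass, sketched with the same made-up names (srcDataA, LinkColA,
plus tMerged and tDiscardA for two of the output lists), is something like
this. It leans on the fact that reading a nonexistent array element returns
empty, and that a stored line of B is never empty:

set the itemDelimiter to tab
repeat for each line r in srcDataA
  -- a direct array access is a hash probe, not a scan of the file
  put srcDataBarray[item LinkColA of r] into tMatch
  if tMatch is not empty then
    put r & tab & tMatch & return after tMerged
  else
    put r & return after tDiscardA
  end if
end repeat

(Deleting each matched key from srcDataBarray as you go leaves exactly B's
discards behind in the array when the loop finishes.)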
The combination of 'repeat for each' and MC/Rev's hashed arrays is just
blinding. A fantastic illustration of how a fourth-generation language can
give not only fast development but also fast execution.
Ben Rubinstein | Email: benr_mc at cogapp.com
Cognitive Applications Ltd | Phone: +44 (0)1273-821600
http://www.cogapp.com | Fax : +44 (0)1273-728866