Comparing Big Lists
Gregory Lypny
gregory.lypny at videotron.ca
Mon Apr 29 13:25:01 EDT 2002
Hi Scott,
I tried your suggestion of turning smallList into an associative
array with the index for each element equal to the text I'm looking for
in bigList. I think I must have misunderstood your suggestion because
the handler runs much slower than previously, perhaps because I've got
it asking for the keys of smallList for every line of bigList. Here's
what I tried.
-- Note: smallListArray is an array made out of the original
-- smallList variable
repeat for each line i in bigList
  if item 6 of i is in keys(smallListArray) then
    put i into hitList[item 6 of i]
  end if
end repeat
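
One possible reason for the slowdown, offered as a guess rather than a measurement: the loop above calls keys(smallListArray) once for every line of bigList, so the full key list is rebuilt and then searched as text about 100,000 times. Indexing the array directly avoids both steps. The sketch below assumes every element of smallListArray holds a non-empty value, so that an empty result means the key is absent. That assumption is not stated in the post.

```livecode
set the itemDelimiter to tab
repeat for each line i in bigList
  -- Direct array lookup; a key that is not in the array yields empty
  -- (assumes the stored values themselves are never empty)
  if smallListArray[item 6 of i] is not empty then
    put i into hitList[item 6 of i]
  end if
end repeat
```

Unlike a text search through the key list, this tests for an exact key, so a partial match on a longer key does not count as a hit.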
Message: 3
Subject: Re: Comparing big lists
Date: Sat, 27 Apr 2002 16:10:42 -0400
From: Gregory Lypny <gregory.lypny at videotron.ca>
To: "MetaCard List" <metacard at lists.runrev.com>
Reply-To: metacard at lists.runrev.com
Thanks for the suggestion, Scott. I'll give it a shot. I've also tried
looping over the lines of bigList (i.e., a nested repeat), simply using
the 'in' operator: if x is in y, then... It takes about 6 minutes on a
modest (300 MHz) iBook running OS X, but I'm hoping for an improvement.
Regards,
Greg
On 27/4/2002 12:08 PM, metacard-request at lists.runrev.com wrote:
Message: 2
Date: Fri, 26 Apr 2002 12:48:53 -0600 (MDT)
From: Scott Raney <raney at metacard.com>
To: metacard at lists.runrev.com
Subject: Re: Comparing big lists
Reply-To: metacard at lists.runrev.com
On: Thu, 25 Apr 2002 Gregory Lypny <gregory.lypny at videotron.ca> wrote:
Thought I would pick your brains on the topic of comparing two big
lists. Both are tab delimited. bigList has about 100,000 lines and
6 items (columns) per line. smallList is about 15,000 lines and 2
items per line. I want to identify the lines in bigList in which
the third item is the same as the second item in a line in
smallList, and then pull out the intersection. I used something
like this, which works fine.
set the itemDelimiter to tab
repeat for each line j of smallList
  put lineOffset(item 2 of j, bigList) into thisLine
  if thisLine is not 0 then put j & tab & \
      line thisLine of bigList & return after mergedList
end repeat
delete last character of mergedList -- Get rid of the trailing Return
Using the lineOffset function seemed the obvious choice to me, but I'm
also interested in other approaches.
LineOffset on such a big variable is going to be pretty expensive.
Another option would be to use split to build an array out of smallList
and then loop over each line in bigList and see if there is an array
index for it. Split takes a while and will use up a good bit of
memory, but makes the lookups *much* faster. You could save some of
that space by building up an array of just the relevant items in one
list or the other by looping over the lines and creating one array
index for each.
Regards,
Scott
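
A minimal sketch of the array approach described above, using the looping variant rather than split, since split by return and tab would key the array on item 1 of each smallList line and the match here is on item 2. The names smallArray and theMatch are hypothetical, and the sketch assumes the comparison is item 3 of bigList against item 2 of smallList, as in the original question.

```livecode
set the itemDelimiter to tab

-- Build an array keyed on item 2 of each smallList line.
-- If two lines share the same item 2, the later line overwrites the earlier.
repeat for each line j in smallList
  put j into smallArray[item 2 of j]
end repeat

-- One pass over bigList; each test is a single array lookup.
repeat for each line i in bigList
  put smallArray[item 3 of i] into theMatch
  if theMatch is not empty then
    put theMatch & tab & i & return after mergedList
  end if
end repeat
delete last character of mergedList -- drop the trailing return
```

This is not a drop-in equivalent of the lineOffset version. It compares item 3 exactly instead of searching the whole line, and it collects every matching line of bigList rather than only the first one found for each smallList entry.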
Regards,
Greg