Comparing big lists
Dave Cragg
dcragg at lacscentre.co.uk
Mon Apr 29 08:46:01 EDT 2002
At 6:48 pm -0700 28/4/02, erik hansen wrote:
>how about a short example?
Not so short. :)
Here's a "useless" example to illustrate the cost of incrementing
chunk expressions in a long loop vs using "repeat for each". It
simply copies a list, first using chunk expressions, next using
"repeat for each" Vary the value of tListSize to see how the first
method gets exponentially slower, whereas the second method's time
grows proportionally.
on mouseUp
#build a list of numbers
put 1000 into tListSize
repeat tListSize
put random(1000000) & cr after tList
end repeat
delete char -1 of tList
put empty into tOutList
#test 1
put the milliseconds into tStart
put the number of lines of tList into tNumLines
repeat with i = 1 to tNumLines
put line i of tList & cr after tOutList
end repeat
delete char -1 of tOutlist
put the milliseconds - tStart into tTime
put "Test 1:" && tTime into tResult
put empty into tOutList
#test 2
put the milliseconds into tStart
repeat for each line tLine in tList
put tLine & cr after tOutList
end repeat
delete char -1 of tOutlist
put the milliseconds - tStart into tTime
put cr & "Test 2:" && tTime after tResult
answer tResult
end mouseUp
Here's an example that "kind of" deals with Greg's original question.
This builds two lists of numbers, one of two columns and one of 6. It
rigs the columns to be compared by prepending and appending an "x".
This prevents lineOffset in the first method finding the value in the
wrong column, and avoids the problem of substrings matching. The
value in the second column of the small list is unique. (I take it
from Greg's mail that his data was something like this, but in
general, you'll need to do more work with both methods.)
The first method uses lineOffset in a loop. The second method builds
an array from one list, and then loops through the second list line
at a time checking if there is an array entry for the item being
compared. The time difference is substantial.
This doesn't mean there isn't a place for using the offset functions.
For example, if searching how many times a single word occurs in some
text, it works fine, and there's no overhead of building an array.
But when comparing two large lists, it can get real slow.
on mouseUp
#build two lists for testing
put 10000 into tLargeListSize
put 1000 into tSmallListSize
repeat tLargeListSize
put empty into tLargeLine
repeat 6
put random(1000000) & comma after tLargeLine
end repeat
put "x" before item 3 of tLargeLine
put "x" after item 3 of tLargeLine
delete char -1 of tLargeLine
put tLargeLine & cr after tLargeList
end repeat
delete char -1 of tLargeList
repeat with i = 1 to tSmallListSize
put empty into tSmallLine
put random(10000) & comma & "x" & i & "x" after tSmallLine
put tSmallLine & cr after tSmallList
end repeat
delete char -1 of tSmallList
sort lines of tSmallList by item 1 of each ##random sort
#start test
#test 1 -- using lineOffset
put the milliseconds into tStart
----------
repeat for each line j in tLargeList
put lineOffset(item 3 of j, tSmallList) into tOff
if tOff is not 0 then
put line tOff of tSmallList & ":" & j & cr after tMergeList1
end if
end repeat
delete char -1 of tMergeList1
------------
put the milliseconds - tStart into tTime
put "Test 1:" && tTime into tResult
#test 2 -- using an array
put the milliseconds into tStart
-----------------
#build array of item 2 of small list
repeat for each line tLine in tSmallList
put tLine into tArray[item 2 of tLine] ##to make tests comparable
end repeat
repeat for each line tLine in tLargeList
if tArray[item 3 of tLine] <> empty then
put tArray[item 3 of tLine] & ":" & tLine & cr after tMergeList2
end if
end repeat
delete char -1 of tMergeList2
---------------------
put the milliseconds - tStart into tTime
put cr & "Test 2:" && tTime after tResult
#put tMergeList1 into field 1
#put tMergeList2 into field 2
answer tResult
end mouseUp
Cheers
Dave Cragg
More information about the metacard
mailing list