the large file challenge
Yennie at aol.com
Sun Nov 10 18:54:01 EST 2002
For mine... if you are not concerned so much about the exact number then
change:
"read from file the_file for numLines lines"
to
"read from file the_file for numLines"
for a big speedup (without "lines", the amount is counted in characters
rather than lines),
and
up the chunkSize to something closer to your available memory, for example
(8*1024*1024) = 8MB.
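As a rough sketch of what that change amounts to (assuming the_file is already
open, and with 8MB picked purely as an example size), the read would look
something like this:

```metatalk
## sketch only: assumes the_file has already been opened with "open file"
put (8*1024*1024) into chunkSize  ## 8388608 characters, i.e. 8MB
read from file the_file for chunkSize  ## no "lines" keyword: counts characters
put (the result = "eof") into isEOF  ## TRUE once the end of the file is reached
```

Reading by characters means a chunk can end in the middle of a line, which is
the price of the speedup.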
As for inaccurate results, it looks like they are roughly double in my case.
Any chance "mystic_mouse" appears more than once on a line?
That would probably cause it. I didn't think of that without seeing the file
=).
Or, of course, I could have made an error somewhere (nah!).
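One way to check that guess would be to count the occurrences on a single
suspect line. A minimal sketch, where countInLine is a made-up helper name and
not something from the original script:

```metatalk
## hypothetical helper: counts non-overlapping occurrences of thePattern in theLine
function countInLine theLine, thePattern
  put 0 into theCount
  put 0 into charsSkipped
  repeat forever
    ## offset() returns a position relative to the characters skipped, or 0
    put offset(thePattern, theLine, charsSkipped) into foundAt
    if (foundAt = 0) then exit repeat
    add 1 to theCount
    ## move past this occurrence before searching again
    add (foundAt + length(thePattern) - 1) to charsSkipped
  end repeat
  return theCount
end countInLine
```

Something like "put countInLine(line 1 of thisChunk, "mystic_mouse")" on a few
lines of the log should show whether any of them come back with more than 1.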
Here's a modified script, which attempts to fix the supposed multiple matches
per line.
Try calling it with parameters such as 2, 4, 8, 16, 32, etc. (these are now in
MB):
#!/usr/local/bin/mc
on startup
  ## initialize variables: try adjusting the chunk size parameter
  put "/gig/tmp/log/access_log" into the_file
  put ($1*1024*1024) into chunkSize ## parameter is now number of MB
  put 0 into counter
  open file the_file
  repeat until (isEOF = TRUE)
    ## read the next chunkSize characters, check if we are at the end of the file
    ## note: a chunk may end mid-line, so a line split across two chunks
    ## can be counted twice or missed
    read from file the_file for chunkSize
    put (it&cr) into thisChunk
    put (the result = "eof") into isEOF
    ## count the matching lines in this chunk, at most one match per line
    put offset("mystic_mouse", thisChunk) into theOffset
    repeat until (theOffset = 0)
      add 1 to counter
      ## skip to the end of the line containing this match
      add offset(return, thisChunk, theOffset) to theOffset
      ## find the next match; the result is relative to the characters skipped
      put offset("mystic_mouse", thisChunk, theOffset) into tempOffset
      if (tempOffset > 0) then
        add tempOffset to theOffset
      else
        put 0 into theOffset
      end if
    end repeat
  end repeat
  close file the_file
  put counter
end startup
HTH.
Brian