the large file challenge

Yennie at aol.com Yennie at aol.com
Sun Nov 10 18:54:01 EST 2002


For mine... if you are not concerned so much about the exact number then 
change:

"read from file the_file for numLines lines"
to
"read from file the_file for numLines"

for a big speedup (the second form reads numLines characters rather than 
numLines lines).

and

up the chunkSize to something closer to your available memory, for example 
(8*1024*1024) = 8MB.

As for inaccurate results, it looks like they are roughly double in my case.
Any chance "mystic_mouse" appears more than once on a line?

That would probably cause it; I didn't think of that without seeing the file 
=).
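
To illustrate the suspected cause, here is a minimal sketch in Python (not MetaCard) of how counting every occurrence differs from counting matching lines. The sample lines are made up for illustration; I have not seen the real access_log.

```python
# Hypothetical sample lines; the real access_log is not shown in this thread.
sample = (
    "GET /mystic_mouse/a.gif ref=/mystic_mouse/index.html\n"
    "GET /other/page.html\n"
    "GET /mystic_mouse/b.gif\n"
)

needle = "mystic_mouse"

# Counting every occurrence: a line with two hits contributes two.
occurrences = sample.count(needle)

# Counting matching lines: each line contributes at most one.
matching_lines = sum(1 for line in sample.splitlines() if needle in line)

print(occurrences, matching_lines)
```

If most matching lines mention the string twice (for example once in the request and once in the referrer), the occurrence count would come out at roughly double the line count.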

Or, of course, I could have made an error somewhere (nah!).

Here's a modified script, which attempts to fix the supposed multiple matches 
per line. Try calling it with parameters such as 2, 4, 8, 16, 32, etc. (these 
are now in MB):


#!/usr/local/bin/mc
on startup
  ## initialize variables: try adjusting the chunk size parameter
  put "/gig/tmp/log/access_log" into the_file
  put ($1*1024*1024) into chunkSize ## parameter is now number of MB
  put 0 into counter
  put FALSE into isEOF

  open file the_file

  repeat until (isEOF = TRUE)
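     ## read chunkSize characters, then check if we are at the end of the file
     read from file the_file for chunkSize
     put (the result = "eof") into isEOF
     put it into thisChunk
     ## finish the last partial line so no line or match is split across chunks
     if (isEOF = FALSE) then
        read from file the_file until cr
        put (the result = "eof") into isEOF
        put it after thisChunk
     end if
     put cr after thisChunk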

     ## count the number of matches in this chunk
     put offset("mystic_mouse", thisChunk) into theOffset
     repeat until (theOffset = 0)
        add 1 to counter
        put offset("mystic_mouse", thisChunk, theOffset) into tempOffset
        if (tempOffset > 0) then 
            add tempOffset to theOffset
            add offset(return, thisChunk, theOffset) to theOffset
        else put 0 into theOffset
     end repeat

  end repeat

  close file the_file

  put counter
end startup
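
For cross-checking the counts, here is a rough sketch of the same idea in Python (not MetaCard): read in fixed-size chunks, finish the partial last line, and count each matching line once. The function name and the sample data are my own assumptions for illustration, not part of the script above.

```python
import io

def count_matching_lines(stream, needle, chunk_size):
    """Count lines containing needle, reading stream in chunk_size pieces."""
    counter = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Finish the partial last line so no line is split across chunks.
        if not chunk.endswith(b"\n"):
            chunk += stream.readline()
        # Each line contributes at most one, however many hits it has.
        counter += sum(1 for line in chunk.split(b"\n") if needle in line)
    return counter

# Hypothetical in-memory data standing in for the log file.
demo = io.BytesIO(b"x mystic_mouse y mystic_mouse\nnothing here\nmystic_mouse\n")
print(count_matching_lines(demo, b"mystic_mouse", 8))
```

With a real file you would pass something like open(path, "rb") and a chunk size of 8*1024*1024. The result should not depend on the chunk size, which makes for an easy sanity check.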

HTH.
Brian

