
At 07 SEP 2008 02:54:05PM Paxton Scott wrote:

Greetings!

I was about to post mine, and have read with great interest all the responses to Colin's post.

I will study and apply all suggestions, but at the moment, this is my situation and question.

OI 8.0.3

I have a program that does an OSBRead of a 45MB delimited text file, reading a 128K block at a time, and then, using REMOVE, processes a line at a time.

Since the process is updating or growing a set of MVs in a single record, the fields of interest are parsed into dimensioned arrays. It also writes a few records to another LH file (and I update indexes after every 50 writes).

At the end, the dimensioned arrays are unparsed and the record is reassembled and written back to the LH file.
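
In outline, the loop being described looks something like the sketch below. This is only an illustration: the file path, table name, record key, and field count are hypothetical, and the step that trims each block back to its last complete line is omitted.

Subroutine Import_Sketch(VOID)
* Sketch only - the names and sizes below are placeholders, not the real program.

OSOpen "C:\IMPORT\BIGFILE.TXT" To hFile Else Return
Open "MY_TABLE" To hTable Else Return

Dim flds(25)                         ;* dimensioned working copy of the fields of interest
Read rec From hTable, "MY_KEY" Else rec = ""
For i = 1 To 25
   flds(i) = rec<i>                  ;* parse once, up front
Next i

blockSize = 131072                   ;* 128K per chunk
offset    = 0
Loop
   OSBRead block From hFile At offset Length blockSize
While Len(block)
   offset = offset + Len(block)      ;* real code would first trim back to the last complete line
   Convert \0D\ To ""  In block      ;* drop carriage returns
   Convert \0A\ To @FM In block      ;* one input line per field mark
   pos = 0
   Loop
      Remove line From block At pos Setting delim
      * ...update flds() and/or the other LH table from this line...
   Until delim = 0
   Repeat
Repeat
OSClose hFile

rec = ""
For i = 1 To 25
   rec<i> = flds(i)                  ;* reassemble once at the end
Next i
Write rec To hTable, "MY_KEY"
Return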

The puzzling thing is that this starts out really fast and then gets slower and slower, until it is so slow at the end that the whole run takes about 8 to 20 hours, depending on computer speed.

However, if I split the input file into 4 pieces, the whole process runs in about 35 minutes (3 pieces, about 1 1/2 hours).

I can not figure out why. Happy to share my code (2000 lines unless I work it over to remove extraneous stuff) if someone cares to enlighten me.

I'm doing almost nothing with, and certainly nothing on, any arrays over maybe 25 elements.

Where to look for clues to this?

Thanks,

Paxton


At 07 SEP 2008 07:55PM Paul Rule wrote:

I'm guessing the index updates may have something to do with it.

Our experience has shown this can make a huge difference on large files. Can you remove the indexing, do the import, then add the indexes back at the end (as Mike also suggested)?

Also, make sure you don't call Yield() too often in a loop; this can slow things down massively. Do something like if mod(time(),5) = 0 then yield() so it only yields every 5 seconds.
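
A sketch of that throttle, tracking the time of the last yield so it fires at most once per interval (the loop bound is just a stand-in for the real per-line loop):

Declare Subroutine Yield

lastYield = Time()
For i = 1 To 100000                  ;* stands in for the per-line processing loop
   * ...process one line here...
   If Time() - lastYield >= 5 Then
      Yield()                        ;* hand control to Windows at most once every 5 seconds
      lastYield = Time()
   End
Next i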

HTH.


At 07 SEP 2008 11:27PM Colin Rule wrote:

Paxton, the Yield does have an effect, but not a huge one in my experience.

I too have found that it starts quickly and then goes slower and slower.

I tried using OSBREAD with chunks of about 500K to process the file, and this gave me about 300 records written to the data table per second for the first 30,000 records; it then slowed to around 100 records per second after 70,000 records.

I then created a new table, and predefined 500,000 records.

The above, with this presized table, started at a rate of about 1000 records per second, and even after 300,000 records, it was still running at around 500 records per second, then started to go slower.

I suspect I need to create the table slightly larger, as it possibly depends on the record keys and the hashing algorithm, which determine how many of the items go into primary space and how many into overflow.

I had a counter display as the items write, and it does tend to zip along and then slow and then zip along again, so this is probably the time that it takes to write to overflow and resize these frames.

It seems all the other things, whilst perhaps useful for speed, pale into insignificance compared to the predefined table size.
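
Colin does not say how he predefined the 500,000 records. One way it could be done, offered only as an assumption (the table name and sequential numeric keys below are hypothetical), is to write empty placeholder rows under the keys the import will use, so the LH file has already grown before the large records arrive:

Subroutine Presize_Sketch(VOID)
* Hypothetical table and key scheme; substitute the real import keys.

Open "IMPORT_TABLE" To hTable Else Return

For n = 1 To 500000
   Write "" To hTable, n Else Return ;* placeholder row; the import overwrites it later
Next n
Return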

My import over the weekend took 18 hours.

The import today with the above, and presize took less than an hour.

Colin


At 08 SEP 2008 10:01AM Bob Carten wrote:

Sounds like the paging file starts writing out to disk. Perhaps a GarbageCollect statement after each 128K chunk would help.
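
In code, the placement being suggested would look roughly like this (the chunk processing itself is elided; Flush is shown alongside it as the companion statement for the LH cache):

moreChunks = 1
Loop
   * ...OSBRead the next 128K chunk and process its lines here...
   Flush                             ;* commit cached LH frames to disk
   GarbageCollect                    ;* release string space built up while processing the chunk
   moreChunks = 0                    ;* stub so this sketch terminates
While moreChunks
Repeat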


At 08 SEP 2008 10:17AM Paxton Scott wrote:

Thanks for the responses.

Recall that in my situation, I read one record (which could be empty), parse 9 of the fields into dimensioned arrays, then add to the end of, or update a cell in, the arrays, and sometimes also update a field in a record in a different LH file (adding or removing a value from a field with zero, 1, or 2 values). I am not adding records to any files. Actually, it is 3 records from 3 files, and based on the input 'line', one of the 3 records' "array sets" is updated.

So there is really no indexing here. On the records in the LH file that are updated, I also found that periodically calling update_index seemed to help.

I probably don't understand this, but I am not using Yield(). I do have two 'gas gauge' messages, one for the blocks as they are read in and one for processing the blocks. Is Yield() embedded there? Maybe so, as the cancel button works.

Maybe that is the culprit?

Paxton


At 08 SEP 2008 10:23AM Paxton Scott wrote:

Bob,

Thanks for that suggestion. I will investigate. I have watched memory with Task Manager, and all indications are that I have plenty. Can OI get all it wants?

This is a 4GB machine, all the memory Windows can see is available.

As I remember, the OInsight.exe process is less than 100MB, on the order of 60MB. There appears to be lots of free memory.

Paxton


At 08 SEP 2008 11:28AM SRP's Kevin Fournier (www.srpcs.com) wrote:

Paxton,

Most of my work is in user interfaces, and one thing I've learned is that drawing on the screen uses far more computing power than anything else, even with today's hardware. I've seen processes take two or three times longer (sorry, no hard numbers) simply by introducing a progress bar.

I don't think Yield() is hidden in the progress bars. If it were, it would be even slower. It is indeed a trade off. On the one hand, you can speed things up and make your users wait mysteriously. On the other hand, you can give them a progress bar that takes even longer.

Disclaimer: your mileage may vary.

kfournier@srpcs.com

SRP Computer Solutions, Inc.


At 08 SEP 2008 12:55PM Paxton Scott wrote:

I instituted the Convert/To/In var statement instead of Swaps, added a Flush and GarbageCollect between blocks, and ran the single 45MB input file. It appears the Convert statement definitely improves the speed of processing each block.

I have two gas gauges, one is put up at the beginning of the processing of a block and taken down at the end, the other is put up at the start of the OSBRead and advances as each block is read.

This is a little table of the progress (b/m is blocks processed per minute over the preceding ten minutes):

371 128K blocks total - about 1180 lines per block to process

10 min - 97 blocks 9.7 b/m

20 min - 173 blocks 7.6 b/m

30 min - 233 blocks 6.0 b/m

40 min - 260 blocks 2.7 b/m

50 min - 280 blocks 2.0 b/m

60 min - 295 blocks 1.5 b/m

70 min - 309 blocks 1.4 b/m

Since the block-processing gauge is down during the garbage collect, I notice the longer and longer pauses for Flush/GarbageCollect.

At the 60 min mark, it is well over 5 seconds, but was instantaneous at the beginning.

The processing of the block may be taking a little longer also. Hard to tell.

Why is Flush/GarbageCollect taking longer?

Paxton


At 08 SEP 2008 01:10PM Paxton Scott wrote:

Kevin,

I agree with everything you say, and can certainly eliminate the progress bar here, as it is only for my benefit.

Do you think the progress bar could contribute to the phenomenon of fast at first, then slowing? Please see my GarbageCollect post.

Paxton


At 08 SEP 2008 02:16PM Warren Auyong wrote:

From my experience when working with files on local drives, in general, the fewer OSBREADs you do (larger chunks) the faster it processes. My guess is the offset pointer on OSBREAD starts at the beginning of the file each read.

I guess one could test this by OSBREADing the file from the end to the beginning.

In OI I rarely use OSBREAD and just do an OSREAD instead. I've not run any tests to see at what file size it becomes better to read the file in chunks.
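
A sketch of the whole-file approach (the path is hypothetical); note that a 45MB file becomes a single 45MB string variable in memory:

Subroutine OSRead_Sketch(VOID)
* Sketch only - reads the entire file in one call instead of chunking with OSBREAD.

OSRead whole From "C:\IMPORT\BIGFILE.TXT" Else
   Return
End
Convert \0D\ To ""  In whole         ;* drop carriage returns
Convert \0A\ To @FM In whole         ;* one input line per field mark
pos = 0
Loop
   Remove line From whole At pos Setting delim
   * ...process the line...
Until delim = 0
Repeat
Return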


At 08 SEP 2008 02:19PM SRP's Kevin Fournier (www.srpcs.com) wrote:

Paxton,

I don't think the progress bars are responsible for the increasing slowdown. In my experience, increasing degradation is usually a result of resources piling up and not being released, be it memory, system resources, file handles, etc. The GarbageCollect was a logical place to start, but looking at your posts, it doesn't seem to be solving all your problems.

I'm really unqualified to comment on how OI deals with memory and file management. What I like to do is narrow down the bottleneck. It's tedious, but I measure cumulative time for every logical section of code, and display the results at the end. That way I can see which part of the process is slowest. Then, I may even see which sub-process is slowest inside that, and so on. That way, I'm solving the biggest bottleneck first and working my way down.
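
A sketch of that kind of cumulative timing, with the section names and stubbed GoSubs standing in for the real code. Time() gives whole seconds since midnight, which is coarse; the Windows GetTickCount API could be prototyped instead for millisecond resolution.

Subroutine Timing_Sketch(VOID)
Declare Subroutine Msg

readTime   = 0
parseTime  = 0
writeTime  = 0
moreBlocks = 1

Loop
   t0 = Time()
   GoSub ReadBlock
   readTime = readTime + (Time() - t0)

   t0 = Time()
   GoSub ParseBlock
   parseTime = parseTime + (Time() - t0)

   t0 = Time()
   GoSub WriteRecs
   writeTime = writeTime + (Time() - t0)
While moreBlocks
Repeat

Msg(@Window, "Read ":readTime:"s   Parse ":parseTime:"s   Write ":writeTime:"s")
Return

ReadBlock:
   moreBlocks = 0                    ;* stub: the real routine reads the next 128K chunk
Return

ParseBlock:
Return

WriteRecs:
Return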

I wish I could be more help, but I know there are other community members with far more experience with OI's data management who can help get to the bottom of this.

kfournier@srpcs.com

SRP Computer Solutions, Inc.


At 08 SEP 2008 02:43PM Paxton Scott wrote:

Kevin,

Thanks for your comments. I have on occasion timed activities as a debugging tool, and yes, I will do that here.

However, by the crudest of measurements, it is clear that my insertion of a FLUSH and a GARBAGECOLLECT statement between each block greatly sped up the whole process, but I still do not see why it slows as it goes.

There should be little to no memory growth, as the bulk of the work is done in dimensioned arrays.

I am very pleased to be gaining on this; it has been a good day, and thanks to all.

Paxton


At 08 SEP 2008 02:46PM Paxton Scott wrote:

Warren,

Indeed, block size does make a difference; I started much smaller, and it is a config item. I just had not tried a much larger block, as there was no indication that the block read and the "clean" step, where I find the last line ending and set the next pointer, was the problem.

So, if you are right, the bigger file is a problem…

Thanks,

Paxton
