Novell Abend, NLM, and low priority threads (AREV Specific)
At 21 FEB 2002 08:28:53PM Michael Gwinnell wrote:
Have a Novell 4.2 server, all latest patches. 250 users, decent disk space availability, decent server memory availability, etc. Arev NLM version 1.5a.
Have huge files with large items and heavy activity. Some files are over 3GB (say 3.4GB max) with sub-200MB LK and 3.2GB OV portions. The same files have over 3.2 million records.
Recently abended with a message about low priority threads.
Prior to abend, one workstation displayed a GFE in one of the large files, and the server pegged at 100% utilization.
After finally being able to down the server, and verifying the file, we ended up with 10 GFE frames.
Dump shows (just about) every frame as a GFE, even though LH_Verify does not (I know, dump is not as 'smart' as LH_Verify). All frames show as GFE during the dump process even on an older version of the file (still large, with many overflow frames, and even MORE than 3.2 million records).
Question is- Has anyone else experienced this kind of evil shutdown, and the 100% utilization, and been able to determine its cause? Could it be the Arev NLM being overtaxed with the file and group sizes and not being able to handle them?
Also, attempting to fix the errant frames fails: it just empties the group, fails to re-write any of the data back to the group, and hangs the dump-fix process. After 'fixing' the 10 frames, we ended up losing about 30-50 records per frame (as nearly as we can tell).
Finally, I offloaded the GFE'd file and a post-fix copy of the file to a standalone PC running a standalone copy of Arev, and am unable to list the file. I can DUMP the files, but the list process fails immediately, no items displayed. Does the network and/or NLM provide a more intelligent method of actually allowing Arev to access the data, or do you think I just had two bad copies (both LK and OV were copied and have the same byte count as the server version)?
Thanks in advance!
MEG
At 21 FEB 2002 09:12PM Pat McNerthney wrote:
I would be very concerned about your huge file sizes.
In particular, the 2 gig threshold is where the 32nd bit of a 32-bit unsigned integer value kicks in; if the LH code is not properly using unsigned integer values but is incorrectly using signed integers, there would be problems.
Also, 4 gigs is the max for 32-bit integers anyway, so that will be an eventual hard limit.
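For reference, those two thresholds work out as follows (a plain-arithmetic sketch, written as R/BASIC assignments only to label the numbers - AREV variables are not themselves 32-bit quantities):

    * 2^31 - 1 = 2,147,483,647 bytes (roughly 2GB): one byte further sets the 32nd bit,
    * so code treating the offset as signed sees a negative value.
    SignedMax   = 2147483647
    * 2^32 - 1 = 4,294,967,295 bytes (roughly 4GB): the ceiling for any 32-bit offset.
    UnsignedMax = 4294967295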
I know that Linear Hash has not been stress tested beyond the 2 gig file size in the past.
Pat McNerthney
At 21 FEB 2002 09:24PM Victor Engel wrote:
Have huge files with large items and heavy activity. Some files are over 3GB (say 3.4GB max) with sub-200MB LK and 3.2GB OV portions. The same files have over 3.2 million records.
There have been threads on maximum file size here in the past. You probably should read up on them. You may be bumping into a barrier at 2 Gig and/or 4 Gig.
Dump shows (just about) every frame as a GFE, even though LH_Verify does not (I know, dump is not as 'smart' as LH_Verify). All frames show as GFE during the dump process even on an older version of the file (still large, with many overflow frames, and even MORE than 3.2 million records).
Note that DUMP uses a completely different method to access the data than reading records through RTP57. The DOS files are read directly using DUMP. Even if the NLM is able to handle such huge files, DUMP is probably running into a barrier.
Question is- Has anyone else experienced this kind of evil shutdown, and the 100% utilization, and been able to determine its cause? Could it be the Arev NLM being overtaxed with the file and group sizes and not being able to handle them?
Years ago we had a similar problem that wound up being a system board problem on the server. Another problem with similar symptoms wound up being a bad spot on the drive. I guess with random data coming back, there's no telling what could've happened.
Also, attempting to fix the errant frames fails: it just empties the group, fails to re-write any of the data back to the group, and hangs the dump-fix process. After 'fixing' the 10 frames, we ended up losing about 30-50 records per frame (as nearly as we can tell).
Did you mean frame or group? Fifty records per frame would be very small records. Were you able to recover the records? There are utilities available to recover the records prior to fixing the group, but after the fix, they are basically unrecoverable.
Finally, I offloaded the GFE'd file and a post-fix copy of the file to a standalone PC running a standalone copy of Arev, and am unable to list the file. I can DUMP the files, but the list process fails immediately, no items displayed. Does the network and/or NLM provide a more intelligent method of actually allowing Arev to access the data, or do you think I just had two bad copies (both LK and OV were copied and have the same byte count as the server version)?
How much of the local copy did you dump? Were you able to "walk through" the file using dump? Did you copy the dictionary locally also, or just the data?
At 22 FEB 2002 07:51AM [url=http://www.sprezzatura.com]The Sprezzatura Group[/url] wrote:
There are three ways to move a file pointer through an open file at DOS operating system level. This file pointer determines the offset in the file where the next read or write takes place. It's like your OSBREAD and OSBWRITE offset - same idea.
DOS Interrupt 21h function 42h (Set file pointer) is called with a method code in the AL register, and the file handle in the BX register.
The register pair CX:DX holds the offset (a 32-bit integer, so 4GB max) when AL=00h (absolute offset from the start of the file). If AL is set to 01h or 02h, the file pointer is moved relative to its current position or to the end of the file respectively, and the CX:DX pair is then treated as signed. 2GB is the limit for signed 32-bit numbers, so a relative seek can only move the file pointer by up to 2GB.
So you're really pushing the limits of what the standard DOS file operations can achieve in a 16 bit environment, especially if AREV.EXE ever uses these relative calls.
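To make the seek arithmetic concrete, here is a rough R/BASIC-style sketch of how a byte offset splits into the CX:DX word pair (purely an illustration - the offset value and the variable names are ours, not anything AREV actually does):

    * How a byte offset maps onto the CX:DX word pair (arithmetic illustration only)
    Offset = 3000000000             ;* roughly 2.8GB - beyond the signed 32-bit range
    HighWord = INT(Offset / 65536)  ;* what would go into CX - its top bit is set
    LowWord = MOD(Offset, 65536)    ;* what would go into DX
    * Read as unsigned (AL=00h, absolute seek) this offset is legal; read as signed
    * (AL=01h or 02h, relative seek) the same bit pattern is a large negative number.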
There are other 2GB and 4GB limitations in DOS, and you've probably been lucky to date that your files haven't been corrupted more often.
World Leaders in all things RevSoft
At 22 FEB 2002 02:15PM Michael Gwinnell wrote:
So, let's see if I understand:
There is a 2GB limitation on a pure-DOS file, using 16 bit DOS. There is a 4GB limit on Arev files.
That much I've already known.
But now what I am hearing is that, rather than imposing a 4GB limit on Arev total file sizes, we should impose a 2GB limit on those files.
Is that correct?
Also - sorry Victor, I was using the terms Frame and Group interchangeably in my previous posting. Yes, the GROUPS were being dump-fixed, and there were (probably) 30-50 records per GROUP. Also, keep in mind the large disparity between the LK and OV portions of the file; the overflow is excessive.
Victor - you mentioned that there are utilities to recover records prior to fixing a group. I have searched to no avail. Could you point me toward one of those utilities? It would be much appreciated.
Pat - Thanks for your input. Is there some way, or someone who could verify, whether the LH code is properly using unsigned integers?
Thanks again for your assistance!
MEG
At 22 FEB 2002 02:22PM [url=http://www.sprezzatura.com]The Sprezzatura Group[/url] wrote:
As an interim solution, you could use an MFS to construct a logical file from multiple physical files (as BIG.MFS used to) using a "pre-hash" to subdivide the file so that the total file size stays within "safe" limits.
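A very rough illustration of the pre-hash idea (our own sketch, not BIG.MFS itself; the subroutine name and the character-sum hash are invented for the example):

    SUBROUTINE PICK_PARTITION(Key, NumParts, PartNo)
    * Route a record key to one of NumParts physical files so that no single
    * LK/OV pair has to grow past the "safe" size. Illustration only.
       Total = 0
       KeyLen = LEN(Key)
       FOR I = 1 TO KeyLen
          Total = Total + SEQ(Key[I, 1])   ;* sum the character codes of the key
       NEXT I
       PartNo = MOD(Total, NumParts) + 1   ;* 1..NumParts selects DATA_1, DATA_2, ...
    RETURN

The MFS would call something along these lines on every request and then pass the call on to the BFS for the selected physical file, so AREV still sees a single logical file.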
World Leaders in all things RevSoft
At 22 FEB 2002 05:26PM [url=http://www.sprezzatura.com]The Sprezzatura Group[/url] wrote:
But now what I am hearing is that, rather than imposing a 4GB limit on Arev total file sizes, we should impose a 2GB limit on those files. Is that correct?
Well, if you limit the AREV files and they can't resize, that's going to mean instant corruption (almost like an out-of-disk-space condition).
All we were suggesting was that AREV is dependent on DOS limits, and that DOS has a few 2GB boundaries inherently. Whether AREV encounters these can be tested by a test program using an oswrite statement, or, as in your case, populating a really large file.
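A minimal probe along those lines might look like the following (a sketch only, using the byte-level OSBWRITE and assuming it sets STATUS() when a write fails; PROBE.BIN is an invented scratch file created beforehand):

    * Probe whether byte-level writes past the DOS boundaries succeed (illustration only)
    OSOPEN 'C:\TEMP\PROBE.BIN' TO FileVar ELSE STOP
    Offsets = 1073741824 : @FM : 2147483648 : @FM : 4294967296   ;* 1GB, 2GB, 4GB
    FOR I = 1 TO 3
       Offset = Offsets<I>
       OSBWRITE 'X' ON FileVar AT Offset
       IF STATUS() THEN
          PRINT 'Write failed at offset ' : Offset
       END ELSE
          PRINT 'Write OK at offset ' : Offset
       END
    NEXT I
    OSCLOSE FileVar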
The AREV LH filing system is extremely dynamic - to use an MFS to split the file up into multiple physical files is probably your safest bet - gentler on your operating system, your hard disk, and your available RAM.
There are plans for a future SENL article on LH filing system characteristics.
World Leaders in all things RevSoft
At 25 FEB 2002 09:40PM Victor Engel wrote:
Michael,
Send me a private email at [email protected]. I thought others had posted utilities here, but I may be thinking of utilities to move data to flat files or to access the data directly from other applications.
I do have something that may be of use to you that you may wish to modify for your own use.
At 27 FEB 2002 02:39AM Victor Engel wrote:
I just ran a program to test this out. Interestingly, I wrote the program to stop when the size of the OV file exceeded 3.5 Gig. The program crashed when the file hit 4 Gig. Why? Well, the formula I used to determine file size maxed at 2 Gig and then started decreasing as the OV file kept growing. The formula uses the DIR() function to determine the size.
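A sketch of that kind of size check (the OS path is a placeholder, and it assumes field 1 of the DIR() result is the byte count):

    * Stop-the-test check based on DIR() (illustration only).
    * The reported size tops out around 2GB, so the 3.5 Gig test below never fires.
    FileInfo = DIR('F:\DATA\BIGFILE.OV')   ;* placeholder OS path to the OV portion
    OvSize = FileInfo<1>                   ;* field 1 of DIR() taken as the size in bytes
    IF OvSize > 3758096384 THEN STOP       ;* stop once the OV portion passes 3.5GB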
I reran the program and manually stopped it when the OV file hit 2.4 Gig. Then I ran DUMP on it and got GFEs just as you did. Periodic DUMPs at under 2 Gig resulted in no GFE.
Note that these are not true GFEs. The file itself is OK. A select on the file proceeds without error. My utility to retrieve records in a group also works fine.
DUMP evidently is using a DOS function limited to 2 Gig whereas the linear hash driver I'm using (non-networked driver) doesn't seem to have this problem.
However, as soon as a write is performed at the 4 Gig boundary, there is an FS error message and the program terminates.
I'm running on Windows NT, and the error message I get indicates the operating system has denied access to the file. A similar message is produced when I try to access the file using utilities such as Textpad, Wordpad, etc.
Actually, I just took a closer look at the results of my utility to copy groups. It did not seem to be copying all the records, so I wrote another program to test if OSBREAD can go past 2 Gig. It can't. The file is over 2 Gig, but any attempt to read past 2 Gig results in a null string being returned. A test of OSBWRITE, not surprisingly, had a similar result.
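For anyone who wants to repeat the check, a minimal version looks something like this (a sketch; the OV path is a placeholder, and it assumes the file has already grown past 2 Gig):

    * Try to read 100 bytes from just past the 2GB mark (illustration only)
    OSOPEN 'F:\DATA\BIGFILE.OV' TO FileVar ELSE STOP
    OSBREAD Chunk FROM FileVar AT 2147483748 LENGTH 100   ;* 2^31 + 100
    IF Chunk = '' THEN
       PRINT 'Null returned - OSBREAD cannot reach past 2GB'
    END ELSE
       PRINT 'Read ' : LEN(Chunk) : ' bytes'
    END
    OSCLOSE FileVar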
The bottom line is that my utility won't be helpful to you since it was written in R/BASIC using OSBREAD and OSBWRITE.