Understanding LHVERIFY detailed stats (OpenInsight 32-Bit)
At 20 SEP 2007 07:00:31AM Martin Drenovac wrote:
In trying to understand the value, or otherwise, of creating files with varying block size values from 1024 through 4096, I have come across a .chm writeup which says that any file whose records are over 100 bytes should be created as a 4096 block table.
So I've embarked on comparing the details of the current file against the details of the same file under the recommended 4096 byte size. I have a very small file of 190,000 records.
1. LHVERIFY the original file at 1024 byte record blocks.
2. CREATE a new table at 4096 byte blocks, copy all of the (1) data over to get the same 190,000 records.
Running LHVERIFY and then checking the Detailed Statistics brings up a very interesting screen of data. Unfortunately, I've no idea which of those values indicate "good".
I do see differences between the 1024 and 4096 block sizes. For example, the detailed stats display shows Average Disk IO, and the 4096 file shows a lower number than the 1024 file does, so I'm thinking fewer disk reads to get to the same data, which is good.
The thing is, I've no idea what the rest of the data represents. Where can I grab some details on what's good, what to look for, what to size files at, and so on?
At 20 SEP 2007 10:27AM [url=http://www.sprezzatura.com]The Sprezzatura Group[/url] wrote:
There's no formal written documentation, though recent events have made us think that a detailed formal white paper is in order on the full nature of Linear Hash, the impact of the Universal Driver, and configuration guidelines on how best to optimise files.
In the absence of such documentation, we at Sprezz Towers will do our best to try to explain what you are seeing.
You didn't mention record size, but let's assume you have 100 byte records, as per the documentation you quoted. Let's further assume a 7 character key. Adding to that the 2 bytes needed for the record length and key length plus the 1 byte for the end-of-record mark, the system will require 110 bytes per record. Allowing for the space taken by the header, a fully packed frame will hold 37 records at a 4096 byte frame size and 9 records at a 1024 byte frame size.
Assuming the frames are fully packed, with no blank frames and no overflow, there will be about 5,135 frames in the 4096 byte file and about 21,111 frames in the 1024 byte file. If you were to do a full sequential file pass, that's the number of reads it would take to get all the records, meaning a file with frames 4 times as large needs roughly a quarter of the reads.
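To sanity check that arithmetic, here's a minimal sketch in Python (not OpenInsight code), assuming the 110 bytes per record worked out above and ignoring the exact frame header overhead:
[code]
# Back-of-the-envelope check of the frame arithmetic above.
# Assumptions: 100 byte records, 7 byte keys, 3 bytes of per-record
# overhead, frame header space ignored, fully packed frames, no overflow.
RECORD_BYTES = 100 + 7 + 2 + 1     # data + key + length bytes + end-of-record mark
NUM_RECORDS  = 190000

for frame_size in (1024, 4096):
    per_frame = frame_size // RECORD_BYTES     # records a fully packed frame can hold
    frames    = NUM_RECORDS / per_frame        # frames touched in a full sequential pass
    print(f"{frame_size} byte frames: {per_frame} records/frame, ~{frames:,.0f} frames to scan")
[/code]
Which prints 9 records/frame and ~21,111 frames for 1024, against 37 records/frame and ~5,135 frames for 4096.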
Given that statistic, a 4096 byte frame is by far your best bet.
Do you sense a "but…." coming along?
But (wow..psychic, man), that's really only a meaningful statistic when doing full hashed order file access. Most of the time, you'll be doing individual record reads for processing, or processing those records in non hashed order, like a date sort. In that case, there's a different set of statistics we need to look at.
To refresh, our 4096 byte frame holds 37 records while our 1024 byte frame only holds 9. That means for each individual, random-access record read, the 4096 byte frame has a potential 37 key checks before finding the correct record, while the 1024 byte frame only has 9. Because of the way the hashing algorithm works, the filing system can check up to 4 groups before deciding that the record does not exist. That means the maximum possible number of checks for a record is 148 for a 4096 frame and 36 for a 1024 frame. Given that the 4096 frame has 4 times as many records per frame, on average 75% of your reads will be slower at the larger frame size than they would be at the smaller frame size.
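Again as a rough Python sketch of that arithmetic (the up-to-4-group probe limit is taken from the description above, not from the actual hashing code):
[code]
# Worst-case key comparisons for one random read that misses,
# assuming fully packed frames and a probe of up to 4 groups.
MAX_GROUPS_CHECKED = 4

for frame_size, records_per_frame in ((1024, 9), (4096, 37)):
    worst_case = records_per_frame * MAX_GROUPS_CHECKED
    print(f"{frame_size} byte frames: up to {worst_case} key checks before giving up")
[/code]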
Given that statistic, a 1024 byte frame is by far your best bet.
Do you sense another "but…." coming along?
But, in the real world, files don't hash this way. These statistics just represent the extremes of the hashing algorithm. In the real world, you'll have about 20% of your primary frames empty and about 60-75% of your data in overflow. That changes the statistics in many, many ways. As your data starts moving into overflow, and it will, even with 100 byte records and 4096 frames, more and more records will be added to a single group, meaning it will take more and more checks to find that single record. However, on the 1024 byte frames, you'll be creating more and more frames requiring more and more disk reads.
The system does not, to the best of my knowledge, read in the entire group at a time; it reads frame by frame. Assume now that a group in our 1024 byte file has 3 overflow frames. That group is now 4096 bytes in total. Even though we're still doing up to 37 record comparison checks, we're doing 4 disk reads instead of the single disk read for the 4096 byte frame. That means, for record 37 in hash order, it's faster to get the one in the 4096 frame than the one in the 1024 frame.
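As a toy model of that point (counting frame-by-frame disk reads, as described above, for the same 37-record group):
[code]
# One 37-record group (~110 bytes per record) stored at each frame size.
# Disk reads are counted frame by frame; the key comparisons are the same
# either way.
GROUP_BYTES = 37 * 110

for frame_size in (1024, 4096):
    frames_read = -(-GROUP_BYTES // frame_size)   # ceiling division
    print(f"{frame_size} byte frames: {frames_read} disk read(s) for the group, up to 37 key checks")
[/code]
Which gives 4 reads (1 primary plus 3 overflow) at 1024 versus a single read at 4096.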
Given that statistic, I have no idea which one would be faster, on average.
It depends on where and how your records hash across the file. Sequential numeric keys tend to hash in clusters. Random character keys hash better. Keys like government ID numbers, which tend to be the same number of characters, tend to hash in clusters, even if they are alphanumeric.
You can also start manipulating your results by modifying the threshold value. The threshold is the amount of primary space that must be used before the system will rehash the data, adding more groups. Normally this is 80%, meaning you will generally have about 20% of your primary frames empty. If you drop this value, you'll have more groups in your file, since the system will rehash sooner. While this will create a larger LK file on disk, the records will spread out more, meaning that, on average, you will have fewer records per group. (A group is the primary frame and its associated overflow frames.) This increases your per-record access speed but decreases your hash-order scan speed (non-indexed selects).
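As a very rough sketch of the threshold effect (these are estimates built on the same assumptions used earlier in this thread, not figures LHVERIFY will report):
[code]
# Lower threshold => the file rehashes sooner => more groups,
# so fewer records per group on average (and a larger LK file).
DATA_BYTES = 190000 * 110          # the same 190,000 records at ~110 bytes each
FRAME_SIZE = 4096

for threshold in (0.80, 0.60):
    groups         = DATA_BYTES / (FRAME_SIZE * threshold)   # approximate group count
    recs_per_group = 190000 / groups
    print(f"threshold {threshold:.0%}: ~{groups:,.0f} groups, ~{recs_per_group:.1f} records/group")
[/code]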
As you can see, there is not a hard and fast rule as to how to define your file for the fastest access.
I hope this cleared up a few things, even as it most likely raised more questions.
Your friendly, neighborhood Sprezz Guy.
World leaders in all things RevSoft
At 20 SEP 2007 10:58AM [email protected] wrote:
From high atop Obfuscation Towers in loverly London, the melodic tones of SprezzGuy
Have you ever written a routine to actually analyze tables? There are some in PICK and uniVerse to check overflow etc. Testing the speed bit would be a bit of work.
At 20 SEP 2007 12:46PM Richard Hunt wrote:
I consider that having a single read take 10 milliseconds longer is much better than having a multiple-read process take minutes longer.
What I mean is that if a single read is a blink of an eye longer, I really cannot see the user even noticing the slowdown, whereas a multiple-read process that takes minutes longer can really frustrate a user.
I have no proof of the time that will be saved or lost.
At 21 SEP 2007 12:48PM W Shepard wrote:
Another consideration is delete time. In my subjective experience, the record delete function is one of the most I/O intensive jobs on a file. Going to the larger frame size dramatically improves the time required to delete a record. I mean dramatic: I've seen jobs that delete large numbers of records finish in a tenth of the time they took before the file was reframed at 4096. I always frame volatile files at 4096.
At 21 SEP 2007 01:33PM John Bouley wrote:
Presumably this must be due to the fact that the majority of records are in the primary frame and not in overflow?
John
At 12 OCT 2007 03:08PM W Shepard wrote:
No, I think it is because when a record is deleted, all the records after it in the page must be moved forward. This means that if the record is an old record and located at the front of the page, every sector of the page must be read in, modified and written out. This is unlike writing where the new record is added at the end of the page and only the last sector must be written.
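A toy byte-shuffling model of that description (this is an assumption about the internals, not documented behaviour):
[code]
# If deleting a record compacts the frame by shifting everything after it
# forward, a delete near the front of a 4096 byte frame touches almost the
# whole frame, while an append only touches the tail.
FRAME_SIZE = 4096
RECORD_LEN = 110

def bytes_shifted_on_delete(offset):
    """Bytes moved to close the gap left by a deleted record at this offset."""
    return FRAME_SIZE - (offset + RECORD_LEN)

print("delete first record :", bytes_shifted_on_delete(0), "bytes shifted")
print("delete last record  :", bytes_shifted_on_delete(FRAME_SIZE - RECORD_LEN), "bytes shifted")
print("append a record     :", RECORD_LEN, "bytes written at the tail")
[/code]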
Wayne
At 13 OCT 2007 12:21PM Richard Hunt wrote:
It would be nice to have a complete listing of the valid codes for argument one of the "linear hash" routine (RTP57A). Then you could effectively program a routine that would give you the stats you need.
Although, it is not so hard to read and accumulate information from "linear hashed" files. The information needed is in the "Knowledge base".
You could OSBREAD the frame info, list of IDs, overflow free list, overflow frames, and "hashing" results for IDs (you need RTP57A and "GN" for arg1), and then do the actual record extraction.
That would give you almost all the info needed to produce reports or results like from the "PICK" or "UniVerse" systems.
It would be nice to have a section on the Revelation website so that we could share code between ourselves. I have a subroutine that does the above things. I am just not so sure it is perfect, though it does seem to be. I am not an expert in the linear hash file structure. I use it so that I can isolate any single-group problems and get complete detail of single groups when I need it.
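For anyone who wants a starting point before writing the BASIC+, here is a rough outside-the-product sketch in Python that just walks an LK file in frame-sized chunks and counts apparently empty frames. It deliberately does not decode the frame header or record layout; that is the part that needs RTP57A and the knowledge base articles mentioned above, and the first frame may well hold file header information rather than data:
[code]
# Walk an LK file frame by frame and count frames that contain nothing
# but zero bytes.  FRAME_SIZE and LK_PATH are illustrative values only.
FRAME_SIZE = 1024                  # must match the table's actual frame size
LK_PATH    = "MYTABLE.LK"          # hypothetical path to the table's LK file

total = empty = 0
with open(LK_PATH, "rb") as lk:
    while True:
        frame = lk.read(FRAME_SIZE)
        if not frame:
            break
        total += 1
        if not frame.strip(b"\x00"):   # nothing but zero bytes in this frame
            empty += 1

print(f"{total} frames read, {empty} apparently empty")
[/code]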
At 16 OCT 2007 04:29AM [url=http://www.sprezzatura.com]The Sprezzatura Group[/url] wrote:
You can get a more detailed explanation of the codes returned from the SYSTEXT records DICT.SYSLHVERIFY*PARAMS*1 and *2. Record 2 seems to just have swappable parameters for 1, so they're essentially identical.
There's field information in the VERIFY_LH help record.
AREV had a dictionary mapping for this and SYSLHGROUP.
World leaders in all things RevSoft