
At 06 OCT 2000 09:06:36AM Oystein Reigem wrote:

I have a couple of new data volumes where the OV part of index tables tends to be four times as large as the LK part. The worst I've seen elsewhere in my apps is an OV/LK ratio of about 1.65. I'm a bit concerned since the volumes contain data to be accessed from the web. I'd like the access time to be as low as possible. Can anybody think of an explanation? Or a cure? Can the frame size be an issue - just as with data tables? What is it really that goes in the OV part of an index table? I've got some old LH docs, but I don't think they mention indexes.

- Oystein -

Øystein Reigem,

Humanities Information Technologies,

Allégt 27,

N-5007 Bergen,

Norway.

Tel: +47 55 58 32 42.

Fax: +47 55 58 94 70.

[email protected]

Home tel/fax: +47 56 14 06 11.


At 07 OCT 2000 02:27AM Warren wrote:

Check to see if the sizelock is > 0.

An index file is just another data file as far as the LH filing system is concerned. Basically, once the base frame is filled, any other data that either hashes to the same group or belongs to a record in the base frame goes into an overflow frame. At some point the file should resize and either increase or decrease the modulo, and in theory change the record distribution.

A 4 to 1 ratio of OV vs LK is not unusual for index files.
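
To make the mechanism concrete, here's a toy sketch in Python - not the actual LH code; the frame size, threshold, hash and record layout are invented stand-ins - showing records spilling into overflow once a group's base frame is full, and a resize changing the modulo and redistributing:

FRAME_SIZE = 1024          # bytes per primary frame (LH's default is 1K)
THRESHOLD  = 0.80          # resize once primary frames are ~80% full

class ToyLH:
    def __init__(self, modulo=1):
        self.modulo   = modulo
        self.primary  = {g: [] for g in range(modulo)}   # the "LK" part
        self.overflow = {g: [] for g in range(modulo)}   # the "OV" part

    def _group(self, key):
        return hash(key) % self.modulo

    def _insert(self, key, record):
        g = self._group(key)
        used = sum(len(r) for _, r in self.primary[g])
        if used + len(record) <= FRAME_SIZE:
            self.primary[g].append((key, record))    # fits in the base frame
        else:
            self.overflow[g].append((key, record))   # spills into overflow

    def _fill(self):
        used = sum(len(r) for frame in self.primary.values() for _, r in frame)
        return used / (self.modulo * FRAME_SIZE)

    def _resize(self):
        # Add a group, then redistribute every record under the new modulo.
        records = [(k, r) for g in range(self.modulo)
                   for k, r in self.primary[g] + self.overflow[g]]
        self.modulo += 1
        self.primary  = {g: [] for g in range(self.modulo)}
        self.overflow = {g: [] for g in range(self.modulo)}
        for k, r in records:
            self._insert(k, r)

    def write(self, key, record):
        self._insert(key, record)
        # Only the primary (LK) fill triggers a resize; overflow can grow freely.
        while self._fill() >= THRESHOLD:
            self._resize()

t = ToyLH()
for i in range(200):
    t.write(f"K{i}", b"x" * 40)      # 200 small records force several resizes
print(t.modulo, sum(map(len, t.overflow.values())))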


At 09 OCT 2000 09:17AM Don Miller - C3 Inc. wrote:

Oystein ..

Warren is right about how overflow frames get created. It is possible to create a new table of approximately the right size to minimize overflow (unless you get the dreaded key clustering). It is much more difficult to do this for indexes. The ! (index) table is created the first time you create or rebuild the indexes and added to on each subsequent one. If you have a 4:1 OV to LK ratio, I wouldn't be tremendously concerned. If you want to minimize this, you can sometimes use DUMP in AREV to compress a table's overflow frames. This will tend to put more frames in the LK portion. Sometimes changing the threshold percentage will help, since it will cause resizing to occur more frequently. At least that's the theory.

Don Miller

C3 Inc.


At 09 OCT 2000 09:33AM Oystein Reigem wrote:

Warren and Don,

Thanks for your response.

You say 4:1 is not bad, but it's worse than my other index tables. Could the reason be something Don hints at - that there hasn't been enough resizing? My other data tables are updated every now and then, and so are the indexes. Lots of chances for resizing. The set of tables I'm concerned about now is essentially indexed once. First, data are written to the tables in one or more "export" processes. Then the tables are indexed. After that the content doesn't change. There is only one index that is active and updated during the "export" process; all the others are built afterwards.

- Oystein -


At 10 OCT 2000 09:49AM [url=http://www.sprezzatura.com]The Sprezzatura Group[/url] wrote:

Resizing should occur on a read as well, so if the index is used frequently, it should resize itself.

Most likely it's just clustering. If you are really concerned, just decrease the threshold to 50% or lower. That will cause more resizing and create a more even distribution. It will increase the LK file size though.

The Sprezzatura Group

World Leaders in all things RevSoft


At 10 OCT 2000 12:16PM Oystein Reigem wrote:

Sprezzatura,

Resizing should occur on a read as well, so if the index is used frequently, it should resize itself.

The indexes haven't been used yet. So it's too early for me to tell.

Most likely it's just clustering.

As in hashing to the same group?

If you are really concerned, just decrease the threshold to 50% or lower. That will cause more resizing and create a more even distribution. It will increase the LK file size though.

What I'm concerned about is access time. I just want as fast access as I can get, both to data and indexes, because it's public access from the web.

File size is not an issue.

Here's more specifics about my tables:

Data rows are mainly accessed in random order.

Indexes are accessed in connection with queries.

Indexes are also accessed from my own index lookup routine. A typical request is to get 20 consecutive values (not keys) starting with a certain value. The index lookup routine uses a BtreeRead to find a starting point and normal reads to follow pointers (a rough sketch follows below).

Once they've been prepared, the data tables are static. Does that mean I should change the threshold to 100%?

You suggest lowering the threshold for indexes to avoid clustering, but can't that also cause an increase in access time (more frames must be read because there's less data in each)?
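
Here's a rough sketch of that lookup pattern, in Python rather than the real code - the node layout and the search helper are simplified stand-ins, not the actual BtreeRead interface:

# Toy model: "index" is a dict of leaf nodes, each holding sorted values and a
# pointer to the next leaf -- a simplified stand-in for the real index rows.

def btree_search(index, start_value):
    """Stand-in for the BtreeRead step: find the first leaf/position whose
    value is >= start_value (leaves happen to be in key order in this toy)."""
    for node_id in sorted(index):
        for pos, v in enumerate(index[node_id]["values"]):
            if v >= start_value:
                return node_id, pos
    return None, 0

def lookup_values(index, start_value, count=20):
    node_id, pos = btree_search(index, start_value)
    out = []
    while node_id is not None and len(out) < count:
        node = index[node_id]                             # ~ a normal read
        out.extend(node["values"][pos:pos + count - len(out)])
        node_id, pos = node["next"], 0                    # follow the pointer
    return out

# Two leaves chained together:
idx = {1: {"values": ["apple", "banana"], "next": 2},
       2: {"values": ["cherry", "damson"], "next": None}}
print(lookup_values(idx, "b", count=3))   # ['banana', 'cherry', 'damson']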

- Oystein -


At 10 OCT 2000 02:35PM [url=http://www.sprezzatura.com]The Sprezzatura Group[/url] wrote:

Clustering is a large number of records hashing to a single group.

As for your access times, there are a few things to keep in mind. Each read is a separate record access; therefore, each read requires another hash and another hit on the server. If the records happen to hash to the same group, then you can get the next read from the cache. Otherwise, it's back to the disk.

The system does not read the entire group in at once. It reads each frame until it finds the key, then reads the groups necessary to get the entire record. So, yes, the more records in a group, the more disk requests, potentially.

Remember, the resizing only occurs when the LK portion fills up enough to reach the threshold. The OV portion can grow and grow without affecting resizing at all.

A lower threshold will create a larger LK file, which gives each record a better chance of being in its own group.
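
As a rough back-of-the-envelope illustration (the data size is invented and real LH behaviour is only approximated), the modulo a file settles at is roughly its primary-frame data divided by the threshold, so halving the threshold roughly doubles the group count and the LK size:

from math import ceil

FRAME_SIZE = 1024                       # 1K primary frames (the LH default)
data_bytes = 500 * 1024                 # say ~500K of record data held in primary frames

def groups_needed(threshold):
    # resizing roughly keeps primary fill just under threshold * modulo * FRAME_SIZE
    return ceil(data_bytes / (threshold * FRAME_SIZE))

for t in (0.8, 0.5):
    m = groups_needed(t)
    print(f"threshold {t:.0%}: ~{m} groups, LK ~{m * FRAME_SIZE // 1024}K")
# threshold 80%: ~625 groups, LK ~625K
# threshold 50%: ~1000 groups, LK ~1000K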

The Sprezzatura Group

World Leaders in all things RevSoft


At 11 OCT 2000 10:20AM Oystein Reigem wrote:

Sprezzatura,

I really appreciate your response. Please enlighten me some more. Apologies for turning this thread into a tutorial. :-)

First tell me if I got the basics right:

- A table consists of groups.

- Each group contains one or more frames.

- The first frame of a group (the primary frame) is in the LK file.

- The other frames of a group are in the OV file.

As for your access times, there are a few things to keep in mind. Each read is a separate record access; therefore, each read requires another hash and another hit on the server. If the records happen to hash to the same group, then you can get the next read from the cache. Otherwise, it's back to the disk.

Tell me more about caching, please:

- Is it groups or frames that are cached?

- For how long can I expect data to remain in the cache?

The latter will of course depend on various factors. E.g.:

- How many caches are there?

- - One per table?

- - Or a fixed or limited number of caches - one for each of the n last tables attached?

- - Or one common cache for all tables?

- Does it matter if a group/frame is accessed often (will it boost its chances of survival if it's accessed several times in a row)?

- Etc.

A lower threshold … gives each record a better chance of being in its own group.

I see. But what if there was no clustering? Then 100% would be better than 50%. Because each file access would get twice as much data. The file system or app would not always need the data in that second half of the frame, but often it would. Am I right?

My tables are filled with data over a period of minutes to months. Then they become static. What if I start with my tables at 50%, and remake them with 100% after they become static?

- Oystein -

Øystein Reigem,

Humanities Information Technologies,

Allégt 27,

N-5007 Bergen,

Norway.

Tel: +47 55 58 32 42.

Fax: +47 55 58 94 70.

[email protected]

Home tel/fax: +47 56 14 06 11.


At 11 OCT 2000 03:51PM [url=http://www.sprezzatura.com]The Sprezzatura Group[/url] wrote:

Tutorials are good things…

Yes, you do understand the basics. That was a very good summation.

Regarding caching, it is frames that are cached, up to 10 at a time, I believe. Data remains in the cache until the next read.

There is an interesting issue with caching and frame sizes. One thing to remember: this problem does not occur with any of the new network products, meaning the NLM and the NPP/NT Service.

Here's some text from a document I started writing more years ago than I wish to count…(Word claims it was created 12 Jan 94!)

Frame Buffers and Cache

Advanced Revelation maintains six internal frame buffers. Each buffer is allocated to the largest frame size for the current session. With the default 1K frame size, the system allocates 6K of memory to the frame buffers. By increasing a single file's frame size to 4K, the total size of the frame buffers increases to 24K (six 4K buffers).

During any read request, the system fills up the frame buffers. If reading from primary frames, the buffers contain only primary frames. If reading from overflow, the buffers only contain overflow frames. When all the frame sizes in a system are identical, each frame buffer contains exactly one frame. When you have one frame that is larger than the rest, the system begins to run into what I call frame buffer overflow problems.

LH will attempt to fill up the buffers, especially on selects and readnexts. Generally, this does not matter much when dealing with primary frames, but it makes a big difference with overflow frames. LH will completely fill up the buffers needed. If the buffer is 4K, the system will load 4K into the buffer. If the file has 1K frames, this means that four frames will be in each buffer.

When the system checks for records, it only looks at the frame size; it does not take the buffer size into account. When it cannot find the record in the first buffer, it jumps to the next buffer. The next buffer does not contain the next frame, however. It contains the 5th frame. LH's error checking comes in, finds what could be a GFE, and re-reads the frames to validate the error. This time, it doesn't find one, so processing continues. This error/re-read loop continues until the record is found.
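
To put numbers on that (just arithmetic over the figures above; the scan model is a simplification):

NUM_BUFFERS = 6

def buffer_pool_kb(largest_frame_kb):
    # every buffer is sized to the largest frame in the session
    return NUM_BUFFERS * largest_frame_kb

print(buffer_pool_kb(1))   # 6  -> all files at the default 1K frames
print(buffer_pool_kb(4))   # 24 -> one file bumped to 4K frames

# Reading a 1K-frame file through 4K buffers packs 4 frames into each buffer.
# A scan that assumes one frame per buffer therefore lands on frames 1, 5, 9...
frames_per_buffer = 4 // 1
landing_frames = [1 + i * frames_per_buffer for i in range(3)]
print(landing_frames)      # [1, 5, 9] -- frames 2-4 and 6-8 trigger the re-reads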

Remember, this was supposedly reworked with the new network products so it should not be an issue if you are using them.

Regarding clustering, there will almost always be clustering. The best way to eliminate clustering is to remake the file or pre-size the file and keep sizelock at 1. The more groups you have, the more chances you have of finding an empty group to hash into.

Another thing to consider, which I'm sure will create more questions, is that on files allowed to grow "naturally", the system has to do extra work to find the older records.

Suppose I have a file with 1,000,000 records. When the first 3 records were created, they went into groups 1, 2 and 4. Now, after 999,997 more have been added, these records might still be in groups 1, 2 and 4. The system needs to continually rehash the record key until it either finds the record or determines there are no more possible groups the record could belong to, at which point it knows the record does not exist.

For example, looking for record 1, the system hashes the record key based on the current 63,000-group file. It determines the record belongs in group 34,351. LH looks at group 34,351 and tries to read the record. If it cannot find the record there, it loops back through the code, replacing the 63,000 with 34,351. This means the system hashes record key 1 with a current modulo of 34,351. This time, the system hashes the record key to group 2,314. Again, the system looks for the record, does not find it, and we repeat the whole process with a modulo of 2,314 until we either have the record or end up in a full repeating loop.

This is the purpose of the group modulo value.
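
A rough sketch of that lookup loop (the hash function and the group numbering are invented stand-ins; real LH hashing differs):

def lh_hash(key):
    # Invented stand-in for the real LH hash function.
    return sum(ord(c) for c in str(key)) * 2654435761

def find_group(key, current_modulo, group_has_key):
    """Rehash with progressively older (smaller) modulos until the key is
    found or the groups start repeating, meaning the record does not exist.
    group_has_key(group) stands in for reading the group and scanning it."""
    modulo, seen = current_modulo, set()
    while True:
        group = lh_hash(key) % modulo + 1     # e.g. modulo 63,000 -> group 34,351
        if group_has_key(group):
            return group                      # the record's current group
        if group in seen:
            return None                       # full repeating loop: not there
        seen.add(group)
        modulo = group                        # retry as if the file were smaller

print(find_group("REC1", 63000, lambda g: False))   # None -- exhausts the chain of modulos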

The Sprezzatura Group

World Leaders in all things RevSoft
