Indexing the XREF index (AREV Specific)
At 04 AUG 2000 01:14:28PM Wilhelm Schmitt wrote:
In order to improve lookups in a large data file (and its corresponding large XREF index), we want to create a secondary file where each indexed word is stored once, together with the updated record counter for that word.
The data structure for the SECONDARY_FILE is:
KEY: WORD as found in xref_index
FLD1: RECCOUNT of that word
FLD2: LAST_UPDATE
Then we would put a BTree index on that SECONDARY_FILE,
to speed up complex search options on the original data file.
(Something similar to search engines found at most internet sites).
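A minimal sketch of the proposed SECONDARY_FILE layout, written in Python rather than R/BASIC purely for illustration. It assumes the XREF index can be viewed as a mapping from each word to the list of record keys that contain it. `build_word_list` and the sample data are hypothetical, not AREV routines.

```python
from datetime import datetime, timezone


def build_word_list(xref_index, now=None):
    """Return {word: (reccount, last_update)} from a {word: [record keys]} mapping.

    Mirrors the proposed layout: KEY = word, FLD1 = RECCOUNT, FLD2 = LAST_UPDATE.
    """
    stamp = now or datetime.now(timezone.utc).isoformat(timespec="seconds")
    # Count distinct record keys so a key listed twice is only counted once.
    return {word: (len(set(keys)), stamp) for word, keys in xref_index.items()}


# Hypothetical sample data standing in for the real XREF index.
xref = {"SMITH": ["1", "7", "9"], "SCHMITT": ["2"], "WILHELM": ["2", "5"]}
word_list = build_word_list(xref, now="2000-08-04T13:14:28")
```

This only covers the initial load; keeping the counts current while the primary index keeps changing is the harder part of the question.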
How can the secondary file be updated accurately when the primary index (XREF) is changing continuously?
Thanks in advance for suggestions.
Wilhelm
At 05 AUG 2000 03:28AM Warren wrote:
The only way I could think of would be to use an MFS. It would be a bit tricky to keep it in sync, though. You could create a program to do the initial counts, and the MFS would increment or decrement the counter on adds/deletions/changes.
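The increment/decrement idea can be sketched as follows, in Python for illustration only. This is not an MFS and does not use any AREV call: `words_of` is a hypothetical stand-in for XREF's word splitting (no stop list, no delimiters setup), and `apply_write` is an assumed name for the hook that would see each record's old and new contents.

```python
import re


def words_of(text):
    """Distinct upper-cased words in a record (assumed splitting rule)."""
    return set(re.findall(r"[A-Z0-9]+", text.upper()))


def apply_write(counts, old_text, new_text):
    """Adjust per-word record counts for one add, change, or delete.

    old_text is None for a new record; new_text is None for a deletion.
    """
    old_words = words_of(old_text) if old_text is not None else set()
    new_words = words_of(new_text) if new_text is not None else set()
    # Words that appear in the record for the first time gain one record.
    for word in new_words - old_words:
        counts[word] = counts.get(word, 0) + 1
    # Words that disappeared from the record lose one record.
    for word in old_words - new_words:
        remaining = counts.get(word, 0) - 1
        if remaining > 0:
            counts[word] = remaining
        else:
            counts.pop(word, None)  # drop words that no record contains
    return counts


counts = {}
apply_write(counts, None, "Wilhelm Schmitt")           # add
apply_write(counts, None, "William Smith")             # add
apply_write(counts, "William Smith", "William Schmitt")  # change
apply_write(counts, "Wilhelm Schmitt", None)           # delete
```

Only the difference between the old and new word sets touches the counters, so a change that leaves a word in place costs nothing for that word.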
You could always do some treewalking on the BTREE records in the !file to get the counts. This too would be tricky, because it would require a thorough knowledge of the record structures in the !file, but it would potentially be faster than scanning all the records in the datafile to get the initial counts.
At 05 AUG 2000 11:54PM Wilhelm Schmitt wrote:
Warren,
I think we will direct our efforts towards some sort of MFS, placed just before SI.MFS. This should give us a relatively accurate update, although there is a timing problem: a slight chance that the dedicated index station has not yet processed the pending transactions when a user runs into a not-yet-indexed word.
Several months ago, I posted a message on this forum trying to find out how to walk quickly through the !index file with collect.ixvals(). Nobody answered, but I found some basic reference on the !index record structure in TB#101 on btree.read (available at http://www.revelation.com/WEBSITE/knowledge.nsf/89e60900cf7ebbc8852566f500654ecb/922322b08a81d2ff852563e200499a00?OpenDocument).
We require however a much quicker and more transparent access method - and the only thing we could think of is a separate word-list with a Btree-index, acting as the first query filter.
=> This is our problem:
We do index-lookups in a web-application with OpenInsight's inet_rlist (which, by the way, is much better documented in AREV's RLIST than in OI). On small datafiles this works great and requires almost no programming. The catch is: … our datafile has some 700,000 records with variable text and the !index.file size alone is over 200MB.
With the wrong search word (not yet updated) or with "substring"-type words, OI starts hanging. It seems to be walking through the index from top to bottom, and there is no way to trap the error and stop the search (except manually!). Something similar happens when the client asks for a frequent word. OI does not give up until the last key is rendered!
You can imagine the user's reaction and the consequent server problem, when several other update requests (for exactly the same query) arrive from the client's browser, because of the apparent inactivity. This is worse than a distributed denial-of-service attack!!!
=> These are our goals:
*) We first want to evaluate the search words. If, for example, the user picks a word with 10,000 hits, we want to send him a message first, where he can choose between refining the search until the selection yields fewer than 200 keys, or having the list processed "off-line" and, once it has finished, receiving a mail message that lets him pick up any of the 50 partial 200-record lists. Then OICGI would come into the picture and process the small lists easily and quickly. AREV will be in charge of the raw processing (this is because we do not feel comfortable with OI's strait-jacket).
*) Apart from the normal complete-word searches our web-engine has to resolve FAST and flexible queries, like these:
a) WIL] => should return WILHELM, WILLIAM, WILLIAMSBURG
b) S*MI* => should find SCHMITT, SMITH, SMITHSONIAN
c) ATION => should find REVELATION, TRANSLATIONS, RATIONAL
d) Soundex type searches
*) Of course, we want to put the same code to work from within our internal operating environment: an AREV-based WAN.
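The first goal (check the hit count before running the query) might look like the sketch below, in Python for illustration. The 200-key page size comes from the post; the function names, the returned labels, and the choice to let a result of exactly 200 keys run immediately are assumptions.

```python
def plan_query(hit_count, page_size=200):
    """Decide how to handle a query from its pre-computed hit count.

    Returns (action, number_of_lists).
    """
    if hit_count == 0:
        return ("no_hits", 0)
    if hit_count <= page_size:
        return ("run_now", 1)
    pages = -(-hit_count // page_size)  # ceiling division
    return ("refine_or_offline", pages)


def split_keys(keys, page_size=200):
    """Cut a long key list into consecutive partial lists for off-line pickup."""
    return [keys[i:i + page_size] for i in range(0, len(keys), page_size)]


# The example from the post: a word with 10,000 hits.
action, lists_needed = plan_query(10_000)
partial_lists = split_keys(list(range(10_000)))
```

Because the decision only needs the stored RECCOUNT, it can be made without touching the large primary index at all.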
We feel that Revelation's variable-length fields and the linear-hash structure have the power to accomplish that.
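The four query types listed above can be tried out against a small in-memory word list. The sketch below is Python, for illustration only: it does not model the Btree index or any AREV routine, all function names are hypothetical, and only the prefix search takes advantage of the sorted order (the others scan the whole word list, which is still far smaller than the primary index). The Soundex variant is the common American four-character code, an assumption since the post does not say which variant is wanted.

```python
import bisect
from fnmatch import fnmatchcase

_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def prefix_search(sorted_words, prefix):
    """a) WIL] : binary search for the range of words starting with prefix."""
    lo = bisect.bisect_left(sorted_words, prefix)
    hi = bisect.bisect_left(sorted_words, prefix + "\uffff")
    return sorted_words[lo:hi]


def wildcard_search(sorted_words, pattern):
    """b) S*MI* : shell-style wildcard match over every word."""
    return [w for w in sorted_words if fnmatchcase(w, pattern)]


def contains_search(sorted_words, fragment):
    """c) ATION : substring match over every word."""
    return [w for w in sorted_words if fragment in w]


def soundex(word):
    """American Soundex: first letter plus three digits, zero padded."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    result = word[0]
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":  # H and W do not separate equal codes
            prev = code
    return (result + "000")[:4]


def soundex_search(sorted_words, word):
    """d) Words sharing the Soundex code of the search word."""
    target = soundex(word)
    return [w for w in sorted_words if soundex(w) == target]


words = sorted(["WILHELM", "WILLIAM", "WILLIAMSBURG", "SCHMITT", "SMITH",
                "SMITHSONIAN", "REVELATION", "TRANSLATIONS", "RATIONAL"])
```

In a real word list, the codes for d) would more likely be stored and indexed alongside each word than recomputed on every query.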
=> Our current to-do list
*) Define the structure of the secondary word list
*) Define the update methods of the word list
*) Define the search options and methods
*) Chaining access (through word list and primary index) to render the final key list(s).
Did we forget something important?
Regards
Wilhelm
At 06 AUG 2000 12:21AM Warren wrote:
FWIW:
Not that it helps, but somehow Ceridian's HR-1 "begins with" searches (e.g. WIL]) ran exponentially faster in Window queries (i.e. using the slash key) than they did under plain ARev Windows (HR-1 ARev version 2.03 vs RTI's ARev 2.12). Even though WHO in HR-1 (ver 5.x? I don't recall) said 2.03, it did not use the !indexing file.
Whatever tweak Ceridian did to the 'begins with' search never seemed to make it into RTI's ARev versions.
Best wishes with your design efforts.
At 07 AUG 2000 12:48AM Richard Bright wrote:
As you are aware, the native secondary index doesn't get updated as a consequence of changes to the source, because it has no way of knowing that the source has changed. However, there is an undocumented procedure that you can easily add to the Dict of the secondary value: establish a dependency pointer in the MFS by using dict field 21, inserting it in the format TABLE*COLUMN. For more information buy REVMEDIA from Sprezzatura and look at Vol 2 issue 4.
Richard Bright
At 07 AUG 2000 04:40AM The Sprezzatura Group (http://www.sprezzatura.com) wrote:
Buy REVMEDIA? Great idea … or view it free on line here
World Leaders in all things RevSoft
At 09 AUG 2000 04:13PM Wilhelm Schmitt wrote:
Richard,
From what I see your tip is applicable to relational indexes.
We no longer use them in our applications, and we cannot have a 64K barrier either.
Thanks for your suggestion, anyway.
At 11 AUG 2000 03:45AM Warren wrote:
If you have the REVSRC archive (should be on the 2.1x-3.1x discs somewhere) in BTREE_SOURCE there is a routine COUNT.IXVALS that might do what you want.