Incorrect counts when stressing Linear Hash & indexing on a Win 2K3 server (AREV Specific)
At 09 FEB 2006 12:50:31PM Michael Slack wrote:
A while ago, one of our remote sites decided to move their AREV 3.12 environment to a 2003 Server. They had a lot of indexing problems. They finally moved everything back to the 2000 Server. Apparently they didn't have the right drivers for the 2003 server.
At our main office, we've set up a test server to make sure we've got everything loaded and set up properly before having the remote sites migrate their AREV environments to a 2003 server. Our test server is running Linear Hash 2.1 and All Network Drivers 2.1.
Because our remote site had indexing problems, part of our testing has been to stress Linear Hash and indexing. We've done this by writing a set of programs: one creates data in a particular table that already has columns and indexes defined, another modifies the data, and a third deletes data. After each of these, we run a program to report the results. All of this is designed so the results are predictable. Each row has a group of indexed and non-indexed columns that are set at creation time and never changed afterwards; these are the control columns. Each row also has a group of indexed and non-indexed columns that get modified during the modification process and are used to pick which rows get deleted during the deletion process. Each group contains a mixture of columns holding only numeric values, only string values, or only string values that are treated like a description.
Our results were good when we were using background indexing and the report program flushed the indexes before collecting the results. We only had a couple of machines connected to our test server at any one time. Then we set up a dedicated indexer, turned off background indexing, and reran our tests.
We're getting some incorrect results now that we've set up the dedicated indexer. The results aren't entirely consistent, and so far I can't explain why. The only change to the testing process was commenting out the index flush command in the report program. But when I ran the exact same report on the exact same data (without any modification) the next day, I still got some incorrect results.
To really stress Linear Hash and indexing, what we've mostly been doing is running each process simultaneously, either in the same application or in two different applications, from the same workstation. For example, I'd start with an empty table, open two windows to the same application, and start the creation program in each as simultaneously as I could. Typically, one instance would create rows 1 thru 5000 and the other rows 10000 thru 14999, so there couldn't be any overlap in the row key values. Then I'd run the report program simultaneously in each application, then the modification program followed by the report, and then a series of deletions, with a report after each one, until there were zero rows in the table.
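Because the control columns never change, the expected count for any control value can be computed from the key ranges alone, before a report is ever run. A minimal Python sketch of that bookkeeping, assuming a hypothetical rule that derives a control column from the row key (the actual test programs' rules aren't shown in this thread):

```python
from collections import Counter

def control_value(key: int) -> str:
    # Hypothetical rule: the control column depends only on the key,
    # so its value is fixed at creation time and easy to reproduce.
    return f"CTL{key % 7}"

def ranges_overlap(key_ranges) -> bool:
    """True if any two inclusive (first, last) key ranges share a key."""
    ordered = sorted(key_ranges)
    return any(prev[1] >= nxt[0] for prev, nxt in zip(ordered, ordered[1:]))

def expected_counts(key_ranges) -> Counter:
    """Expected number of rows per control value for the created key ranges."""
    counts = Counter()
    for first, last in key_ranges:
        for key in range(first, last + 1):
            counts[control_value(key)] += 1
    return counts

# The two simultaneous creation runs from the example above.
ranges = [(1, 5000), (10000, 14999)]
expected = expected_counts(ranges)
print(ranges_overlap(ranges), sum(expected.values()), expected["CTL0"])  # False 10000 1428
```

A report count that differs from `expected` for a control value points at the select/index side rather than at the data, since the data for those columns never changes.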
What I'm getting while the dedicated indexer is on is incorrect row counts on some of the columns. Sometimes the counts are higher and sometimes lower than they should be, and the incorrect counts are not always on the same columns. For example, my last batch of testing used one machine with two windows open to one application at the same time. I ran the creation program and then made sure the indexes were updated on the table I'm using. I ran the two reports (one in each window) and got back some incorrect row counts. Then I closed one window, reran the report, and got all correct numbers. Then I opened a second window into the same application, ran a report in each at the same time, and got incorrect counts again. Even the bad counts didn't match the bad counts from the first set of reports, and which counts came out wrong wasn't consistent either. I then turned the dedicated indexer off, reran the two reports, and still got bad row counts.
We've also tested with each window in a different application and gotten similar results.
Before we set up a dedicated indexer on our test server, we ran the same tests with background indexing turned on. The difference is that the report program flushed all the indexes on the test table before actually starting the report. We got correct and consistent results that way.
I'd like to know why we're getting some incorrect row counts. Is it that we're just stressing the system, or some part of it, to its max and thus getting a little flaky at the edges? Has anyone heard of or experienced anything like this? Does anyone have any suggestions on what I might try?
Thank you for your time.
Michael Slack
At 10 FEB 2006 03:15PM R Johler wrote:
If you are using the GET.RECCOUNT function with the 'force' option false, then on active files the row count in the header can be wrong.
If you use the 'force' option set to true, or run "COUNT tablename" from TCL, during a period with no activity on 'tablename', it should report a correct count (as far as I know).
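One generic way a separately maintained count can drift on an active file is a lost update: two writers read the same count, each adds one, and both write it back. The Python sketch below replays that interleaving step by step. It only illustrates why a cached count and a full scan can disagree; the thread doesn't say how AREV maintains its header count, and the class is hypothetical.

```python
class ToyTable:
    """Hypothetical table: rows plus a separately maintained row count."""

    def __init__(self):
        self.rows = {}
        self.header_count = 0

    def count_from_header(self):
        # Cheap: trusts the cached number (roughly the 'force' false case).
        return self.header_count

    def count_by_scan(self):
        # Expensive: looks at every row (roughly the 'force' true / COUNT case).
        return len(self.rows)


table = ToyTable()
# Two processes each add a row...
table.rows["1"] = "row from process A"
table.rows["10000"] = "row from process B"
# ...and each does an unlocked read-modify-write of the header count.
seen_by_a = table.header_count        # A reads 0
seen_by_b = table.header_count        # B also reads 0 before A writes back
table.header_count = seen_by_a + 1    # A writes 1
table.header_count = seen_by_b + 1    # B writes 1, overwriting A's update

print(table.count_from_header(), table.count_by_scan())  # 1 2
```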
At 13 FEB 2006 10:24AM Michael Slack wrote:
I'm using a CLEARSELECT before each SELECT statement. I then use the resulting @RECCOUNT value to display on the report.
Here is a further example of what I'm seeing. On Friday, I opened two separate applications from my Win XP desktop. I made sure that the tables I was using to create rows in and report on were empty and that the indexes were updated. Apart from the dedicated indexer, I was the only one using the applications at the time. I then created 5000 rows in each table at the same time and ran the report program on each at the same time. One report gave me all correct counts, while the other gave me mostly correct counts: two counts near the start of the report came out as 5460. I double-checked the total number of rows in this table and there were only 5000. It seems typical that the incorrect numbers are near the start of the reporting process. Since one process runs a little faster than the other (because it's in the highlighted window), the two processes are selecting on different sets of data and indexes.
My gut feeling is that there is some sort of bleed-through as the two reports stress my desktop machine, the network, the server, the indexes, Linear Hash, or some combination. Since these results are so unexpected, I'm hoping to find a reason for these oddities that I can account for or correct.
Thanks,
Michael Slack
At 13 FEB 2006 11:04AM The Sprezzatura Group (http://www.sprezzatura.com) wrote:
Are you getting unique station ids?
World leaders in all things RevSoft
At 14 FEB 2006 07:27AM Hippo wrote:
There may be a problem because you use two processes on one station … I don't have experience with that. The other problem may be the dedicated indexer … often the process stops when there is nothing to index, and it does not restart when new pending transactions are created. It is not easy to force the dedicated indexer to keep working (we changed the background process to regularly change an indexed field in a dummy table, but even so it sometimes stops).
I had a similar problem at the last monthly closing with the following sequence of commands:
PERFORM "PDISK file1 (O)"
PERFORM "SELECT table"
PERFORM "SAVELIST keylist"
PERFORM "GETLIST keylist"
PERFORM "LIST table some columns (PS)"; * no BY clause
PERFORM "GETLIST keylist (S)"
PERFORM "PDISK file2 (O)"
PERFORM "LIST table other columns (PS)"; * no BY clause
PERFORM "PDISK PRN"
I ended up with file1 and file2, where one row was missing from file2.
I can see an error in the code … "PDISK file2 (O)" should precede the second "GETLIST keylist" to get the same results. But the row that was missing from file2 had been created several months earlier.
… could it be table resizing (maybe together with background indexing) that moves the row from one group to another, so that the LIST command misses it?
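The mechanism Hippo is asking about can be shown with a toy linear hash file in Python. When a group splits, some of its keys move to a new group at the end of the file, so a scan that is walking the groups at the same time can miss those keys or read them twice, depending on where the scan is and how it decides where the file ends. This is a simplified model with made-up sizes, not AREV's Linear Hash code, and it says nothing about whether AREV's LIST or SELECT actually guards against this. Shrinking the file (as in the deletion tests) moves keys from the end back toward the front, which can likewise make a scan miss them.

```python
class ToyLinearHash:
    """Toy linear hash file: integer keys, groups split one at a time."""

    def __init__(self, base_groups=4):
        self.base = base_groups
        self.level = 0
        self.split_ptr = 0
        self.groups = [[] for _ in range(base_groups)]

    def group_for(self, key):
        # Hash with the current modulus; groups already split use the doubled one.
        g = key % (self.base * 2 ** self.level)
        if g < self.split_ptr:
            g = key % (self.base * 2 ** (self.level + 1))
        return g

    def insert(self, key):
        self.groups[self.group_for(key)].append(key)

    def split(self):
        # Grow by one group: rehash the group at split_ptr, moving some of
        # its keys into the new group appended at the end of the file.
        moving = self.groups[self.split_ptr]
        self.groups[self.split_ptr] = []
        self.groups.append([])
        self.split_ptr += 1
        for key in moving:
            self.insert(key)
        if self.split_ptr == self.base * 2 ** self.level:
            self.level += 1
            self.split_ptr = 0


def build():
    table = ToyLinearHash(base_groups=4)
    for key in range(16):
        table.insert(key)
    table.split()  # groups 0 and 1 split before the scan starts,
    table.split()  # so the next split will be group 2
    return table


def scan(table, split_after_group, fixed_end):
    """Read groups in order while another process splits one group mid-scan.

    fixed_end=True stops at the group count seen when the scan started;
    fixed_end=False keeps reading until the file's current last group.
    """
    seen = []
    end = len(table.groups)
    g = 0
    while g < (end if fixed_end else len(table.groups)):
        seen.extend(table.groups[g])
        if g == split_after_group:
            table.split()
        g += 1
    return seen


# Group 2 splits before the scan reaches it; moved keys land past the old end.
missed = scan(build(), split_after_group=0, fixed_end=True)
# Group 2 splits after the scan read it; moved keys are read again at the end.
doubled = scan(build(), split_after_group=2, fixed_end=False)
print(len(missed), sorted(set(range(16)) - set(missed)))  # 14 [6, 14]
print(len(doubled), sorted(k for k in set(doubled) if doubled.count(k) > 1))  # 18 [6, 14]
```

Both scans start on the same 16 keys; the only difference is where the concurrent split happens relative to the scan, which gives one low count and one high count.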
At 14 FEB 2006 08:02AM Warren Auyong wrote:
Shouldn't you close the PDISK file1 before capturing to file2?
At 14 FEB 2006 10:25AM Hippo wrote:
No, "PDISK file2 (O)" closes the currently open file1 and (re)creates file2. "PDISK PRN" before it is optional.
At 14 FEB 2006 12:39PM Michael Slack wrote:
No. I opened a separate window to a separate application on my workstation and ran "EVAL PRINT @STATION" from TCL. Each one gave me exactly the same station id. I then opened two windows to the exact same application and got the station id for each. Both gave me the same id.
Is there something I can try? Or is there some setting I might need to modify?
Thanks,
Michael Slack
At 14 FEB 2006 01:10PM Michael Slack wrote:
We don't seem to have a problem with our dedicated indexer process stopping. Our live dedicated indexer is very stable. The people who keep an eye on it tell me that at most about once a week they come in in the morning to find that it has stopped, usually on a Thursday. They don't know why, but assume some weekly overnight maintenance process is the cause. They just start it up again.
I must be missing something in reading your snippet of code. I'm not sure what you are trying to show me.
Below, I've pasted the skeleton of the process for our dedicated indexer, in case someone sees anything out of place.
Thanks,
Michael Slack
=======
INDEXER.BAT is run on our dedicated indexer machine. It starts, in turn, each of the applications we want the dedicated indexer to work on, and keeps cycling until someone turns it off or something shuts it down. The one working on our live applications is very stable: I'm told it is found stopped only about once a week, after which it's restarted and checked once or twice a day to make sure it's still running.
=======
REM @ECHO OFF
CLS
rem INDEXER.BAT
rem Starts MAPCON Dedicated Indexer Cycle
ECHO RUNNING AREV INDEXER
n:
:CYCLE
CLS
ECHO RUNNING MAPCON INDEXER ..
rem Each AREV session flushes one application's indexes and then logs off.
rem IF ERRORLEVEL 1 is true for any exit code of 1 or higher; stop cycling then.
AREV IDXMCM /X
IF ERRORLEVEL 1 GOTO END
delay
AREV IDXPLE /X
IF ERRORLEVEL 1 GOTO END
delay
AREV IDXPTS /X
IF ERRORLEVEL 1 GOTO END
delay
AREV IDXCTS /X
IF ERRORLEVEL 1 GOTO END
rem All four applications finished without error: start the next pass.
GOTO CYCLE
:END
=======
VOC IDXMCM is called at login to the indexer account in one of our applications. It in turn starts a program that flushes the indexes within that application.
=======
TCL
*TSR_CHECK
INDEX_FLUSHER
=======
INDEX_FLUSHER is the program that actually does the flushing for the application it's in. Once done, it logs out of the application, which gives control back to the BAT file, which then goes on to the next application.
=======
* FLUSHES SPINDEX AND BTREE INDEXES - RUN FROM MAPINDEX.LOGON
*
DECLARE FUNCTION GET.RECCOUNT
DECLARE SUBROUTINE INDEX.FLUSH
* FLUSH SPINDEX INDEXES **
IMG = ''
* Show a status message while the indexes are flushed
CALL MSG('²FLUSHING BTREE INDEXES²IN THE ':@ACCOUNT:' ACCOUNT²','UB',IMG,'')
CALL INDEX_FLUSH_ASA("","")
* Remove the status message
CALL MSG("",'DB',IMG,'')
* Log off so control returns to INDEXER.BAT
PERFORM "OFF"
END
At 14 FEB 2006 02:32PM The Sprezzatura Group (http://www.sprezzatura.com) wrote:
Weird - isn't it qualified with a *InstanceNumber?
At 14 FEB 2006 06:52PM Matt Sorrell wrote:
What is the client OS?
I have seen cases on Win2k/WinXP desktops where *InstanceNumber (arev.pid?) is not incremented correctly.
In fact, I believe there have been several discussions on the forum regarding this.
msorrel@greyhound.com
At 14 FEB 2006 07:07PM Hippo wrote:
OK, the dedicated indexer in your setup is a batch process that invokes AREV routines to flush indexes.
In our case, with a single AREV application, it is just a station running AREV with background indexing.
What I wanted to show/ask with my code example is whether the LIST command can miss some records while the table is resizing. I suppose the table was resizing during your testing …
At 15 FEB 2006 09:53AM Michael Slack wrote:
Sorry, I wasn't clear. From your question, I assume the station id is a multi-part value. The value @STATION gave me in each case was 0*(my desktop computer name). I assume the number to the left of the "*" should be different for each window I open into AREV.
Thanks,
Michael Slack
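If each session writes its @STATION value somewhere at the start of a test run (for example, into the report header), the values from concurrent windows can be checked for clashes afterwards. A minimal Python sketch, assuming the "instance*computer name" layout described above; the computer name and window labels are made up:

```python
from collections import defaultdict

def parse_station(station):
    """Split a station id such as '0*DESKTOP01' into (instance, computer name)."""
    instance, _, computer = station.partition("*")
    return instance, computer

def duplicate_stations(sessions):
    """Return {station id: [sessions]} for ids reported by more than one session."""
    by_station = defaultdict(list)
    for session, station in sessions.items():
        by_station[station].append(session)
    return {station: names for station, names in by_station.items() if len(names) > 1}

# Hypothetical log: two windows on one desktop both reported instance 0.
logged = {"window 1": "0*DESKTOP01", "window 2": "0*DESKTOP01"}
print(duplicate_stations(logged))     # {'0*DESKTOP01': ['window 1', 'window 2']}
print(parse_station("0*DESKTOP01"))   # ('0', 'DESKTOP01')
```

Any non-empty result means two sessions ran under the same station id, which is the condition the replies here are trying to rule out.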
At 15 FEB 2006 10:03AM Michael Slack wrote:
Where we are having problems is with AREV 3.12 on a 2003 Server. The server is loaded with Linear Hash 2.1 and All Network Drivers 2.1. My desktop workstation is running Win XP (MS Windows Ver. 5.1, Service Pack 2).
The @STATION id was the same in all cases I checked, that is, the whole id string (both sides of the "*").
Thanks for the tip. I'll try to find those discussions you mentioned; hopefully they'll point me in the right direction.
Thanks,
Michael Slack
At 15 FEB 2006 11:15AM Victor Engel wrote:
Is there a file called AREVPID.DAT in the location where the arev.exe is located? And do you have write access to this file?
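Both questions can be checked from the workstation itself. A minimal Python sketch; the directory argument is a placeholder for wherever AREV.EXE lives. It tests writability by actually opening the file (or creating and discarding a temporary file in the directory) rather than trusting permission bits, which can be misleading on network drives. It does not create AREVPID.DAT.

```python
import os
import tempfile

def probe_pid_file(arev_dir, name="AREVPID.DAT"):
    """Report whether the PID file exists in arev_dir and whether it can be written."""
    path = os.path.join(arev_dir, name)
    exists = os.path.isfile(path)
    try:
        if exists:
            # Append mode writes nothing here, but fails without write access.
            with open(path, "ab"):
                pass
        else:
            # No PID file: check whether new files can be created in the directory.
            with tempfile.TemporaryFile(dir=arev_dir):
                pass
        writable = True
    except OSError:
        writable = False
    return {"path": path, "exists": exists, "writable": writable}

print(probe_pid_file(r"N:\AREV"))  # hypothetical AREV directory
```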
At 15 FEB 2006 11:40AM Michael Slack wrote:
Your point is well taken: if a Select statement and an index flush collide, there might be problems. Our problems were both low and high counts. I can see a low count when some rows are missed. As for high counts, while testing from TCL to try to reproduce the problem, I made the mistake of not specifying a range of IDs, so I didn't duplicate the report's Select statement exactly. This gave me a number I had seen on one of my reports. It suggests that in such a collision part of the Select statement may be dropped or ignored, giving an incorrect count.
This raises the question (assuming this is the root of our problem): is there any way to prevent these types of collisions? Locking tables? Requiring whichever process gets there second to pause until the first process is done? The only other thing I can think of offhand is to go through the AREV configuration settings to see if there is something there that may help.
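The high-count case can be illustrated with plain arithmetic: when one table holds both instances' rows (1 thru 5000 and 10000 thru 14999, as in the original test setup), a selection that loses its key-range condition counts matching rows from both ranges. A Python sketch with a hypothetical column value:

```python
# Hypothetical table: key -> value of one column, for both instances' key ranges.
keys = list(range(1, 5001)) + list(range(10000, 15000))
rows = {key: ("A" if key % 3 == 0 else "B") for key in keys}

def select_count(rows, value, key_range=None):
    """Count rows with the given column value, optionally within an inclusive key range."""
    low, high = key_range if key_range else (min(rows), max(rows))
    return sum(1 for key, col in rows.items() if col == value and low <= key <= high)

print(select_count(rows, "A", key_range=(1, 5000)))  # 1666: this instance's rows
print(select_count(rows, "A"))                        # 3332: range dropped, both instances
```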
Thanks,
Michael Slack
At 15 FEB 2006 04:55PM Matt Sorrell wrote:
Michael,
There is an option (buried somewhere) to update indexes on query. This forces a flush of any pending transactions, but can slow the system down.
We have it disabled in our system, and in the few instances when we need to be absolutely sure the indexes are up-to-date we explicitly call Index.Flush().
I have a feeling this is turned off in your environment as well, and that turning it on might eliminate the problem. Keep in mind, however, that with this turned on any selects are at the mercy of the dedicated indexer. If it has an index locked, then the query will wait until it is unlocked so it can call a flush.
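The trade-off Matt describes can be modeled with a queue of pending index transactions and a lock held while they are applied. This is a simplified Python sketch of the idea, not AREV's implementation; the class and method names are hypothetical. A query that flushes first sees every queued write but may have to wait for the lock; a query that doesn't flush is fast but may read a stale index, which would show up as low counts.

```python
import threading

class ToyIndex:
    """Hypothetical index with pending transactions applied by a flush."""

    def __init__(self):
        self.lock = threading.Lock()  # held by whoever is applying transactions
        self.pending = []             # (key, value) updates not yet in the index
        self.entries = {}             # value -> set of keys

    def write_row(self, key, value):
        # Writers only queue a transaction; the index is updated later.
        self.pending.append((key, value))

    def flush(self):
        with self.lock:               # blocks while another thread holds the lock
            while self.pending:
                key, value = self.pending.pop(0)
                self.entries.setdefault(value, set()).add(key)

    def select_count(self, value, update_on_query):
        if update_on_query:
            self.flush()              # slower, but includes every queued write
        # No lock here: without a flush this may read a stale or half-flushed index.
        return len(self.entries.get(value, ()))


index = ToyIndex()
for key in range(1, 6):
    index.write_row(key, "OPEN")
print(index.select_count("OPEN", update_on_query=False))  # 0: index not flushed yet
print(index.select_count("OPEN", update_on_query=True))   # 5
```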
At 16 FEB 2006 06:41AM Hippo wrote:
Did you look at/log only the row counts?
Logging the list of IDs can help find the problem (for both high and low counts) … with about 10,000 test records, this shouldn't be much of a slowdown.
I hope the following questions help in locating the problem …
Are the missing/doubled IDs in a small number of "bunches" with similar index field values? … What were the sizes of the .LK files of table and !table at the start and end of the report?
Backing up table and !table at the start and end of the report may help, too.
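Logging IDs rather than just counts can be turned into a short check: compare the logged IDs against the expected key ranges, then group missing or doubled keys into contiguous runs to see whether they come in "bunches." A Python sketch, assuming integer keys and a list of IDs written out by the report; the sample log is made up:

```python
from collections import Counter

def runs(keys):
    """Collapse integer keys into sorted (first, last) runs of consecutive values."""
    out = []
    for key in sorted(keys):
        if out and key == out[-1][1] + 1:
            out[-1] = (out[-1][0], key)
        else:
            out.append((key, key))
    return out

def diff_ids(expected_ranges, logged_ids):
    """Return (missing, doubled, unexpected) keys, each as runs."""
    expected = set()
    for first, last in expected_ranges:
        expected.update(range(first, last + 1))
    seen = Counter(logged_ids)
    missing = expected - set(seen)
    doubled = {key for key, n in seen.items() if n > 1}
    unexpected = set(seen) - expected
    return runs(missing), runs(doubled), runs(unexpected)

# Hypothetical log: keys 40-42 never reported, key 7 reported twice.
log = [k for k in range(1, 101) if k not in (40, 41, 42)] + [7]
print(diff_ids([(1, 100)], log))  # ([(40, 42)], [(7, 7)], [])
```

A few long runs would fit Hippo's "bunches" idea (whole groups moved or skipped); scattered single keys would point elsewhere.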
At 17 FEB 2006 07:17PM Michael Slack wrote:
No, there isn't an AREVPID.DAT (or anything similar) at the same level as AREV.EXE. I tried creating it and even initializing it with 0, but it didn't help. I've read through all the discussions I could find on it, but I'm not much the wiser. A couple of them sounded as if they were having the same problem I'm experiencing.
Any suggestions on how I might go about correcting this?
Thanks,
Michael Slack