High availability systems (OpenInsight 32-Bit)
At 07 APR 2003 12:59:10AM Paul Rowe wrote:
We'd like to improve the availability of our system in large sites, and there are a couple of areas which may have been tackled before:
a. Conflict between a system backup (tape backup or otherwise) and files in live use.
Logging the users out of a system each night to back it up can be a pain if the system is to remain live at the same time. One option would be to freeze the system and keep a transaction log of updates to apply once the backup is completed, but this is quite complex to maintain (and more complex still if reads need up-to-date data and therefore have to consult the log).
Has anyone else found other solutions to this problem?
b. Failover protection for the Revelation service. Is it possible to have a second copy of the service running on a second server, with the data stored on a Storage Area Network? The principle would be to send requests through one service and server to the Storage Area Network, unless the server or service went down (in which case requests would get rerouted through the backup server). Possibly this is impossible, but we're curious as to whether anyone else has implemented any kind of failover protection for the server.
Thanks,
Paul
At 07 APR 2003 06:22AM Oystein Reigem wrote:
Paul,
a.
I have no experience with backup systems, and backups of live systems. I just reply out of habit.
I assume there can be two different problems with doing backups from a live database. (1) Since the individual files of the database are backed up at different times, or while transactions are unfinished, the backed-up database files may have internal inconsistencies. (2) The app might not be able to access tables while the files are backed up.
With the kind of apps that I make I think I would do the backup not on the original data but on some form of copy of the data. I assume it's easy to tell a backup system to stay away from particular folders, so problem (2) could be avoided.
How up-to-date must the backup be? Let's say there is a particular time of day when users are only occasionally logged in, but unfortunately a different time than when the backup runs. (I assume the backup also runs at a particular time of day.) Could you have a process that kicks in during that silent period and makes a copy of the database to a different folder? That process could check if any users were logged in and ask them to log out.
Also, there might be a possibility you could make that copying process robust enough to make a consistent copy of the database even if users are logged in. Perhaps something like a time stamp on your records could help that process decide which records to copy, as in the sketch below.
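Something along these lines, perhaps - a minimal Python sketch, with made-up folder names, using file modification times as a crude stand-in for per-record time stamps:

```python
import os
import shutil
import time

# Hypothetical locations - substitute the real database folder layout.
LIVE_DIR = r"C:\revdata\live"
COPY_DIR = r"C:\revdata\backup_copy"
STAMP_FILE = os.path.join(COPY_DIR, "last_copy.stamp")

def last_copy_time():
    """Return when the previous copy ran, or 0 if it never has."""
    try:
        with open(STAMP_FILE) as f:
            return float(f.read().strip())
    except (OSError, ValueError):
        return 0.0

def copy_changed_files():
    """Copy only files modified since the last run."""
    os.makedirs(COPY_DIR, exist_ok=True)
    since = last_copy_time()
    started = time.time()
    for name in os.listdir(LIVE_DIR):
        src = os.path.join(LIVE_DIR, name)
        if os.path.isfile(src) and os.path.getmtime(src) > since:
            shutil.copy2(src, os.path.join(COPY_DIR, name))
    with open(STAMP_FILE, "w") as f:
        f.write(str(started))

if __name__ == "__main__":
    copy_changed_files()
```

The backup software would then be pointed at the copy folder only, and never touch the live files.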
My suggestion, or some workable scheme of your own, might result in a backup that's consistent but a bit older than desired. What, then, about completing the backup with a sort of transaction log? You mention a scheme with a transaction log, with the app keeping transactions on hold, waiting for the backup to finish. What I have in mind is something different, with the app doing all transactions but also logging them to external files. Those files could hopefully be picked up by the backup process, or might be among those surviving a crash. After restoring the database from backup, the latest transactions could be run to bring the database up to date.
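To make that concrete, here is a little sketch (Python just for illustration - the log path and record format are placeholders, and the real app would do this from its own update code):

```python
import json
import time

# Hypothetical log location - ideally on a different machine or share,
# so the log survives a crash of the main server.
LOG_PATH = r"\\backuphost\logs\transactions.log"

def log_transaction(table, record_id, record):
    """Append one write to the external transaction log, alongside
    the real database write."""
    entry = {"ts": time.time(), "table": table,
             "id": record_id, "record": record}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

After a restore, replaying the entries newer than the backup would bring the database back up to date.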
- Oystein -
At 07 APR 2003 10:40AM Richard Hunt wrote:
I have experience with backups - experience with actually needing the backup to repair damaged files. If users are on the system while the backup runs, you will have to sacrifice having a "complete" backup.
Here is what my customers and I have found. If the system needs to be accessed all the time, then you will need backup software that only copies the files, so as not to affect sharing. Also, pick a time of day when the system is least likely to have users logged in (or the fewest).
You must be clear, though, that the actual backup can have problems since there are users on the system. With linear hashed files it is possible that a file is resizing during your backup. If that happens (most unlikely), you might have a file that is unreadable (again, very unlikely). Also, if your software integrates data, some of the data might be in the middle of the integration process and exist only partially.
As an example, one of my customers does an automated backup at 2:00am every day. The actual backup process is an FTP (binary) from the server to a PC (Unix to Windows or Windows to Windows). It stores seven days of backup history. Normally in the morning, a user is then supposed to copy the backup from the PC to a rewritable CD (I think this is happening only once a week). That CD is then taken with the bank deposit and put in a safety deposit box.
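In outline, such a nightly job could look like this - a Python sketch with made-up host names and credentials; the real job is simply a scripted binary FTP:

```python
import os
from datetime import datetime
from ftplib import FTP, error_perm

# Hypothetical details - substitute the real server, login and paths.
SOURCE_DIR = "/usr/revdata"            # folder holding the data files
FTP_HOST = "backup-pc.example.com"

def nightly_backup():
    """Upload every data file in binary mode into a folder named after
    the weekday, so seven days of history rotate automatically."""
    folder = datetime.now().strftime("%a")      # Mon, Tue, ...
    ftp = FTP(FTP_HOST)
    ftp.login("backup", "secret")
    try:
        ftp.mkd(folder)
    except error_perm:
        pass                                     # folder already exists
    ftp.cwd(folder)
    for name in os.listdir(SOURCE_DIR):
        path = os.path.join(SOURCE_DIR, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                ftp.storbinary("STOR " + name, f)
    ftp.quit()

if __name__ == "__main__":
    nightly_backup()
```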
The off site copy of the backup is so that if there is a fire or theft, the company has a backup that is safe.
Just want to note that using an automated backup means that you must have a user verify that the backup is actually happening. And I strongly suggest that you do not do incremental backups.
I am very wary of tape backups. They seem unreliable: the tape stretches, and the tape device needs to be cleaned often. I choose to copy the data to another computer instead.
At 07 APR 2003 07:27PM Paul Rowe wrote:
Thanks for the feedback.
Yes, the two problems are the internal integrity of the data and the ability to access the files while the backup is processing.
The backup can be on data that is a few hours old, as by the morning the system will be newer by a few hours anyway.
One problem with backing up a copy is that we'd need to verify that the copy was successful - if a file was in use, the copy would probably fail for that file. Backups can easily exclude a particular directory, so it would be possible to tell the backup software to process just the copy.
Backup software which can share the files, rather than lock them, is another option, although as Richard says the integrity of the backup copy is at risk. The largest file would be around 200MB, which would take a few minutes to back up - potentially the file could be changed while in use, but this is unlikely.
Gives us some things to think about anyway.
Cheers,
Paul
At 07 APR 2003 09:29PM Paul Rule wrote:
I've been trying to solve this age-old problem for years now.
The one thing I do believe is that if you're going to back up data with users logged in, then that backup will be suspect. If you know up front that your backup may be suspect, then why bother even backing up?
The most recent idea I came up with is that you need to design your entire system to ALLOW it to kick off users so you can back up. It has to be done in such a way as to log them out nicely - such as at the end of the current process they are running - so as not to corrupt anything. The monitoring process could just be run from a TIMER event.
e.g.: It's 3am; start the check-and-kick-off routine. When all users are out, then start the backup process.
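In sketch form it might look like this (Python purely for illustration; the hooks into the user/lock tables are hypothetical):

```python
import time

def logged_in_users():
    """Hypothetical hook: return the users still logged in, e.g. by
    reading the app's lock/user table."""
    return []

def request_logout(user):
    """Hypothetical hook: flag the workstation so it logs itself out
    at the end of its current process."""
    print("asking", user, "to log out")

def run_backup():
    """Hypothetical hook: launch the actual copy/backup job."""
    print("backup started")

def kick_off_and_backup(poll_seconds=30, give_up_after=3600):
    """The 3am routine: ask everyone to log out nicely, then wait
    until the system is empty before backing up."""
    deadline = time.time() + give_up_after
    for user in logged_in_users():
        request_logout(user)
    while logged_in_users():
        if time.time() > deadline:
            raise RuntimeError("users still logged in - backup skipped")
        time.sleep(poll_seconds)
    run_backup()
```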
At 08 APR 2003 05:47PM Paul Rowe wrote:
That could work. The only problems would be if someone left their computer on in the middle of something (pretty unlikely) or if there was a crash leaving file locks behind.
We could probably live with both of those.
Paul
At 09 APR 2003 11:12AM S Botes wrote:
Paul,
Years ago, circa 1979, American Airlines solved this problem by taking portions of the database offline long enough to copy them and then making them available again. I was not directly involved, but remember that that was how they addressed backing up a 24/7 database. Downtime for them was measured at $150,000 per minute in lost revenue in 1979, so keeping things running was a major effort. They basically disabled some functions for the amount of time that it took to copy that section. It worked quite well. Of course your environment may or may not fit, but it's just a thought…
At 09 APR 2003 04:33PM David Kafka wrote:
There has got to be a solution for this out there. There must be some sort of "hot-swappable" mirrored system where you could temporarily disconnect one of the mirrors, back it up, and then when you reconnect it, it automatically resynchronizes.
At 09 APR 2003 06:43PM Paul Rule wrote:
Yes, the big problem is if the system is in the middle of a long process. That's where the idea of designing the entire system to check for this sort of thing comes in. This is only at the idea stage at the moment, but I was thinking of something along the lines of a tracking system, where every process that runs makes a "request for processing". If allowed, it sets a flag saying "process in progress", and when that process is finished it turns that flag off. When it's time to back up, the system denies requests for processing, so when a process finishes a new one cannot start. The timer event also checks that each workstation has no process in progress and logs it out. When all stations are out, the backup kicks off.
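A minimal sketch of that tracking idea (Python, purely illustrative - the real thing would live inside the app itself):

```python
import threading

class ProcessGate:
    """Every job asks permission before starting; the backup closes
    the gate, waits for in-progress jobs to finish, then runs alone."""

    def __init__(self):
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._open = True
        self._active = 0

    def request_processing(self):
        """Called at the start of every process; True means it may
        run, and it is now flagged as 'process in progress'."""
        with self._lock:
            if not self._open:
                return False        # backup pending - try again later
            self._active += 1
            return True

    def finish_processing(self):
        """Called at the end of every process - clears its flag."""
        with self._lock:
            self._active -= 1
            if self._active == 0:
                self._idle.notify_all()

    def close_for_backup(self):
        """Deny new requests, then block until every flag is off."""
        with self._lock:
            self._open = False
            while self._active > 0:
                self._idle.wait()

    def reopen(self):
        """Let normal work resume once the backup is done."""
        with self._lock:
            self._open = True
```

close_for_backup() returns only once every "process in progress" flag is off, at which point the backup can safely run; reopen() lets work resume.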
At 09 APR 2003 08:12PM Bob Watson wrote:
We have an AREV system which runs an emergency medical callout room 24/7. It cannot come down. We use xcopy to a local drive, then back that up. This quite frequently causes problems with locked index files. It's not a good solution.
The real solution, as suggested, is mirrored servers - one of which can be stopped for backup and then resynchronised before resuming simultaneous reading/writing. There may be a solution out there these days that does this. The reason we haven't investigated it is political (a takeover of the company). If I had my way I'd be looking at this possible solution.
Bob Watson
At 10 APR 2003 06:47AM Oystein Reigem wrote:
Bob et al,
Excuse me - an ignorant outsider to backup issues - for still taking an interest.
What's most important in a database system when restoring it after a crash? Of course in most cases it's vital to get the program itself back and running. But what about the database? How up-to-date does the data need to be? Having a healthy database one or two days old might be better than having a newer database with possible inconsistencies. That might even be the case for an emergency medical callout system, although I must admit I don't know the term and therefore don't know exactly what such a system does.
And what if the system can't be brought back on the air fast enough? For a system that must be running 24/7, I would think there should be some kind of… uh… backup system, i.e. a different system that can be used until the main system is up again. Such a system could be manual or electronic.
A manual system could consist of recent printouts of vital data, and paper forms for filling in new data, to be keyed in once the main system is up and running.
A simple electronic system could be sort of off-line, based on computers that would not be affected by a crash on the main system, with recent log files or report files from the main system, and some program for keying in new data. The data entry program could be anything from a word processor or spreadsheet to a copy of the main app program. There should also be some functionality in the main system to import the external data once it's up and running.
A more advanced electronic system could be a full copy of the main system, located such that it would not be affected by a crash on the main system. That backup copy of the system might have a database updated at intervals from the main system backups. So the backup system would only be as fresh as the latest backup, but at least it would be up with full functionality fairly fast.
Finally - there might be ways to keep data from the time between the latest backup and the crash from getting lost. If every transaction was accompanied by logging the relevant data to a file, that file could contain enough information to bring in the latest changes. (That file must be on another computer than the main system's server.) This strategy would of course mean extra programming - both for continuous logging and for import/update. But even without the latter, logging could be of value, since the log could be consulted manually.
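The import/update side might look something like this - a sketch to match the logging sketch earlier in the thread, where write_record is a hypothetical hook into the app's own update code:

```python
import json

def replay_log(log_path, write_record, since=0.0):
    """After restoring from the latest backup, re-apply every logged
    transaction newer than the backup, in order."""
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            if entry["ts"] > since:   # skip work already in the backup
                write_record(entry["table"], entry["id"], entry["record"])
```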
- Oystein -
At 10 APR 2003 08:51AM Bob Watson wrote:
Oystein
This place has 15,000 clients, mostly senior citizens, who have pendants round their necks that they can press if they get into trouble. A device dials out and their information is presented, via a central receiver, on one of many computer terminals. AREV sits waiting for calls to come in via the serial port of the computer. The operator can then talk directly to the client if the client is up to 50 metres away from their phone.
They have a hot site that all the phones can be directed to at the flick of a switch. The hot site has the same capacity as the main site and has data that is current as of the last backup. They also have carousels containing printed details of all clients, current up till the last evening, in case the computers go down but calls are still being received. As well, they keep logs of all changes between backups. They are shifting to mirrored servers off-site via a high-speed data link.
It took me 10 years to develop it to its current state, but we support it only when they need us these days.
I think the mirrored server off-site has to be the ultimate backup, doesn't it?
Bob Watson
At 10 APR 2003 01:00PM Richard Hunt wrote:
Bob,
I was just thinking about mirrored drives. They are great for everything except theft, fire, etc.
One customer of mine has a mirrored "hot swappable" system. One drive went down and all we did was pull the drive out - and that was while the system was running with users on it.
Once the replacement drive was installed, it had to be mirrored again. That has to be done while no users are on the system.
When it comes to GFEs (group format errors) on mirrored systems… I have no experience. That customer has never had GFEs.
At 10 APR 2003 05:34PM Paul Rowe wrote:
A mirrored system sounds like a good possible solution. I guess the files on both systems are normally in use. Is it possible to pause updates to one of the copies while the backup is made and then automatically resync them on completion? (This was something suggested earlier in the thread.)
Paul
At 11 APR 2003 01:39AM Donald Bakke wrote:
A mirrored system sounds like a good possible solution.
It sounds like you have this figured out, but just in case, keep in mind that a mirrored system is a bit different from a traditional backup. For instance, if someone deletes a critical record or table, it gets deleted from the mirrored system as well.
Is it possible to pause updates to one of the copies while the backup is made and then automatically resync these on completion?
My experience is with Novell systems. If you disengaged the mirrored drives for any reason, they would be resynched automatically, although often with slowed performance on the server for a little while.
dbakke@srpcs.com
At 13 APR 2003 07:40PM Bob Watson wrote:
Richard
I was thinking of mirrored servers off-site, down a high-speed link. That's what they are implementing at this place.
Bob Watson
At 14 APR 2003 11:18AM John Bouley wrote:
Paul,
Have you looked into PowerQuest's new drive imaging for servers? It is supposed to allow you to create a drive image while the system is still up and running.
Also, I remember a product from St. Bernard that allowed for a backup of open files. I believe this technology has matured and is now an option for most backup programs.
Please be aware that I have not used either of these solutions…
HTH,
John Bouley
At 27 APR 2003 08:16PM Paul Rowe wrote:
Hi John,
I've just got back from holiday and have looked at the two products you mentioned. Both seem very promising, and the St. Bernard product (Open File Manager) has a whitepaper describing the process, which seems to be exactly what we are looking for.
The software takes a virtual point-in-time backup by saving a pre-write copy of any blocks of data updated while the backup is in progress. The backup then runs on the virtual copy that has been made.
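As I read it, the principle is something like the following toy Python sketch of the pre-write copy idea (not the actual product):

```python
class PointInTimeView:
    """Before a block is overwritten, its original contents are saved,
    so a backup reader still sees each file as it was when the
    snapshot began."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.saved = {}           # (path, block_no) -> original bytes

    def before_write(self, path, block_no):
        """Called just before the live system rewrites a block."""
        key = (path, block_no)
        if key not in self.saved:
            with open(path, "rb") as f:
                f.seek(block_no * self.block_size)
                self.saved[key] = f.read(self.block_size)

    def read_block(self, path, block_no):
        """The backup reads through this: the preserved copy if the
        block changed since the snapshot, the live data otherwise."""
        key = (path, block_no)
        if key in self.saved:
            return self.saved[key]
        with open(path, "rb") as f:
            f.seek(block_no * self.block_size)
            return f.read(self.block_size)
```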
Thanks for the suggestions.
Paul
At 28 APR 2003 08:32AM John Bouley wrote:
Hi,
Just do some test restores before deciding it works… I have not used the product myself, but have often wanted to know if it lives up to its claims.
Let us know…
John