Notes / Domino Server crashes and fixup woes
by jcf.
I’m sitting in front of two clustered Notes servers that died on me while I was trying to diagnose a problem using the OpenLog logging database. I had added some unsuspicious-looking code to work around a problem, and of course the workaround failed. To get more information on what failed, I decided to use OpenLog to pinpoint the error.
The moment the agent ran with OpenLog enabled, the Domino 6.0.4 server died. On the following restart it hung on a consistency check. Killing the server with nsd -kill and then running nfixup -j -S c:\domdata brought up a lot of “interesting” error messages:
Hardware/OS error () writing to database (c:\Domdata\xx\xxxx.NSF),
Length=5B8 (PID=2D8/TID=2)
This is on a Windows 2003 Server.
Has anyone got any clue of what is going on?
(In the meantime I have managed to kill the clustered backup server in the same manner, and I’m watching it run fixup on its databases before I can restart it.)
It’s going to be a long night….

That’s terrible. I hate to hear that. One possibility for the crash could be:
“SPR# FMEG63H9ND – Fixed a Server crash at LSsObjMgr::UnLinkModule. There were no common symptoms for this crash – the only possibility is the use of nested script libraries. This regression was introduced in 6.5.1 and has been fixed in 6.0.5, 6.5.4, 6.5.3 FP1 and 6.5.2 FP1.”
It doesn’t seem like that would cause you to start getting “Hardware/OS error” messages on a database all of a sudden, though. If there’s some kind of database corruption going on (and it’s happening to multiple databases), that sounds almost like a drive controller or RAID array problem.
Was the OpenLog database that you were trying to write to local to the server or on another server? Maybe there was corruption on the database itself (just like on the other databases) and the server crashed when it tried to write the error message to a corrupted logging database.
Did the agent work properly on your test server? What kind of agent was it: WebQueryOpen, scheduled, triggered? Are any of the databases encrypted or in a compressed folder or anything?
Julian, that SPR looks like it could be the clue. I have a Use “OpenLogFunctions” statement in another script library (the one that is giving me headaches).
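For readers who haven’t run into this, the nesting described above looks roughly like the sketch below. The library name “LegacyLib” and the routine DoWork are made up for illustration; only “OpenLogFunctions” and its LogError call come from OpenLog itself, and the actual code of the legacy application is not shown here.

```lotusscript
' --- Script library "LegacyLib" (hypothetical name) ---
' (Options) section: this library itself pulls in the OpenLog library,
' which is what makes the script libraries nested.
Option Public
Option Declare
Use "OpenLogFunctions"

Sub DoWork
	On Error Goto errHandler
	' ... legacy application code would go here ...
	Exit Sub
errHandler:
	' LogError is provided by OpenLogFunctions and writes the current
	' error to the OpenLog database.
	Call LogError
	Exit Sub
End Sub

' --- Agent (separate design element) ---
' (Options) section: the agent uses only LegacyLib, so OpenLogFunctions
' is loaded indirectly, one level down.
Option Public
Use "LegacyLib"

Sub Initialize
	Call DoWork
End Sub
```

If the SPR really is the cause, it would be this agent → library → library chain that the server trips over, not anything OpenLog does on its own.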
Luckily we got “nfixup” to run on most of the databases, and, as further luck, the second server was crashing but didn’t corrupt anything. So we deleted one directory of databases on the first server, got it back up and running, and replicated the whole directory back from its peer. That seemed to work fine.
After getting the servers up and running, and getting out of the door, I devised a way around the problem that changes some data rather than the code of the legacy application I’m dealing with. I’m confident that the system will be back to normal when the users arrive in 2 hours ;-)
And I’ll tell the war story another time, when there’s more time: what the external change was (our city changed phone dialing codes from 01 to 044), what internal change was made in a company’s user db, and what that caused in the application I’m maintaining…
Thanks for the clue!
Another possibility:
http://www-1.ibm.com/support/docview.wss?rs=463&uid=swg21206012&ca=lsdom