Technical Support, Instructions & Repair Service


Acer Altos G510 Server

5.0.6 grinds to a complete halt


By Rogers - usenet poster


Running sco ose 5.0.6 on Acer Altos G510
System runs merrily for anywhere from 10-20 hours
Then, just stops
Can't log in
If a shell is running, it will still echo characters typed, but takes no
action
cron jobs don't execute
only solution is to hard reboot

It seems to me it must be completely pre-occupied doing something, such that
it totally ignores everything else.
But, no common thread as to when it dies, or what is running at that point.

Suggestions for how I can attack/diagnose what is going on?
--
Barry

---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.659 / Virus Database: 423 - Release Date: 02/05/2004

Solution #1

posted on Aug 10, 2005

Odud

I have determined that these disks are in fact LVD disks

There are 5 Seagate 36GB hard drives attached to the ultra320 controller,
nothing else

To the best of my knowledge, (distributor put this together for me), they
used proper LVD cable

Barry

Solution #2

posted on Aug 10, 2005

maartenw

Bob, I'm remote from the site, and I didn't build the system, so I'll go in
search of the information you requested to be posted.
I can tell you that indeed there are only disk drives (5) attached
internally to the ultra 320 controller.  Configured as RAID 5, with the 5th
disk configured as a hot swap spare.
We have a TANDBERG SLR5 tape drive attached to the lsil adapter, but only
the hard drives connected to the amird.  In the past, the people who built
this were in the habit of attaching all hard drives to the same channel, but
I don't know that for a fact (yet) in this case.
I will research the rest of the info you asked about and post.   I'm posting
the output from hwconfig here:

name=kernel vec=- dma=- rel=3.2v5.0.6 kid=2000-07-27
name=cpu vec=- dma=- unit=1 family=15
name=cpuid vec=- dma=- unit=1 vend=GenuineIntel tfms=0:15:2:9
name=fpu vec=13 dma=- unit=1 type=80387-compatible
name=pci base=0xCF8 offset=0x7 vec=- dma=- am=1 sc=0 buses=3
name=PnP vec=- dma=- nodes=0
name=clock vec=- dma=- type=TSC/2392283909Hz
name=scodb vec=- dma=- conskey=Alt-Ctrl-d siokey=^X
name=serial base=0x3F8 offset=0x7 vec=4 dma=- unit=0 type=Standard nports=1 fifo=yes
name=console vec=- dma=- unit=vga type=0 num=12 scoansi=1 scroll=50
name=floppy base=0x3F2 offset=0x5 vec=6 dma=2 unit=0 type=135ds18
name=kbmouse base=0x60 offset=0x4 vec=12 dma=- type=Keyboard|PS/2 mouse (wheel) id=0x03
name=parallel base=0x378 offset=0x2 vec=7 dma=- unit=0
name=adapter base=0x170 offset=0x7 vec=15 dma=- type=IDE ctlr=secondary dvr=wd
name=adapter vec=5 dma=- type=amird ha=0 id=7
name=adapter base=0xD800 offset=0x80 vec=11 dma=- type=lsil ha=0 id=7 Chip=1030 10327
name=bcme0 vec=9 dma=- chip=BCM5702 mem=FE560000 phy=BCM5703 addr=00:c0:9f:36:98:1b
name=RIO vec=10 dma=- Jet PCI @ 0xFE7E0000 : Drvr Rel 1.1.11
name=cd-rom vec=- dma=- type=IDE ctlr=sec cfg=mst dvr=Srom->wd
name=tape vec=- dma=- type=S ha=0 id=2 lun=0 bus=0 ht=lsil
name=disk vec=- dma=- type=S ha=0 id=0 lun=0 bus=0 ht=amird
name=Sdsk vec=- dma=- cyls=13361 hds=255 secs=63 fts=sdb
name=Stp-0 vec=- dma=- Vendor=TANDBERG Product= SLR5 4/8GB
#



I gather that your experience suggests to you, that my problem may well be a
cable/terminator issue?  That seems most likely based on your own
experience?
I haven't been able to pin this "hangup" issue on anything in particular.
There is evidence (based on some of the hangups) that more disk activity
(still not much, really) will bring on the hangup sooner-- but it has also
hung up with nothing that could reasonably be called "intensive" disk
activity.
The fact that it operates so slowly (i.e., 4 times slower than the 5 year
old -- also RAID5-- server that it replaces)-- does that fit into your
theory, as well?  (Bela thinks these may well be two separate issues--the
hangups, and the slow performance)

Killing off the hog

Solution #3

posted on Aug 10, 2005

Phoebe

My apologies to Rob, for not answering his question.  I was just
reviewing the entire thread when I realized this.

 LSI SCO Unix Driver Version 2.18-00 Loaded

is the driver we are using.  It didn't seem to be a problem-- but
maybe I shouldn't state that.  Actually, it never even crossed my mind
until this moment, that maybe the driver could be at fault here?  I
have a couple of other sites, almost identical, that were installed 14
months ago, and 8 months ago, respectively.  In both cases, version
2.14 was used.  Hmmmmm....

As regards your other point, re email clogging this system, that
wasn't part of the syndrome.  It was just plain old horrific
performance.  I found several other postings on the web, indicating
other people had experienced the same thing as I did, in terms of
horrible performance of amirdmon, on 5.0.6, and lsil has acknowledged
it as well, by releasing version 1.05.  Well, they haven't exactly
released it .. if you download from their site, you get the bad
version.  If you call them up, and say "pretty please", they will send
you the 1.05 version.
I still haven't installed the new version, feeling I wanted to keep
the amirdmon out of the mix, whilst in search of an answer to my
meltdown scenarios.

Regards
Barry

Solution #4

posted on Aug 10, 2005

Reynolds

The trickery is to use a serial console, then you can use a comm
program's "log the session" feature.

I did suggest leaving user-level scodb running on an ssh/telnet session,
did you try that?  That would give you complete remote control (I don't
think you need to set breakpoints), and the ability to log everything...

Solution #5

posted on Aug 10, 2005

Cornish

There's a pretty easy way to determine if the disk subsystem is hung that
doesn't involve the debugger: If you have an LED connected to the SCSI
controller, it will be lit solidly, but none of the disks' LEDs will be lit. The
controller is hung waiting for a command response that never times out.

If you could, please post the structure of your SCSI subsystem, including
all devices attached, on which channel(s), and whether they're LVD or
single-ended. The fact that it's an ultra320 controller should mean that
you have nothing but 3 or more ultra320 LVD disks attached internally,
using an approved LVD cable or externally using an approved LVD enclosure.

FWIW, this happened to our system just an hour ago when I attempted
to perform a full backup using a DDS-3 tape that should have been
discarded months ago. The Sony SDT-11000 drive we use has a bad
habit of hanging up the bus when it has to deal with too many write-retries
or errors. I hope to move it to its own 2940U2W controller shortly;
right now it's by itself on the 2nd channel of an Adaptec 3210S RAID
controller, with the disks being on channel 1. But I've run into this
hanging problem before with other RAID controllers, and it's always
been due to less-than-perfect cables and active terminators when I
was using wide, single-ended SCSI devices. Now that I'm using LVD
devices, termination is not an issue, but less-than-perfect tape drives
still are.

Bob

Solution #6

posted on Aug 10, 2005

Bray

I can't really think of anything for this that wouldn't involve access
to the "amird" driver source, which I doubt LSI Logic will give you.

You can set up a C program (or even a shell script) to further confirm
what I've said.  For instance, as a shell script:

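  #!/bin/sh
  #
  # "disk-watch"
  #
  # The loop relies on commands that are normally shell builtins, so the
  # script itself should need no disk access once it is running.
  testfiles=
  for file in /etc/default/*; do   # obscure files that won't be in cache
    [ -f "$file" ] && testfiles="$testfiles $file"
  done
  set --                           # start with an empty argument list
  while :; do
    read line || exit 0            # stop on end-of-input instead of looping
    echo "You typed: <$line>"
    case $line in
      *hey*) # refill the list once every file has been used
           [ $# -eq 0 ] && set -- $testfiles
           testfile=$1
           shift
           echo "Trying to read from $testfile..."
           read foo < "$testfile"  # this read blocks if the disk is hung
           echo "Got <$foo> from $testfile";;
      *quit*) exit 0;;
      "")  echo "Type <hey> to check if the disk is working, <quit> to quit...";;
    esac
  done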

Leave that running.  When the system gets into the apparent hang state,
first confirm that you can still interact with the shell process, by
typing in some random junk.  It should echo "You typed:" and what you
typed -- showing conclusively that the script is still running.  Then
type "hey", and it will try to read from a file that probably hasn't
been accessed in recent memory.  If that hangs, it's pretty strong
evidence that the disk subsystem (or at least the part that gives access
to the root filesystem) is out to lunch.

Let's see... stuff you can do in scodb that isn't conclusive, but very
suggestive...

Look at ps() output.  The 9th column (WAITCHAN) shows what each process
is waiting on.  During normal operation this will be all sorts of
things.  After the disk hangs, each process that gets stuck waiting for
disk I/O will probably be waiting on a similar thing (not necessarily
identical).  They might all be the same symbol name+offset; same symbol
+ varying offsets; or might be hex addresses that scodb can't decode to
a symbol, and the addresses are all close together.

In any case, once the disk is hung, there will be a certain number of
processes that have already run aground, and others that haven't.  If
you're running the above "disk-watch" script and haven't yet said "hey"
to it, it should show up one way in ps().  It'll probably be waiting on
some offset relative to "spt_tty".  Afterwards it should show that it's
waiting on something different.  Similarly, the console getty processes
will generally be waiting on offsets of "cn_tty".  Have someone on-site
flip to a few different multiscreens, type in usernames and hit <Enter>.
Each process will hang attempting to exec `login`; they'll end up with
similar or same WAITCHANs, different from before.

You should be able to characterize those hanging-process WAITCHANs.
Something like "they're all waiting for amird_lock" would be an obvious
smoking gun.
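
If the ps() listing can be captured to a text file (for example through a serial console log), the WAITCHAN comparison described above could be tallied mechanically. The following is only a sketch under assumptions that are not from the thread: the file name is whatever you saved, columns are whitespace-separated, and WAITCHAN really is column 9 in your capture. The function name `tally_waitchan` is made up for illustration.

```shell
#!/bin/sh
# tally_waitchan FILE: count how many processes wait on each symbol.
# Assumptions: FILE is a saved ps() listing, fields are separated by
# whitespace, and WAITCHAN is field 9. Offsets such as "+1c" are
# stripped so "sym+4" and "sym+1c" are counted under the same symbol.
tally_waitchan() {
    awk 'NF >= 9 && $9 != "WAITCHAN" {
             w = $9
             sub(/\+.*/, "", w)      # drop the "+offset" part
             n[w]++
         }
         END { for (w in n) print n[w], w }' "$1" | sort -rn
}
```

Running `tally_waitchan ps-after-hang.txt` on a listing taken before the hang and another taken after it should make any cluster of processes stuck on one symbol stand out at the top of the second list.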

Next, look

Solution #7

posted on Aug 10, 2005

Beresford

This seems like it should have a simple answer, but it doesn't.  It's
compounded by two operating systems, multiple versions, local and remote
docs, different doc servers, in-kernel and user-level versions of the
debugger, and reference vs. guide docs.  Whee...

There are two basic documents: the SCODB User's Guide (in various
forms), and the scodb(ADM) man page.

The man page is easy and short: it just describes how to invoke the
user-level scodb.  All usage information is deferred to the User's
Guide.  To reach the man page:

  (local)   man scodb
  (remote)  http://docsrv.sco.com:507/en/man/html.ADM/scodb.ADM.html

The guide comes in various editions.  It's been changed slightly over
the years to reflect slight improvements in scodb.  Also, scodb was
ported to UnixWare, so there are a couple of UW7 editions of the guide.
On UW7, scodb is a kernel-level debugger only (no user-level
/etc/scodb).

OSR506 systems should have a local copy of the guide, plus you can
access the same edition remotely:

  (local)   http://localhost:457/SCODB/CONTENTS.html
  (remote)  http://osr5doc.ca.sco.com:457/SCODB/CONTENTS.html

OSR507 systems should have a local copy of the guide, plus you can
access the same edition remotely:

  (local)   http://localhost:8457/en/SCODB/CONTENTS.html
  (remote)  http://docsrv.sco.com:507/en/man/html.ADM/scodb.ADM.html

Note that the editions shipped with 506/507 actually apply best to
OSR504!  To get a guide that is actually up to date for 505/506/507, you
need to get it from the Hardware Developer's Kit (HDK) docs, available
online at:

  (remote)  http://docsrv.sco.com/HDK_basics/CTOC-scodb_top_osr.html

But don't worry, the actual changes are minuscule.  Basically, this
section is added:

" Version 2.4.0 of SCODB is for the SCO OpenServer Release 5.0.5 kernel.
" The following features are new to this release:
"   * Automatic creation of the stundef and vardef files for the
"     user-level scodb command.
"   * The boot code is modified to show when scodb is configured and the
"     current DBKEY value that determines the key strokes used to invoke
"     kernel-level scodb.

The UnixWare 7 edition of the SCODB User's Guide was originally
presented in guide form, up through UW 7.1.0:

  (remote)  http://docsrv.sco.com/HDK_basics/CTOC-scodb.intro.html

After that, it was merged into a single man page, scodb(1M).  Unlike
OSR5 scodb(ADM), scodb(1M) is the

Solution #8

posted on Aug 10, 2005

Powe33

Bela, thanks so much for your clear and concise explanation of how a
disk-subsystem hang would reflect in how ose 5.0.6 behaved.  I thought I
was going (gone?) nuts, when it would echo characters, but do nothing.  (I
actually left my commands "running" for 8 hours at one point, figuring that
SOMETHING would finally happen-- not)
Also, thanks  for all the additional info you have given on the use of the
debugger.
Now-- at the risk of wearing out my welcome here-- can you give me any
direction as to what I might be able to determine, with the debugger, to
confirm the theory that it is the disk subsystem that is hanging?  I would
have thought there would have been some sort of error message, or timeout,
if it just hangs.  Clearly that is not the case.  Would I look for something
like a process that I can hopefully identify as disk IO, just sitting there
watching?
Thanks again
Barry



Solution #9

posted on Aug 10, 2005

Perkins

For the time being, at least, I won't be able to use the "buddy system".
Meaning, I will be asking the customer to go to the system console, and do
Ctrl-X, to break into the debugger.  I don't suppose there is any trickery
whereby we could capture
the standard output from the debugger, so that I can post it?
(Didn't think so-- but thought I would ask, anyway)

Thanks
Barry

Solution #10

posted on Aug 10, 2005

Hart

Yes.  Oh, there may be some cases where a disk hang would stop the
kernel as a whole, but that would be unusual.  You've already described
symptoms where the kernel is clearly still running (you can type and
edit an input line); under those conditions, scodb will definitely be
helpful.

OpenServer includes two different items named "scodb".

There's the kernel debugger driver, enabled by putting 'Y' in
/etc/conf/sdevice.d/scodb, relinking the kernel and rebooting.  To get
into that, you must be sitting at the system console.  Note that the
console can be either the local video card multiscreens, or a serial
port.  I like to use something I call the "buddy system", described on:

  http://groups.google.com/groups?se... @vagabond.armory.com

scodb only supports standard ASCII over a serial port, so it has no way
to observe a sequence like Ctrl-Alt-D; the default debugger key on a
serial port is Ctrl-X.
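
The enabling step mentioned above (a 'Y' in /etc/conf/sdevice.d/scodb, then relink and reboot) might be scripted roughly as follows. This is a sketch, not a tested procedure: it assumes the file uses the usual sdevice layout in which the second whitespace-separated field is Y or N, and the helper name `enable_sdevice` is invented here. Look at the real file before trusting it.

```shell
#!/bin/sh
# enable_sdevice FILE: print FILE with field 2 changed from N to Y on
# every line that does not start with '#'. Spacing is left untouched.
# Assumption: field 2 of an sdevice entry is the Y/N configure flag.
enable_sdevice() {
    sed 's/^\([^#[:space:]][^[:space:]]*[[:space:]]\{1,\}\)N\([[:space:]]\)/\1Y\2/' "$1"
}

# Intended use on the server, as root (not executed by this sketch):
#   cp /etc/conf/sdevice.d/scodb /tmp/scodb.orig
#   enable_sdevice /tmp/scodb.orig > /etc/conf/sdevice.d/scodb
#   cd /etc/conf/cf.d && ./link_unix    # relink the kernel, then reboot
```

Keeping the original copy in /tmp makes it easy to put the file back if the relinked kernel misbehaves.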

There's also the scodb(ADM) command (/etc/scodb).  This is basically a
recompilation of the scodb driver as a user-level program.  It can read
and write kernel memory, but cannot affect flow of control (for instance
it can't set breakpoints nor be invoked when a breakpoint -- or panic --
is hit).  It can also be run to examine a saved crash dump.

If it's linked in, the in-kernel debugger can be invoked at any time
from the console, regardless of who's logged in (including nobody).
Exception: you can invoke it from a graphical screen like X, but it
doesn't do anything to put the screen back in text mode, nor display
itself in graphics.  You can do simple interactions (like "whoops, I
just hit Ctrl-Alt-D and the whole system hung ... type 'q' Return ...
ah, we're back").  There may be a similar situation on a serial port, but
it's much more obscure: OSR5 supports a "scan code mode" where, with a
suitable terminal, programs _can_ see things like Ctrl and Alt being
pressed and released.  I've never tried scodb under those circumstances,
but I bet it would go wrong.  Hardly anything uses scancode mode.

In order to run user-level scodb, you must be logged in as root.  This
would be impossible if the disk subsystem had already hung.  However, if
you were _already running_ scodb by the time of the hang, you might get
away with it.  So experiment with that -- open an ssh session to the
machine, run `/etc/scodb -w`, leave it that way until the hang happens.
See if you can interact.  Two possible impediments: (1) if memory is
overcommitted, pages of your idle process will have been pushed out to
swap and it'll hang trying to retrieve them; (2) scodb may not have
faulted in all the pages of its own executable image.  #2 could be a big
problem.  To stack the deck in your favor, when you first start the
session, briefly use all the scodb commands you're likely to use later.

You mentioned that you were coming in re

Solution #11

posted on Aug 10, 2005

Hart

In article <mKgoc.43839$n7P1.28 @twister01.bloor.is.net.cable.rogers.com>,

Just one little comment here:

That is typical of many Unix systems I've used over the years.

You think the machine is alive, but it's only the keyboard/display
giving you hope :-(.   Nothing you type is making it to the part of
the system that does any good.

On some systems [I've not tried this on SCO] I've been able
to perform a remote login to give me enough control to reboot
the system without having just to hit the power switch.

And the ping is also misleading.  I've also seen this on various
Unix/*n*x systems, and while you can ping, everything else is totally
dead.

I just pass this on as it's not unique to SCO.

Bill

--
Bill Vermillion - bv @ wjv . com

Solution #12

posted on Aug 10, 2005

Duke

Perfectly consistent.  OpenServer is very conservative about swapping;
it never pushes process pages out to swap unless it's out of memory.  On
modern systems this generally means that swap is never touched.  Thus,
any active process resides entirely in memory.  Also, the kernel itself
is all hard-loaded in RAM -- none of it is pagable.  If the disk
subsystem hangs, the kernel continues to function.  Each individual
process continues to function until the first time it tries to access
the disk.

For instance, the program that provides the login prompt (`getty`, for
console ttys) will continue to accept and echo characters.  If you hit
return on a name, it goes to exec `login`, which involves disk access,
so you never get to the password prompt.

If you're sitting at a shell prompt, you can type; you can run internal
commands like "echo foo"; but any attempt to run a binary will hang.
(Even if the binary is fully cached, its access time needs to be updated
on disk.)
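
To see ahead of time which commands fall on which side of that line, something like the sketch below could be run while the system is still healthy. It assumes a POSIX-style shell that provides `command -v`; the stock OSR5 /bin/sh may not, and the function name `classify` is just illustrative.

```shell
#!/bin/sh
# classify NAME...: say whether each command is handled inside the
# shell or has to be loaded from disk as a separate binary.
# Assumption: `command -v` prints a full path for external commands
# and a bare name (or alias text) for anything the shell handles itself.
classify() {
    for name in "$@"; do
        where=$(command -v "$name" 2>/dev/null)
        case $where in
            "")  echo "$name: not found" ;;
            /*)  echo "$name: external ($where), needs disk access" ;;
            *)   echo "$name: builtin, keyword, function or alias, no exec needed" ;;
        esac
    done
}
```

For example, `classify echo cd ls ps` would typically report `echo` and `cd` as builtins and `ls` and `ps` as external, which matches the behaviour described above: the first two keep working during a disk hang, the others do not.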

It isn't particularly weird.  What you're describing is a fairly
standard set of symptoms for a variety of conditions including SCSI bus
timing, parity or signal integrity problems; internal errors in a disk
drive; and so on.  You might rightly expect a RAID controller to be a
bit more thorough about error recovery, but apparently this particular
one -- in this particular failure case, whatever it is -- isn't.  

You also mischaracterize the situation here.  It _isn't_ performing
absolutely normally.  It's running 6 times slower than older and
presumably much slower machines.

But I bet the two symptoms are actually unrelated, and you have two
separate problems to solve.  (1) complex application jobs run much more
slowly than expected; (2) the disk subsystem occasionally hangs.



These are good questions... I'll post a second reply as a separate
subthread, because I'm going to include some research results that are
worth archiving permanently under a sensible subject line.

Solution #13

posted on Aug 10, 2005

Charlie

It appears I declared victory a little too early.

Killing the amirdmon process did indeed have salutary effects on the
performance.  Customer stopped reporting noticeable slowness in system
performance.

As noted above, file copy type jobs were still 3-4 times slower than
the 5 year old server.  However, the server did run for a full week,
before back-sliding yesterday.  Again, nothing going on that I can pin
it on.

I'm now inclined to theorize that Bela's suggestion is correct-- that
the disk (RAID 5) has stopped responding completely.  Would that be
consistent with the reported behavior? i.e., if you are in a shell,
you can type characters, and they echo, and you can do a carriage
return-- but nothing is ever executed?
Also, this seems major-league weird-- that the system can perform
absolutely normally, all the time-- except once in a while it loses
contact with the disk?

Some questions re the debugger- which I have now configured.
If the disk has stopped-- am I likely to get anything back from the
debugger?
I assume this can only be run from the system console-- I can't do it
remotely?
I imagine that, in order to get info from the debugger, root must
already be logged in, and sitting at # prompt?
I am trying to experiment with the debugger in advance of the
freeze-up, to try to get a little bit familiar with it:
i)  if I hold CTRL-ALT-D - it just logs me out, as if I had pressed
CTRL-D
ii)  I can load scodb, from shell prompt
    If I enter "stack" command, I get
When operating on /dev/mem, you cannot examine the stack of the
current process.  The "stack" command must be used with the "-p"
argument.
If I enter "stack -p", I get the same message

Can someone point me to documentation on scodb?  man scodb makes
reference to the SCODB User's Guide.  I thought I had a complete set
of manuals- but I don't have that one.

Thanks again for any suggestions
Barry

Solution #14

posted on Aug 10, 2005

Bomber

Looks like I have gotten close to the problem here.
It is apparently the amirdmon program that is the cause of this mischief.
This is for the lsil MegaRAID 2-channel U320 SCSI RAID controller, with 128 MB
cache
The amirdmon program is the "monitor" program, set up as /etc/amirdmon, to
watch for, and report hard drive failures.
I killed the program this morning, and performance immediately improved.   I
called lsil support, and they have sent me amirdmon v 1.05 (to replace the
v1.04 that was supplied with the Raid controller)
Too soon to tell if all my problems are history, but it does look better.
One disconcerting fact:  Before killing the amirdmon, I ran the same job on
the new server, and on the 5 year old Acer Altos 9100 (also with RAID 5)
server that it replaced.  It took 6 times as long on the new server!
After killing the amirdmon, I ran the job again-- now it only takes 4 times
as long as the old server.  Clearly something else is still not correct.
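
To put figures like "6 times" and "4 times" on a repeatable footing, the same fixed copy could be timed on both servers. The sketch below is one way to do that; it assumes `date +%s` is available (it is not on every older Unix, so check first) and that the source file is large enough not to be served entirely from cache. The function name `time_copy` is made up.

```shell
#!/bin/sh
# time_copy SRC DST: copy SRC to DST, flush buffers, and print the
# elapsed time in whole seconds.
# Assumption: `date +%s` prints seconds since the epoch on this system.
time_copy() {
    start=$(date +%s)
    cp "$1" "$2" || return 1
    sync                      # ask the kernel to flush buffered writes
    end=$(date +%s)
    echo $((end - start))
}
```

Running the same `time_copy` call on the old and the new server, with amirdmon running and again with it stopped, would give four numbers that can be compared directly.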

To answer your questions:
System is remote, so I'm not able to observe disk light.
I could in fact ping the system, while it was hung
Flipping screens on the console did work-- sort of.
i.e., user sees the login prompt, he can type his login,
and it echoes the characters that are typed.
But, then wait forever for password prompt-- never happens.

Thanks for the tip on the debugger.  I am optimistic that getting rid of the
amirdmon will avoid the hangup again-- if I'm wrong, I will post the results
you suggested.
Thanks for your help

Barry



Solution #15

posted on Aug 10, 2005

2Pansy

If memory serves, the amirdmon caused a complete hang of the server due
to a huge number of email messages generated on the system.

Killing the amirdmon daemon might (or might not) be a good idea since
you still don't know the reason for the above messages. Try having a
look at the /usr/spool/mail/root file to see if the above slowdown is
really email based.
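
A quick way to check whether root's mailbox has ballooned is to look at its size and message count. The following is a minimal sketch, assuming the mailbox is in the usual mbox format where each message starts with a "From " line; `mbox_summary` is a name invented for this example.

```shell
#!/bin/sh
# mbox_summary FILE: print the size in bytes and the number of
# messages in an mbox-format mailbox.
# Assumption: every message begins with a line starting "From ".
mbox_summary() {
    [ -f "$1" ] || { echo "$1: no such mailbox"; return 1; }
    bytes=$(wc -c < "$1" | tr -d ' ')
    msgs=$(grep -c '^From ' "$1")
    echo "$1: $bytes bytes, $msgs messages"
}
```

`mbox_summary /usr/spool/mail/root` run a few hours apart would show whether the daemon is generating mail at a rate that could matter.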

Out of curiosity, are you able to use the latest/greatest driver from
LSI logic ?

Best,
Rob

Solution #16

posted on Aug 10, 2005

Riddle

I have seen a similar symptom, it was started by a runaway program.  The program
would run fine for months, but for some reason would randomly grow out of control.
At the point it grew to using all real memory, the system would slow down to a crawl.
The next step was the consumption of all swap space, which left the system _hung_.
Finally the program would dump a huge core file, again appearing _hung_ for the
duration.

By luck, this happened on a weekend once.  This system did recover after dumping
the core file, but missed cron jobs and the sar report was messed up.  My best
guess is that it took over 24 hours to finish.  The only hint was it looks like an
application program attempted to spool a print file larger than 2 Gbytes, and
the lp software failed.

How much memory does the system have, and what is swap configured to?  What is
the disk arrangement?

Mike

--
Michael Brown

The Kingsway Group

Solution #17

posted on Aug 10, 2005

Ranny

On Thu, 06 May 2004 21:23:58 GMT, "Barry Swane" <bsw @rogers.com>
wrote:

I did this to myself once.  I forgot to turn off the "green" functions
in the bios.  When idle, the machine would go into power save mode.
The problem was that it didn't recover, and stayed permanently in the
power save mode, even when I pounded on the keyboard and mouse.  I
don't recall the exact symptoms, but the ability to echo characters
without executing anything sounds familiar.  Anyway, check the bios
settings.  

Some disk drives and controllers also have a "power save" feature.
Ancient Fujitsu drives were one of these.  The default jumpering
disabled the feature, but jumpers have been known to fall off.  Many
laptop drives also have a power save feature, but I suspect your
server isn't running 2.5" laptop drives.

You might wanna enable sar for 24 hour logging and look at the results
when it halts:
  http://www.LearnByDestroying.com/sco/sar24hour.txt
If it's running out of resources, sar will offer a clue.  Similarly, I
suggest you inspect the files:
  /usr/adm/syslog
  /usr/adm/messages
and see if there are any failure messages.
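
Scanning those two files after a reboot could be done along these lines. This is a sketch only: the keyword list is a guess at what a SCSI or driver problem might log, not something taken from the thread, and `scan_logs` is a hypothetical helper name. Older systems may need `egrep` in place of `grep -E`.

```shell
#!/bin/sh
# scan_logs FILE...: print lines that look like failure reports,
# prefixed with the file name and line number.
# Assumption: the keywords below are a starting point; adjust them
# once you have seen what the real messages look like.
scan_logs() {
    # /dev/null is added so grep always prints the file name
    grep -n -i -E 'error|fail|timeout|panic|warning|reset' "$@" /dev/null
}
```

A typical call would be `scan_logs /usr/adm/syslog /usr/adm/messages`, looking especially at entries time-stamped shortly before the hang.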

--
Jeff Liebermann    j @comix.santa-cruz.ca.us
150 Felker St #D   831-336-2558
Santa Cruz CA 95060    AE6KS

Solution #18

posted on Aug 10, 2005

Jimmy NY

That just means that the system is waiting for disk I/O.  Since the CPU
is hundreds or thousands of times faster than the disk, of course the
CPU is going to be waiting while disk I/O completes.  You can "improve"
these numbers by buying faster disk equipment, but before you do that,
ask yourself whether disk performance is a problem anywhere real, or
whether you're just reading a number that looks bad.

I previously posted:

" Any disk activity?  This sounds like a hard disk hang.  Look for
" permanently-on hard disk light.  Or permanently-off (but that's less
" instructive, you can't tell if the drive is hung or just not being asked
" to do anything).
"
" Can you ping the system from elsewhere?
"
" Can you flip multiscreens on the console?
"
" Turn on the kernel debugger ("Y" in /etc/conf/sdevice.d/scodb, relink,
" reboot).  When ground to a halt, break into scodb (Ctrl-Alt-D on a text
" console screen).  Give it the command "stack" to get an idea of what's
" going on.  "q" to quit, then break in again, do the same thing -- see if
" it's always doing the same thing.  Post one or more sample stack traces
" (one of each unique one, if there aren't too many).

Your new post convinces me of this diagnosis.  You're running `sar` when
the disk hang occurs.  Before the hang, %wio is about 75%.  When the
disk hangs, any processes that try to do disk I/O hang as well, waiting
forever for the disk to respond.  It doesn't take long before there are
no processes that can do anything interesting; no CPU time is used,
because all the processes are waiting for the disk.  That causes %wio to
read 100%.

The "100%" reading doesn't mean that the disk is super-busy.  In this
case it means that it has stopped responding.
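
As a rough illustration of that reasoning, here is a minimal Python sketch that reads saved `sar -u` text and reports stretches where %wio stays pinned near 100%. The column layout (a header row naming %usr, %sys, %wio and %idle), the sample numbers, the thresholds, and the function names are all assumptions made for the example, not taken from the poster's system.

```python
# Made-up sample in an assumed `sar -u` layout: busy copy first, then a stall.
SAMPLE = """\
10:00:00    %usr    %sys    %wio   %idle
10:00:05       3      20      75       2
10:00:10       2      22      76       0
10:00:15       0       0     100       0
10:00:20       0       0     100       0
10:00:25       0       0     100       0
"""


def parse_sar_u(text):
    """Parse sar -u style text into a list of dicts keyed by column name."""
    rows, columns = [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "%wio" in fields:
            # Header row: the first field is a timestamp, the rest are names.
            columns = ["time"] + fields[1:]
            continue
        if columns is None or len(fields) != len(columns):
            continue
        if fields[0] == "Average":
            continue
        try:
            values = [float(field) for field in fields[1:]]
        except ValueError:
            continue
        row = {"time": fields[0]}
        row.update(zip(columns[1:], values))
        rows.append(row)
    return rows


def find_stalls(rows, threshold=99.0, min_samples=3):
    """Return (first_time, last_time) for runs of samples with %wio >= threshold."""
    stalls, run = [], []
    for row in rows:
        if row["%wio"] >= threshold:
            run.append(row["time"])
            continue
        if len(run) >= min_samples:
            stalls.append((run[0], run[-1]))
        run = []
    if len(run) >= min_samples:
        stalls.append((run[0], run[-1]))
    return stalls


if __name__ == "__main__":
    for start, end in find_stalls(parse_sar_u(SAMPLE)):
        print(f"%wio pinned from {start} to {end}: disk may have stopped responding")
```

A steady 75% during a copy is not flagged. Only a sustained run at the ceiling is, which matches the distinction between a busy disk and a hung one.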

So.  What kind of disk is it?  How is it attached?  If it's SCSI, you
probably have a termination problem.  If it's IDE, replace the current
doorstop with a working disk.

Solution #19

posted on Aug 10, 2005
Not Rated (0)

Janice

Rank:Apprentice Apprentice
Rating: 0%, 0 votes
Describe this machine?

Any disk activity?  This sounds like a hard disk hang.  Look for
permanently-on hard disk light.  Or permanently-off (but that's less
instructive, you can't tell if the drive is hung or just not being asked
to do anything).

Can you ping the system from elsewhere?

Can you flip multiscreens on the console?

Turn on the kernel debugger ("Y" in /etc/conf/sdevice.d/scodb, relink,
reboot).  When ground to a halt, break into scodb (Ctrl-Alt-D on a text
console screen).  Give it the command "stack" to get an idea of what's
going on.  "q" to quit, then break in again, do the same thing -- see if
it's always doing the same thing.  Post one or more sample stack traces
(one of each unique one, if there aren't too many).
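
If several stack traces are copied down by hand, a small script can pick out the distinct ones before posting. This is only a sketch: it assumes the traces were typed into one text file with a blank line between them, and it makes no assumption about what scodb's "stack" output actually looks like. The function name is hypothetical.

```python
import re


def unique_traces(text):
    """Return (trace, count) pairs for each distinct trace, in first-seen order."""
    counts = {}
    # Traces are assumed to be separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        key = "\n".join(lines)
        counts[key] = counts.get(key, 0) + 1
    return list(counts.items())


if __name__ == "__main__":
    import sys

    for trace, count in unique_traces(sys.stdin.read()):
        print(f"--- seen {count} time(s) ---")
        print(trace)
```

If every break-in gives the same trace, the count on that single entry shows it.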

Solution #20

posted on Aug 10, 2005
Not Rated (0)

Odud

Rank:Apprentice Apprentice
Rating: 0%, 0 votes
Further to my original note:
It appears that the issue must be a disk bottleneck.
sar -u output, while running a file copy, gives a steady value of 75%
for %wio.
Can anyone suggest a solution to improve this performance?

Worst problem is that %wio actually hit 100%, whereupon the system stopped
functioning, as described in my original memo.
Any thoughts on what could occasion a 100% wio?
Apparently no way out, once that has happened?

Barry

Solution #21

posted on Aug 10, 2005
Not Rated (0)

Joey2

Rank:Apprentice Apprentice
Rating: 0%, 0 votes
User "Barry Swane" <bsw @rogers.com> wrote in message
<--- cut ----->
Have you registered your SCO?

Rgds

Solution #22

posted on Aug 10, 2005
Not Rated (0)

Cato

Rank:Apprentice Apprentice
Rating: 0%, 0 votes
Tomek enscribed:
|
| User "Barry Swane" <bsw @rogers.com> wrote in message
| > Running sco ose 5.0.6 on Acer Altos G510
| > System runs merrily for anywhere from 10-20 hours
| > Then, just stops
| > Can't log in
| <--- cut ----->
| Have you register your SCO?

For how many times.......

Registration on 5.0.6 and earlier has no effect on the operation of the
system.
--
==========================================================================
 Tom Parsons                   t @tegan.com
==========================================================================

Solution #23

posted on Aug 10, 2005
Not Rated (0)

Cato

Rank:Apprentice Apprentice
Rating: 0%, 0 votes
There are thousands of unregistered SCO machines (and no doubt there will
be many more, since they made registration so much more difficult in 5.0.7
if the machine isn't Net connected).

Lack of registration is unimportant, harms nothing, affects nothing
other than nag messages.

--
Tony Lawrence
http://aplawrence.com/SCOFAQ
