Troubleshooting System Hangs

Subject : Troubleshooting System Hangs

Description :

It can be difficult to determine whether a system is really hung.
Sometimes only a single application has hung, yet the whole system appears
to be hung. This document describes how to tell whether the system is
really hung, how to diagnose the hang, and what to check:

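Can you rlogin or telnet to the system?

Does the system respond to ping?

Does the mouse move in the window system?

Have there been any recent changes or modifications to the system?

Do the hangs occur frequently?

What are the circumstances or symptoms when the hang occurs?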

Can the hang be reproduced on command?

What is necessary to get out of the hang (i.e. can the
machine be L1-A'd)?

CHECKING FOR A RESOURCE DEPRIVATION HANG
----------------------------------------

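Most system hangs are caused by resource deprivation: the system can no
longer obtain some resource it needs. To find out whether this is the
case, start by running the performance tools.

The following is an example of a shell script, run from cron every
15 minutes, that collects the data needed to tell whether the system is
CPU bound, I/O bound, or memory bound.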

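#!/bin/sh
# Each tool appends to its own log; the date stamp marks the start of a run.
date >> file1
vmstat 30 10 >> file1                    # run queue, swap, scan rate
date >> file2
iostat -xtc 30 10 >> file2               # per-disk %b, %w, svc_t
date >> file3
/usr/ucb/ps -aux >> file3                # per-process SZ
date >> file4
echo kmastat | crash >> file4            # kernel memory allocation by bucket
date >> file5
echo "map kernelmap" | crash >> file5    # kernel memory map segments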

CPU Power
---------

In the vmstat output, look at the run queue size in the first column (r).
If the run queue holds 3 or more processes per CPU (i.e. 3 on a
single-CPU system, 6 on a two-CPU system, and so on), the system bears
watching. If the run queue holds 5 or more processes per CPU, the system
is short of CPU power.
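
The run queue thresholds above can be checked mechanically. The following
is a minimal sketch, not a supported tool. It assumes the run queue is the
first column of every vmstat data line, and the function name check_runq
and the CPU count passed to it are chosen for illustration. On Solaris,
nawk may be needed in place of awk for the -v option.

```shell
#!/bin/sh
# Count vmstat samples at or above the per-CPU run queue thresholds.
# Usage: check_runq NCPU < vmstat-log   (check_runq is a hypothetical name)
check_runq() {
    awk -v ncpu="$1" '
        $1 ~ /^[0-9]+$/ {                  # skips header and date lines
            n++
            if ($1 >= 5 * ncpu)      short++    # short of CPU power
            else if ($1 >= 3 * ncpu) watch++    # bears watching
        }
        END { printf "samples=%d watch=%d short=%d\n", n, watch, short }
    '
}

# Made-up sample data for a two-CPU system, in vmstat layout.
check_runq 2 <<'EOF'
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 1 0 0 120000 8000   0   3  0  0  0  0  0  0  0  0  0  110  200   90  5  3 92
 7 0 0 118000 7600   0   9  0  0  0  0  0  1  0  0  0  150  900  300 70 20 10
 12 1 0 117500 7000  0  12  2  1  1  0  4  3  0  0  0  210 1500  500 85 15  0
EOF
```

In practice the function would read the vmstat log written by the cron
script (file1 in the example above) instead of the inline sample.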

Virtual Memory
--------------

If the swap column of vmstat keeps going down and never recovers, the
system may be running out of memory. To check for a kernel memory
shortage, look at the output of the two crash commands (described below).
To check for a user memory shortage in an application, look at the SZ
column of the ps command. This column shows the size of the process's
data and stack in kilobytes.
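
One way to spot the processes worth watching is to list the largest SZ
values from a ps snapshot. This is a rough sketch that assumes SZ is the
fifth column and PID the second column of /usr/ucb/ps -aux output; wide
values can run columns together, so the result should be checked against
the raw log. The function name top_sz is made up for this example.

```shell
#!/bin/sh
# Print "SZ PID" for the N largest processes in a ps -aux snapshot.
# Usage: top_sz N < ps-log   (top_sz is a hypothetical name)
top_sz() {
    awk '$5 ~ /^[0-9]+$/ { print $5, $2 }' | sort -rn | head -n "$1"
}

# Made-up sample data in /usr/ucb/ps -aux layout.
top_sz 2 <<'EOF'
USER       PID %CPU %MEM   SZ  RSS TT       S    START  TIME COMMAND
root         1  0.0  0.2  412  180 ?        S 09:00:01  0:01 /etc/init -
oracle     812  4.1 22.5 48120 30100 ?      S 09:02:10 12:44 ora_dbw0
user1     1033  0.3  1.1 2040  900 pts/2    S 10:15:42  0:03 -csh
EOF
```

Comparing the same PID across successive snapshots shows whether its SZ
keeps growing.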

If the swap column of vmstat goes down and then recovers, note the lowest
value it reaches. If that value is 4000 or less, the system is in danger
of running out of virtual memory. Add swap space to the system.
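
Finding the lowest swap value in a long vmstat log is easy to script. The
sketch below assumes swap is the fourth column of each vmstat data line;
the function name min_swap and the LOW/ok labels are invented for this
example.

```shell
#!/bin/sh
# Report the lowest value seen in the vmstat swap column.
# Usage: min_swap < vmstat-log   (min_swap is a hypothetical name)
min_swap() {
    awk '
        $1 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ {   # skips header and date lines
            if (!seen || $4 + 0 < min) { min = $4 + 0; seen = 1 }
        }
        END {
            if (!seen) { print "no samples"; exit }
            status = (min <= 4000) ? "LOW" : "ok"
            printf "min_swap=%d %s\n", min, status
        }
    '
}

# Made-up sample data in vmstat layout (leading columns only).
min_swap <<'EOF'
 r b w   swap  free  re  mf pi po fr de sr
 0 0 0  52000  8000   0   3  0  0  0  0  0
 1 0 0   9100  1200   4  40 10 30 60  0 90
 1 0 0   3800   700   6  55 20 80 90  0 150
 0 0 0  41000  6000   0   5  0  0  0  0  0
EOF
```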

Physical Memory
---------------

The sr (scan rate) column of vmstat is the rate at which pages are being
scanned to find pages that can be given to the processes that need them.
If this value is over 200, the system is short of physical memory.
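
The scan rate check can be sketched the same way. This example assumes sr
is the twelfth column of each vmstat data line (after r b w swap free re
mf pi po fr de); the function name check_sr is hypothetical.

```shell
#!/bin/sh
# Report the peak scan rate and how many samples exceeded 200.
# Usage: check_sr < vmstat-log   (check_sr is a hypothetical name)
check_sr() {
    awk '
        $1 ~ /^[0-9]+$/ && $12 ~ /^[0-9]+$/ {  # skips header and date lines
            n++
            if ($12 + 0 > 200) high++
            if ($12 + 0 > max) max = $12 + 0
        }
        END { printf "samples=%d max_sr=%d over_200=%d\n", n, max, high }
    '
}

# Made-up sample data in vmstat layout (leading columns only).
check_sr <<'EOF'
 r b w   swap  free  re  mf pi po fr de sr
 0 0 0  90000  5200   0   4  0  0  0  0  0
 2 0 0  88000   900   5  60 40 120 300 0 450
 1 0 0  87000  1100   2  30 10 60 150  0 180
EOF
```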

Kernel Memory
-------------

Kernel memory is the area of memory used for kernel data allocations.
It is often called the kernel heap. The maximum size of the kernel heap
depends on the machine architecture and the amount of physical memory.
If the machine runs out of kernel memory, it will usually hang.
The output of the crash "map kernelmap" command can be used to detect
a kernel memory shortage. This command shows how many segments of kernel
memory are available and how large each segment is. If only segments of
1 or 2 pages are left, the machine has run out of kernel memory (even if
there are many such segments).

The crash kmastat command shows how much memory has been allocated to
which bucket. Prior to Solaris 2.4, this showed only 3 buckets making
it difficult to tell which bucket was hogging the memory (if any).
Starting with 2.4, kmastat breaks memory allocation down to many
different buckets. If one of these buckets has several MB of memory,
and there are kernel memory allocation failures, there is probably
a memory leak involving the large bucket.

In order to diagnose a memory leak problem it is possible to turn
on some flags in Solaris 2.4 and above (see SRDB 12172). With
these flags turned on, once the kmastat command shows significant
growth in the offending bucket, L1-A should be used to stop the
machine and create a core file. SunService can use this core file
to help determine the cause of the leak.

Disk I/O
--------

To check for disks which are overly busy, look at the iostat output. The
columns of interest are %b (% of time the disk is busy) and svc_t (average
service time in milliseconds). If %b is greater than 20% and svc_t
is greater than 30ms, this is a busy disk. If there are other disks which
are not busy, the load should be balanced. If all disks are this busy,
additional disks should be considered.

There is no direct way to check for an overloaded SCSI bus, but if the %w
column (% of time transactions are waiting for service) is greater than 5%,
then the SCSI bus may be overloaded.
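
Both disk checks can be applied to the iostat log in one pass. This is a
sketch under stated assumptions: it takes svc_t, %w and %b to be columns
8, 9 and 10 of each iostat -x device line, and the function name
check_disks and its output wording are invented for this example.

```shell
#!/bin/sh
# Flag busy disks (%b > 20 and svc_t > 30) and disks with %w > 5.
# Usage: check_disks < iostat-log   (check_disks is a hypothetical name)
check_disks() {
    awk '
        NF >= 10 && $2 ~ /^[0-9.]+$/ {         # skips header and date lines
            if ($10 + 0 > 20 && $8 + 0 > 30)
                printf "%s busy (%%b=%s svc_t=%s)\n", $1, $10, $8
            if ($9 + 0 > 5)
                printf "%s waiting (%%w=%s)\n", $1, $9
        }
    '
}

# Made-up sample data in iostat -xtc layout.
check_disks <<'EOF'
                 extended disk statistics       tty         cpu
disk      r/s  w/s   Kr/s   Kw/s wait actv  svc_t  %w  %b  tin tout us sy wt id
sd0       1.2  0.8    9.6    6.4  0.0  0.0   14.2   0   3    0   24  5  3  2 90
sd1      38.5 22.1  308.0  176.8  1.4  1.9   54.7   9  61
sd3       0.0  0.1    0.0    0.8  0.0  0.0   41.0   0   1
EOF
```

A disk such as sd3 in the sample, with a high svc_t but a very low %b, is
not flagged because both conditions must hold.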

The threshold levels given for the various performance statistics are
taken from "Sun Performance and Tuning" by Adrian Cockcroft,
ISBN 0-13-149642-5.

Additional performance gathering scripts can be obtained from Infodoc 2242
for Solaris 2.x and Infodoc 11365 for SunOS 4.x.

GENERATING CORE FILES
---------------------

If looking at the performance statistics is not enough to diagnose the
problem, it is necessary to get a core file. Infodoc 11837 describes
how to do this.

If it is not possible to get a core file, then the situation is called
a hard hang. Contact SunService for information on diagnosing hard
hang situations.

Analyzing System Hang Core Files
--------------------------------

Once a core file is obtained, the first information to look at is a
threadlist generated by the adb command.

$adb -k unix.NUM vmcore.NUM | tee threadlist.NUM
physmem xxxxx
$