Tuesday, October 19, 2010

-- CRS will not start on one node (Fixed on 11.2.0.2 and Workaround below) --


Hdr: 9469133 11.2.0.1 PCW 11.2.0.1 CSS PRODID-5 PORTID-197
Abstract: CORE DUMP OF OCSSD.BIN WHEN VOTING DISK IS NOT ACTIVATED.

Some explanation from development:
==================================
In 11.2, voting files are discovered, not hard-wired, so we look through a
list of files that are specified in the 'discovery string', e.g.
/dev/vdisk/*, and use all files that appear to be legitimate voting files,
i.e. they have a TOC (Table Of Contents), volume info block, etc.  Since the
VG with the voting files is not online, the discovery does not see them at
all, so does not consider them as voting files and fails as a result of an
inability to find enough voting files.

=====================================

The workaround is to wait until the disks are completely ONLINE after machine boot before the cluster is started
(note 459169.1 CRS Does Not Startup Automatically After Node Reboot,
                   Manual Start is OK - but this note applies only up to version 11.1)

or apply the patch; a backport is available for Solaris x86-64 (64-bit) - please confirm the operating system

Please upload the patch inventory - I may need to request a backport for you
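
The first workaround above (waiting for the disks before the cluster starts) could be scripted roughly as follows. This is only a sketch under assumptions: the function name wait_for_disks, the pattern /dev/vdisk/*, the count of 3 and the 300-second timeout are all hypothetical, and it only checks that the paths are readable, not that they are valid voting files. It also assumes autostart has been turned off (for example with crsctl disable crs) so the manual start is the only one.

```shell
#!/bin/sh
# Sketch only: wait until at least MIN readable paths match PATTERN.
# wait_for_disks is a hypothetical helper, not part of Grid Infrastructure.
# Usage: wait_for_disks PATTERN MIN TIMEOUT_SECONDS
wait_for_disks() {
    pattern=$1
    min=$2
    timeout=$3
    elapsed=0
    while :; do
        count=0
        # $pattern is left unquoted on purpose so the shell expands the glob.
        for f in $pattern; do
            if [ -r "$f" ]; then
                count=$((count + 1))
            fi
        done
        if [ "$count" -ge "$min" ]; then
            return 0    # enough candidate disks are visible
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # gave up waiting
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Example (assumed values; the grid home is taken from the boot messages):
# wait_for_disks '/dev/vdisk/*' 3 300 && /u01/app/grid/bin/crsctl start crs
```

The pattern should match whatever the real discovery string is on the node, and the minimum count should match the number of voting files configured.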

We're getting these errors on the other node when rebooting:

I did see these items in the boot messages on both hosts:
Johnston, Nathaniel [10:39 AM]:
Sep 24 16:05:15 mhddb-nb-2p.philadelphia.pa.bo.comcast.net root: Oracle HA daemon is enabled for autostart.
Johnston, Nathaniel [10:39 AM]:
Sep 24 16:05:18 mhddb-nb-2p.philadelphia.pa.bo.comcast.net unix: vn_rdwr failed with error 0x15
Sep 24 16:05:18 mhddb-nb-2p.philadelphia.pa.bo.comcast.net unix: kobj_load_module: read header failed
Johnston, Nathaniel [10:39 AM]:
Sep 24 16:05:19 mhddb-nb-2p.philadelphia.pa.bo.comcast.net root: exec /u01/app/grid/perl/bin/perl -I/u01/app/grid/perl/lib /u01/app/grid/bin/crswrapexece.pl /u01/app/grid/crs/install/s_crsconfig_mhddb-nb-2p_env.txt /u01/app/grid/bin/ohasd.bin "reboot"
Sep 24 16:05:19 mhddb-nb-2p.philadelphia.pa.bo.comcast.net root: exec /u01/app/grid/perl/bin/perl -I/u01/app/grid/perl/lib /u01/app/grid/bin/crswrap
Johnston, Nathaniel [10:40 AM]:
Sep 24 16:05:36 mhddb-nb-2p.philadelphia.pa.bo.comcast.net mDNSResponder (Engineering Build) (Nov 2 2009 05:02:07) [5272]: starting
Sep 24 16:05:37 mhddb-nb-2p.philadelphia.pa.bo.comcast.net mDNSResponder: Oracle mDNSResponder starting

======================================
                               WORKAROUND
======================================

11gR2 CRS doesn't startup after node reboot [ID 1050164.1]


Modified 31-JAN-2010     Type PROBLEM     Status PUBLISHED





Applies to:

Oracle Server - Enterprise Edition - Version: 11.2.0.1.0 to 11.2.0.1.0 - Release: 11.2 to 11.2
Generic Linux

Symptoms

  • Installation of the 11gR2 Grid Infrastructure on a Linux cluster completed successfully
  • OCR & Voting files located in ASM diskgroup
  • using ASMLIB driver
  • ASM disks are located on multipath devices (/dev/mapper/)
  • following a node reboot CRS does not startup
  • CSS daemon log shows the following message:
    2010-01-13 09:04:15.075: [ CSSD][1150449984]clssnmvDDiscThread: using discovery string for initial discovery
    2010-01-13 09:04:15.075: [ SKGFD][1150449984]Discovery with str::
    2010-01-13 09:04:15.075: [ SKGFD][1150449984]UFS discovery with ::
    2010-01-13 09:04:15.075: [ SKGFD][1150449984]OSS discovery with ::
    2010-01-13 09:04:15.076: [ SKGFD][1150449984]Discovery with asmlib :ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so: str ::
    2010-01-13 09:04:15.076: [ SKGFD][1150449984]Fetching asmlib disk :ORCL:DATA1:
    2010-01-13 09:04:15.076: [ SKGFD][1150449984]Fetching asmlib disk :ORCL:DATA2:
    2010-01-13 09:04:15.076: [ SKGFD][1150449984]Fetching asmlib disk :ORCL:DATA3:
    2010-01-13 09:04:15.076: [ SKGFD][1150449984]Fetching asmlib disk :ORCL:DATA4:
    2010-01-13 09:04:15.077: [ SKGFD][1150449984]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op asm_open error Operation not permitted)
    2010-01-13 09:04:15.077: [ SKGFD][1150449984]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op asm_open error Operation not permitted)
    2010-01-13 09:04:15.077: [ SKGFD][1150449984]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op asm_open error Operation not permitted)
    2010-01-13 09:04:15.077: [ SKGFD][1150449984]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op asm_open error Operation not permitted)
    2010-01-13 09:04:15.077: [ CSSD][1150449984]clssnmvDiskVerify: Successful discovery of 0 disks
    2010-01-13 09:04:15.077: [ CSSD][1150449984]clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
    2010-01-13 09:04:15.077: [ CSSD][1150449984]clssnmvFindInitialConfigs: No voting files found
    2010-01-13 09:04:15.077: [ CSSD][1150449984]###################################
    2010-01-13 09:04:15.077: [ CSSD][1150449984]clssscExit: CSSD signal 11 in thread clssnmvDDiscThread
    2010-01-13 09:04:15.077: [ CSSD][1150449984]###################################
    2010-01-13 09:04:15.077: [ CSSD][1139960128]clssgmClientShutdown: total iocapables 0
    2010-01-13 09:04:15.077: [ CSSD][1139960128]clssgmClientShutdown: graceful shutdown completed.
    2010-01-13 09:04:15.077: [ CSSD][1150449984]
  • running the cluster verification utility returns the following messages:
    /cluvfy stage -post crsinst -n racnode1

    Performing post-checks for cluster services setup
    Checking node reachability...
    Node reachability check passed from node "racnode1"
    Checking user equivalence...
    User equivalence check passed for user "grid"
    Checking time zone consistency...
    Time zone consistency check passed.
    ERROR:
    Cluster manager integrity check failed
    PRVF-5434 : Cannot identify the current CRS software version
    UDev attributes check for OCR locations started...
    UDev attributes check passed for OCR locations
    UDev attributes check for Voting Disk locations started...
    ERROR:
    PRVF-5197 : Failed to retrieve voting disk locations
    UDev attributes check failed for Voting Disk locations
    Default user file creation mask check passed
    Checking cluster integrity...
    Cluster integrity check failed This check did not run on the following node(s):
    racnode1
    Checking OCR integrity...
    Checking the absence of a non-clustered configuration...
    All nodes free of non-clustered, local-only configurations
    ERROR:
    PRVF-5300 : Failed to retrieve active version for CRS on this node
    OCR integrity check failed
    Checking CRS integrity...
    ERROR:
    PRVF-5300 : Failed to retrieve active version for CRS on this node
    CRS integrity check failed
    OCR detected on ASM. Running ACFS Integrity checks...
    Starting check to see if ASM is running on all cluster nodes...
    PRVF-5137 : Failure while checking ASM status on node "racnode1"
    Starting Disk Groups check to see if at least one Disk Group configured...
    PRVF-5112 : An Exception occurred while checking for Disk Groups
    PRVF-5114 : Disk Group check failed. No Disk Groups configured
    Task ACFS Integrity check failed
    Checking Oracle Cluster Voting Disk configuration...
    ERROR:
    PRVF-5434 : Cannot identify the current CRS software version
    PRVF-5431 : Oracle Cluster Voting Disk configuration check failed
    User "grid" is not part of "root" group. Check passed
    Post-check for cluster services setup was unsuccessful on all the nodes.


Changes

Node was rebooted after install.

Cause


The CSS daemon crashes because it cannot locate any Voting files in any of the discovered ASM disks, which is indicated by the following message in the CSS daemon log (/log//cssd/ocssd.log):

2010-01-13 09:04:15.077: [ CSSD][1150449984]clssnmvFindInitialConfigs: No voting files found


This error is preceded by the following ASMLIB error:
2010-01-13 09:04:15.077: [ SKGFD][1150449984]ERROR: -15(asmlib ASM:/opt/oracle/extapi/64/asm/orcl/1/libasm.so op asm_open error Operation not permitted)
suggesting that ASMLIB has a problem accessing the ASM disks.
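
To check quickly whether a node shows this same signature, the two messages can be searched for in the CSS daemon log. The sketch below is an assumption-laden helper (check_cssd_log is a hypothetical name, and the log path must be supplied by the caller); it only looks for the two strings quoted above.

```shell
#!/bin/sh
# Sketch only: look for the "No voting files found" signature in a CSS log
# and count the ASMLIB asm_open errors that accompany it.
# check_cssd_log is a hypothetical helper name.
# Usage: check_cssd_log /path/to/ocssd.log
check_cssd_log() {
    log=$1
    if grep -q 'clssnmvFindInitialConfigs: No voting files found' "$log"; then
        # grep -c prints 0 when nothing matches
        errors=$(grep -c 'asm_open error' "$log")
        echo "no voting files found; asmlib open errors: $errors"
        return 1
    fi
    echo "voting file discovery looks ok"
    return 0
}
```

A non-zero error count together with the first message matches the case described in this note; a count of zero suggests the voting files are missing for some other reason.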


Solution

1. either edit the file /etc/sysconfig/oracleasm-_dev_oracleasm and change the lines:
ORACLEASM_SCANORDER=""
ORACLEASM_SCANEXCLUDE=""
to
ORACLEASM_SCANORDER="dm"
ORACLEASM_SCANEXCLUDE="sd"
or alternatively run the following command (as user root):
/usr/sbin/oracleasm configure -i -e -u user -g group -o "dm" -x "sd"

2. stop & restart ASMLIB as user root using:
/usr/sbin/oracleasm exit
/usr/sbin/oracleasm init

3. restart CRS or reboot node

The above steps need to be executed on all nodes.
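
Since the file edit in step 1 has to be repeated on every node, it may be easier to script. The following is a minimal sketch, not an Oracle-supplied tool: set_asmlib_scan is a hypothetical function name, it assumes the two ORACLEASM_SCAN* lines already exist in the file, and it keeps a .bak copy next to the original.

```shell
#!/bin/sh
# Sketch only: set ORACLEASM_SCANORDER="dm" and ORACLEASM_SCANEXCLUDE="sd"
# in an ASMLIB configuration file, keeping a backup of the original.
# set_asmlib_scan is a hypothetical helper name.
# Usage (as root): set_asmlib_scan /etc/sysconfig/oracleasm-_dev_oracleasm
set_asmlib_scan() {
    conf=$1
    # Keep a backup, then rewrite the original in place from the backup so
    # the file keeps its owner and permissions.
    cp -p "$conf" "$conf.bak" || return 1
    sed -e 's/^ORACLEASM_SCANORDER=.*/ORACLEASM_SCANORDER="dm"/' \
        -e 's/^ORACLEASM_SCANEXCLUDE=.*/ORACLEASM_SCANEXCLUDE="sd"/' \
        "$conf.bak" > "$conf"
}
```

Steps 2 and 3 (restarting ASMLIB and CRS) still have to be run afterwards exactly as listed above.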
