Weblogic Managed Server Start Fails: “the Node Manager associated with machine SOAHOST1 is not reachable”

We have faced weird issue while starting Weblogic managed server from admin console. We spent good amount of time in analyzing and fixing the issue. Sharing this if it can help others.

We made a configuration change in weblogic domain that required to restart all servers including admin server. We did rolling restart of servers. All servers were started except one of the managed servers. While starting the server from console, we got the error: “the Node Manager associated with machine SOAHOST1 is not reachable

We logged in into manged server VM and noticed that NM was running fine. Somehow, admin server showing status as unreachable in admin console (Environment -> Machines -> Machine Name -> Monitoring -> Node Manager Status). To confirm that it was not network issue, we tested network connectivity using telnet (telnet managedServerHost:5556). Connection was successful. We checked the Admin server log and found below error:

<Nov 16, 2022 11:41:07,357 AM PST> <Error> <oracle.ohs.mbeans> <BEA-000000> <Received error message from Node Manager Server: [Access to domain 'soa_domain' for user 'testnmgr' denied]. Please check Node Manager log for details.
        weblogic.nodemanager.NMException: Received error message from Node Manager Server: [Access to domain 'soa_domain' for user 'testnmgr' denied]. Please check Node Manager log for details.
                at weblogic.nodemanager.client.NMServerClient.checkResponse(NMServerClient.java:889)
                at weblogic.nodemanager.client.NMServerClient.connect(NMServerClient.java:739)
                at weblogic.nodemanager.client.NMServerClient.checkConnected(NMServerClient.java:668)
                at weblogic.nodemanager.client.NMServerClient.checkConnected(NMServerClient.java:674)
                at weblogic.nodemanager.client.NMServerClient.invocationRequest(NMServerClient.java:449)
                at weblogic.nodemanager.client.DefaultSSLClient.invocationRequest(DefaultSSLClient.java:14)
                at weblogic.nodemanager.mbean.NodeManagerRuntime.invocationRequest(NodeManagerRuntime.java:1099)
                at oracle.ohs.OHSConfigHelper.nmCommand(Unknown Source)

At this time we were clueless as the change done was not related to node manager password and also we hadn’t encountered this issue while restarting other managed servers. To check what was the password being sent to node manager by Admin server, we ran strace command on Admin server process. We couldn’t read the network traffic as connection between Admin server and node manager was SSL. To capture traffic in plain text, we changed connection type of NM to PLAIN in console and node manager configuration ($DOMAIN_HOME/nodemanager/nodemanager.properties). With this change, we were able to capture password from strace output. Truncated strace output is as below:

72013 22:27:19 sendto(719<socket:[1849717557]>, "\0&+OK Current domain set to 'soa_domain'\0\0", 42, 0, NULL, 0) = 42 <0.000027>
72013 22:27:19 recvfrom(719<socket:[1849717557]>, "\0\r", 2, 0, NULL, NULL) = 2 <0.000116>
72013 22:27:19 recvfrom(719<socket:[1849717557]>, "USER nm_user", 13, 0, NULL, NULL) = 13 <0.000016>
72013 22:27:19 recvfrom(719<socket:[1849717557]>, "\0\0", 2, 0, NULL, NULL) = 2 <0.000016>
72013 22:27:19 sendto(719<socket:[1849717557]>, "\0\34+OK NM usr set to 'nm_user'\0\0", 32, 0, NULL, 0) = 32 <0.000022>
72013 22:27:19 recvfrom(719<socket:[1849717557]>, "\0\20", 2, 0, NULL, NULL) = 2 <0.000112>
72013 22:27:19 recvfrom(719<socket:[1849717557]>, "PASS nmpassword", 16, 0, NULL, NULL) = 16 <0.000053>
72013 22:27:19 recvfrom(719<socket:[1849717557]>, "\0\0", 2, 0, NULL, NULL) = 2 <0.000016>
72013 22:27:19 access("/oracle/config/mserver/domains/soa_domain/config/nodemanager/nm_password.properties", R_OK) = 0 <0.000019>
72013 22:27:19 sendto(719<socket:[1849717557]>, "\0=-ERR Access to domain 'soa_domain' for user 'nm_user' denied\0\0", 65, 0, NULL, 0) = 65 <0.000026>

We were able to connect to NM using WLST nmConnect command with the captured password. So, we concluded that there was no issue with password. We compared $DOMAIN_HOME/config/nodemanager/nm_password.properties file across all managed servers and all were same. As we are ran out of ideas, we wanted to try out entering credentials in plain text and restart NM. After NM restart, NM status changed to Active in console and we didn’t get earlier error when clicked on server Start button in console. I was confident that server would eventually go into RUNNING state, I went for a brisk walk. When I came back, to my surprise, server state changed to FAILED_NOT_RESTARTABLE state.

Luckily, this time weblogic written much clearer log:

<Nov 17, 2022 12:47:14,635 AM PST> <Critical> <WebLogicServer> <BEA-000386> <Server subsystem failed. Reason: A MultiException has 2 exceptions.  They are:
1. java.lang.AssertionError: java.lang.reflect.InvocationTargetException
2. java.lang.IllegalStateException: Unable to perform operation: post construct on weblogic.server.channels.ChannelService

A MultiException has 2 exceptions.  They are:
1. java.lang.AssertionError: java.lang.reflect.InvocationTargetException
2. java.lang.IllegalStateException: Unable to perform operation: post construct on weblogic.server.channels.ChannelService

        at org.jvnet.hk2.internal.ClazzCreator.create(ClazzCreator.java:392)
        at org.jvnet.hk2.internal.SystemDescriptor.create(SystemDescriptor.java:487)
        at org.glassfish.hk2.runlevel.internal.AsyncRunLevelContext.findOrCreate(AsyncRunLevelContext.java:305)
        ...
weblogic.descriptor.DescriptorManager$SecurityServiceImpl$SecurityProxy._invokeServiceMethod(DescriptorManager.java:178)
        Truncated. see log file for complete stacktrace
Caused By: weblogic.security.internal.encryption.EncryptionServiceException: com.rsa.jsafe.JSAFE_PaddingException: Invalid padding.
        at weblogic.security.internal.encryption.JSafeEncryptionServiceImpl.decryptBytes(JSafeEncryptionServiceImpl.java:144)
        at weblogic.security.internal.encryption.JSafeEncryptionServiceImpl.decryptString(JSafeEncryptionServiceImpl.java:192)
        at weblogic.security.internal.encryption.ClearOrEncryptedService.decrypt(ClearOrEncryptedService.java:99)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        Truncated. see log file for complete stacktrace
Caused By: com.rsa.jsafe.JSAFE_PaddingException: Invalid padding.
        at com.rsa.jsafe.JSAFE_SymmetricCipher.decryptFinal(Unknown Source)    

A quick search in oracle support, gave hint that it may be due to incorrect/corrupted $DOMAIN_HOME/security/SerializedSystemIni.dat. We copied the file from admin server to the managed server. We were able to start managed server from console without any error after this change.

We are not exactly sure how SerializedSystemIni.dat got corrupted. But we noticed that Admin server was started from the managed server domain bin location just before the issue. We were suspecting that this could have corrupted the file.

Had we copied the ini file in the first step, we wouldn’t even had to re-enter user name and password in nm_password.properties. Due to corruption in the ini file, nodemanager was not able to decrypt the credentials in nm_password.properties.

Posted In Uncategorised

Related Post