OmmConsumer, ping timeout, channel recover and maximum load.

Hi,
We have an application using EMA Java to subscribe, process and then republish some data to our TREP. We are seeing some issues where the OmmConsumerImpl loses the connection (channel down) and then fails to recover. We have a ChannelSet configuration and you can see from the logs that the OmmConsumerImpl is bringing up a channel, getting a ping timeout (30seconds) and then trying the next channel. The ADS servers were both up at the time of this failure and other applications were working fine so this is an issue within EMA failing the connection locally. We have seen the connection recover from a channel down at other times but the example below shows that it never recovers.

The overall issue is also why we get the channel down events in the first place? The ChannelSet failing to recover is also a fairly big issue.

Is there a maximum capacity that we should use a single instance of a OmmConsumer for? We are subscribing to around 8000 tickers and we were receiving about 10k updates/second at the time of the initial Channel Down event.

Perhaps we need to tune some of the parameters, we have:

Consumer:

  • LoginRequestTimeOut=30000ms
  • User dispatch,
  • MaxDispatchCountUserThread= default (100), should we set this higher given we expect 1000s of messages a second?
  • DispatchTimeoutApiThread=0 (default), should we adjust this?

Channel:

  • NumInputBuffers=2048
  • GuaranteedOutputBuffers=5000
  • ConnectionPingTimeout= Default (looks like the 30second default below)
  • SysRecvBufSize - we're not setting this, should we?
  • SysSendBufSize - we're not setting this, should we?


Here are the logs from the channel down events:

2018-05-04 14:35:41.110 WARN  [ElektronConsumerDispatcher] 
access.OmmConsumerImpl - loggerMsg| ClientName: ChannelCallbackClient|
Severity: Warning| Text: Received ChannelDownReconnecting event on chan
nel
2.rmds.| RsslReactor @1d63c996| RsslChannel @4f10502a| Error Id 0|
Internal sysError 0| Error Location null| Error text SocketChannel.read
returned -1 (end-of-stream)| loggerMsgEnd|
2018-05-04 14:35:41.205
WARN [ElektronConsumerDispatcher] access.OmmConsumerImpl - loggerMsg|
ClientName: LoginCallbackClient| Severity: Warning| Text: RDMLogin
stream state was changed to suspect wit
h status message| username <not set>| usernameType <not set>| State: Open/Suspect/None - text: ""| loggerMsgEnd|
2018-05-04
14:35:41.243 WARN [ElektronConsumerDispatcher] access.OmmConsumerImpl -
loggerMsg| ClientName: ChannelDictionary| Severity: Warning| Text:
RDMDictionary stream state was changed to suspect with status message|
streamId 3| Reason State: Open/Suspect/None - text: "channel down."|
loggerMsgEnd|
2018-05-04 14:35:41.244 WARN
[ElektronConsumerDispatcher] access.OmmConsumerImpl - loggerMsg|
ClientName: ChannelDictionary| Severity: Warning| Text: RDMDictionary
stream state was changed to suspect with status message| streamId 4|
Reason State: Open/Suspect/None - text: "channel down."| loggerMsgEnd|
2018-05-04
14:35:41.245 WARN [ElektronConsumerDispatcher]
heartbeat.HeartbeatService - Error event for heartbeat message on HB.
ErrorEvent{subject=HB, state=UNKNOWN, statusMsg=channel down.}
2018-05-04
14:35:41.255 WARN [ElektronConsumerDispatcher]
subscribe.RmdsTickClient - Received: ErrorEvent{subject=..,
state=UNKNOWN, statusMsg=channel down.}
...
2018-05-04
14:35:42.877 WARN [ElektronConsumerDispatcher] subscribe.RmdsTickClient
- Received: ErrorEvent{subject=.., state=UNKNOWN, statusMsg=channel
down.}
2018-05-04 14:35:42.877 WARN [ElektronConsumerDispatcher]
subscribe.RmdsTickClient - Received: ErrorEvent{subject=..,
state=UNKNOWN, statusMsg=channel down.}
2018-05-04 14:35:42.878 INFO
[ElektronConsumerDispatcher] access.OmmConsumerImpl - loggerMsg|
ClientName: ChannelCallbackClient| Severity: Info| Text: Received
ChannelUp event on channel 1.rmds.| Instance Name CONSUMER_B_PREPROD_1|
Component Version ads2.6.9.L1.linux.tis.rrg 64-bit| loggerMsgEnd|
...
2018-05-04
14:40:44.365 WARN [ElektronConsumerDispatcher] access.OmmConsumerImpl -
loggerMsg| ClientName: ChannelCallbackClient| Severity: Warning| Text:
Received ChannelDownReconnecting event on channel 1.rmds.| RsslReactor
@1d63c996| RsslChannel @4f10502a| Error Id 0| Internal sysError 0| Error
Location Reactor.processWorkerEvent| Error text Ping error for channel:
Lost contact with connection...| loggerMsgEnd|
2018-05-04
14:40:44.366 WARN [ElektronConsumerDispatcher] access.OmmConsumerImpl -
loggerMsg| ClientName: LoginCallbackClient| Severity: Warning| Text:
RDMLogin stream state was changed to suspect with status message|
username <not set>| usernameType <not set>| State:
Open/Suspect/None - text: ""| loggerMsgEnd|
2018-05-04 14:40:44.367
WARN [ElektronConsumerDispatcher] subscribe.RmdsTickClient - Received:
ErrorEvent{subject=.., state=UNKNOWN, statusMsg=channel down.}
...
2018-05-04
14:40:45.391 INFO [ElektronConsumerDispatcher] access.OmmConsumerImpl -
loggerMsg| ClientName: ChannelCallbackClient| Severity: Info| Text:
Received ChannelUp event on channel 2.rmds.| Instance Name
CONSUMER_B_PREPROD_1| Component Version ads2.6.9.L1.linux.tis.rrg
64-bit| loggerMsgEnd|


Later on, we had several ping timeouts:

2018-05-04 15:08:56.215 WARN  [ElektronConsumerDispatcher] 
access.OmmConsumerImpl - loggerMsg| ClientName: ChannelDictionary|
Severity: Warning| Text: RDMDictionary stream state was changed to
suspect with status message| streamId 4| Reason State: Open/Suspect/None
- text: "channel down."| loggerMsgEnd|
2018-05-04 15:08:56.197 WARN
[ElektronConsumerDispatcher] access.OmmConsumerImpl - loggerMsg|
ClientName: ChannelCallbackClient| Severity: Warning| Text: Received
ChannelDownReconnecting event on channel 2.rmds.| RsslReactor @6091c183|
RsslChannel @26d64e5a| Error Id 0| Internal sysError 0| Error Location
Reactor.processWorkerEvent| Error text Ping error for channel: Lost
contact with connection...| loggerMsgEnd|
2018-05-04 15:08:56.204
WARN [ElektronConsumerDispatcher] access.OmmConsumerImpl - loggerMsg|
ClientName: LoginCallbackClient| Severity: Warning| Text: RDMLogin
stream state was changed to suspect with status message| username
<not set>| usernameType <not set>| State: Open/Suspect/None -
text: ""| loggerMsgEnd|
2018-05-04 15:08:56.213 WARN
[ElektronConsumerDispatcher] access.OmmConsumerImpl - loggerMsg|
ClientName: ChannelDictionary| Severity: Warning| Text: RDMDictionary
stream state was changed to suspect with status message| streamId 3|
Reason State: Open/Suspect/None - text: "channel down."| loggerMsgEnd|
2018-05-04
15:08:57.269 INFO [ElektronConsumerDispatcher] access.OmmConsumerImpl -
loggerMsg| ClientName: ChannelCallbackClient| Severity: Info| Text:
Received ChannelUp event on channel 1.rmds.| Instance Name
CONSUMER_B_PREPROD_1| Component Version ads2.6.9.L1.linux.tis.rrg
64-bit| loggerMsgEnd|
2018-05-04 15:17:14.069 WARN
[ElektronConsumerDispatcher] access.OmmConsumerImpl - loggerMsg|
ClientName: ChannelCallbackClient| Severity: Warning| Text: Received
ChannelDownReconnecting event on channel 1.rmds.| RsslReactor @6091c183|
RsslChannel @26d64e5a| Error Id 0| Internal sysError 0| Error Location
Reactor.processWorkerEvent| Error text Ping error for channel: Lost
contact with connection...| loggerMsgEnd|
2018-05-04 15:17:14.069
WARN [ElektronConsumerDispatcher] access.OmmConsumerImpl - loggerMsg|
ClientName: LoginCallbackClient| Severity: Warning| Text: RDMLogin
stream state was changed to suspect with status message| username
<not set>| usernameType <not set>| State: Open/Suspect/None -
text: ""| loggerMsgEnd|
2018-05-04 15:17:15.116 INFO
[ElektronConsumerDispatcher] access.OmmConsumerImpl - loggerMsg|
ClientName: ChannelCallbackClient| Severity: Info| Text: Received
ChannelUp event on channel 2.rmds.| Instance Name CONSUMER_B_PREPROD_1|
Component Version ads2.6.9.L1.linux.tis.rrg 64-bit| loggerMsgEnd|
2018-05-04
15:19:44.217 WARN [ElektronConsumerDispatcher] access.OmmConsumerImpl -
loggerMsg| ClientName: ChannelCallbackClient| Severity: Warning| Text:
Received ChannelDownReconnecting event on channel 2.rmds.| RsslReactor
@6091c183| RsslChannel @26d64e5a| Error Id 0| Internal sysError 0| Error
Location Reactor.processWorkerEvent| Error text Ping error for channel:
Lost contact with connection...| loggerMsgEnd|
2018-05-04
15:19:44.218 WARN [ElektronConsumerDispatcher] access.OmmConsumerImpl -
loggerMsg| ClientName: LoginCallbackClient| Severity: Warning| Text:
RDMLogin stream state was changed to suspect with status message|
username <not set>| usernameType <not set>| State:
Open/Suspect/None - text: ""| loggerMsgEnd|
2018-05-04 15:19:45.258
INFO [ElektronConsumerDispatcher] access.OmmConsumerImpl - loggerMsg|
ClientName: ChannelCallbackClient| Severity: Info| Text: Received
ChannelUp event on channel 1.rmds.| Instance Name CONSUMER_B_PREPROD_1|
Component Version ads2.6.9.L1.linux.tis.rrg 64-bit| loggerMsgEnd|
2018-05-04
16:07:23.093 WARN [ElektronConsumerDispatcher] access.OmmConsumerImpl -
loggerMsg| ClientName: ChannelCallbackClient| Severity: Warning| Text:
Received ChannelDownReconnecting event on channel 1.rmds.| RsslReactor
@6091c183| RsslChannel @26d64e5a| Error Id 0| Internal sysError 0| Error
Location Reactor.processWorkerEvent| Error text Ping error for channel:
Lost contact with connection...| loggerMsgEnd|
2018-05-04
16:07:23.094 WARN [ElektronConsumerDispatcher] access.OmmConsumerImpl -
loggerMsg| ClientName: LoginCallbackClient| Severity: Warning| Text:
RDMLogin stream state was changed to suspect with status message|
username <not set>| usernameType <not set>| State:
Open/Suspect/None - text: ""| loggerMsgEnd|
2018-05-04 16:07:24.663
INFO [ElektronConsumerDispatcher] access.OmmConsumerImpl - loggerMsg|
ClientName: ChannelCallbackClient| Severity: Info| Text: Received
ChannelUp event on channel 2.rmds.| Instance Name CONSUMER_B_PREPROD_1|
Component Version ads2.6.9.L1.linux.tis.rrg 64-bit| loggerMsgEnd|
2018-05-04
16:22:31.342 WARN [ElektronConsumerDispatcher] access.OmmConsumerImpl -
loggerMsg| ClientName: ChannelCallbackClient| Severity: Warning| Text:
Received ChannelDownReconnecting event on channel 2.rmds.| RsslReactor
@6091c183| RsslChannel @26d64e5a| Error Id 0| Internal sysError 0| Error
Location Reactor.processWorkerEvent| Error text Ping error for channel:
Lost contact with connection...| loggerMsgEnd|
2018-05-04
16:22:31.343 WARN [ElektronConsumerDispatcher] access.OmmConsumerImpl -
loggerMsg| ClientName: LoginCallbackClient| Severity: Warning| Text:
RDMLogin stream state was changed to suspect with status message|
username <not set>| usernameType <not set>| State:
Open/Suspect/None - text: ""| loggerMsgEnd|

Best Answer

  • After the investigation, somehow this “Error text Ping error for channel: Lost contact with connection...” error message doesn't occur with EMA version 3.2.0.1 (Elektron-SDK_1.2.0.1.L1). It could be used as a workaround/resolution for this issue.

Answers