RTO disconnections occur when subscribing to many RICs

Hello

The customer is evaluating RTO with the EMA Java API.

The code works and can subscribe to data.

However, when they tried subscribing to many RICs (snapshot requests) every few minutes, disconnections became frequent. They see the error message below in the log file.


Received ChannelDownReconnecting event on channel Channel_2

(The full log is attached.)


I created a case (09805875: Optimized disconnections on 15th Apr) to check whether anything was wrong on the server side. However, the server log says "Connection reset by peer", which usually means the session was disconnected from the customer end.

<ads-fanout-med-az1-apse1-prd.1.ads: Info: Thu Apr 15 09:19:37.739375 2021>
RSSL disconnect from "GE-xxxx" at position "10.90.20.37/tkwaplm03b.ewarrant.com" on host "tkwaplm03b.ewarrant.com" using application "256" of version "etaj3.6.1.L1.all.rrg|emaj3.6.1.L1.all.rrg" on channel 61.
Reason: rsslRead() failed with code -1 and system error 11. Text: </local/jenkins/workspace/TREP34XCore_Release/OS/OL7-64/esdk/source/esdk/Cpp-C/Eta/Impl/Transport/rsslSocketTransportImpl.c:676> Error:1002 ipcRead() failure. Connection reset by peer
<END>


Hence the customer and I assume the API initiated the disconnection for some reason.

I believe the given log is not enough for a deeper analysis. Could the reason for the disconnection be analyzed if the customer changes the log level?

If yes, please advise how to change the log level. According to the customer, they experience the same disconnection every day, but at different times, so the problem should be easy to reproduce.


I would appreciate your advice.

Best regards

Noboru Maruyama


GE-A-01377963-3-5725.txt

Best Answer

  • Hello to all @umer.nalla, @zoya.farberov, @chavalit.jintamalit


    We tuned the parameters below in EmaConfig.xml:

    <GuaranteedOutputBuffers value="10000"/>

    <NumInputBuffers value="500"/>

    <SysRecvBufSize value="2097152"/>

    <SysSendBufSize value="64240"/>


    We also inserted a very short sleep between requests:

    consumer.registerClient(EmaFactory.createReqMsg().serviceName("ELEKTRON_DD").payload(batch).interestAfterRefresh(false), appClient);

    Thread.sleep(1);


    So far, this seems to have prevented the disconnections.


    Thank you

    Noboru Maruyama
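
The pacing idea in this answer can be sketched in plain Java, independent of EMA. This is a minimal illustration only: `PacedRequester`, `sendPaced`, and the RIC names are hypothetical, and in the real application the `sendRequest` callback would wrap the `consumer.registerClient(...)` call shown above. The 1 ms pause mirrors the `Thread.sleep(1)` from the answer; the right value for a given deployment is not something this sketch can tell you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of EMA: sends one request per item and
// pauses briefly between sends so requests are not issued in one burst.
final class PacedRequester {
    private PacedRequester() {}

    // Returns the number of requests actually sent.
    static int sendPaced(List<String> items, Consumer<String> sendRequest, long pauseMillis) {
        int sent = 0;
        for (String item : items) {
            sendRequest.accept(item);
            sent++;
            // No pause is needed after the last request.
            if (pauseMillis > 0 && sent < items.size()) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    // Preserve the interrupt flag and stop sending.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return sent;
    }

    public static void main(String[] args) {
        // Placeholder item names; a real application would pass its own RICs or batches.
        List<String> rics = Arrays.asList("RIC.A", "RIC.B", "RIC.C");
        List<String> sentLog = new ArrayList<>();
        // The callback here only records the item; in the application it would issue the request.
        int sent = sendPaced(rics, sentLog::add, 1);
        System.out.println("sent " + sent + " requests: " + sentLog);
    }
}
```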



Answers

  • Hello,

    I would appreciate it if someone could take a look and comment.

    Many thanks in advance

    Best regards

    Noboru

  • Hi @noboru.maruyama4

    Based on the limited information available, my best guess is that the developer is hitting a slow consumer scenario: the application cannot consume the incoming data quickly enough, buffers overflow, and a disconnect occurs. This can happen, for example, if their code in the onRefreshMsg and onUpdateMsg event handlers spends too long executing on the EMA thread context.

    However, going back to your question, if you want to enable tracing, please see the following post:

    How to enable tracing incoming/outgoing messages EMA Java receives/sends - Forum | Refinitiv Developer Community

    The key entry being:

    <XmlTraceToStdout value="1" />

    They will need to add that to the existing default channel config their application is using.

    However, if the issue is a slow consumer scenario, this will likely exacerbate it, as logging is resource intensive.

  • Hi @umer.nalla

    Many thanks for your comment.

    My initial guess was similar: a slow consumer scenario might be occurring. If that were the case, however, the session should be disconnected by the server side, and the typical error message would look like this:

    <ads-xxxxxxxxxx: Info: Tue Apr 06 03:33:10.046165 2021>

    RSSL disconnect from "GE-xxxx" at position "10.xx.x.xx/xxx.com" on host using application "256" of version "etaj3.6.1.L1.all.rrg|emaj3.6.1.L1.all.rrg" on channel 65 has been disconnected due to an overflow condition.

    <END>

    However, the server-side log shows "Connection reset by peer". I was advised that this error usually occurs when the disconnection was made from downstream, so I suspect the API may have decided to disconnect for some reason.

    I will ask the customer to enable <XmlTraceToStdout value="1" /> to get trace data.

    Thank you

    Best regards

    Noboru Maruyama



  • Hi @noboru.maruyama4

    I agree with you that it does not look like a slow consumer issue. However, looking at the original error output above, the API is reporting ChannelDown and that it is trying to reconnect, which does not suggest that the API itself is disconnecting.

    Another explanation could be network issues between the client and the server...

  • Hello @umer.nalla

    I understand your point. Let's look at the XML trace and see whether anything new can be found.

    By the way, <XmlTraceToStdout value="1" /> led to the error below, and the option did not actually work.

    SEVERE: loggerMsg
        ClientName: EmaConfig
        Severity: Error
        Text:    Unable to find tagId for XmlTraceToStdout
    loggerMsgEnd

    Is there anything we need to do before setting this parameter? Please advise.

    Thank you

    Noboru Maruyama

  • Hi @noboru.maruyama4

    Based on example 450, here is the modification:

    image

  • Hi @chavalit.jintamalit

    Many thanks for the info.

    I see it works. Will share this with the customer.

    Best regards

    Noboru

  • Hi @chavalit.jintamalit @umer.nalla

    I got feedback from the customer.

    "The log is going to be huge. It is already 120MB after only 5 minutes of running. I am not sure if can keep it running for whole day. Is there a way to configure what to trace instead of everything?"

    I still believe the trace log can be important information for understanding what is happening.

    Do you have any better ideas?

  • Hi @noboru.maruyama4

    The item stream (message) trace comes from XmlTraceToStdout.

    As far as I know, there is no filter available for this XML output.


    For connection management, you can use a logging.properties file.

    This is an example of a logging configuration.

    I added FileHandler and ConsoleHandler to the handlers list (you may remove ConsoleHandler if you do not need console output):

    .level=WARNING
    com.refinitiv.eta.valueadd.reactor.RestReactor.level=FINEST
    handlers=java.util.logging.FileHandler, java.util.logging.ConsoleHandler
    java.util.logging.ConsoleHandler.level=WARNING
    java.util.logging.ConsoleHandler.formatter=java.util.logging.SimpleFormatter
    java.util.logging.FileHandler.level=FINEST
    java.util.logging.FileHandler.pattern=./emaj%u.log
    java.util.logging.SimpleFormatter.format=%1$tY-%1$tm-%1$td %1$tH:%1$tM:%1$tS.%1$tL %4$-7s %2$s %n%5$s


    When you run the app, please make sure to add the runtime parameter that points the logging configuration to the logging.properties file:

    command line:

    java -cp ./bin;./Libs/* -Djava.util.logging.config.file=logging.properties ConsumerRTO

    Eclipse:

    image
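
One way to check that properties of this shape are actually being applied is to load them through `java.util.logging.LogManager` and read back the level of the logger in question. The sketch below is only an illustration under a few assumptions: `LoggingConfigCheck` and `configuredLevel` are hypothetical names, the configuration is passed as an inline string rather than through `-Djava.util.logging.config.file`, and the `FileHandler` entries are deliberately left out so that running the check does not create log files.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

// Hypothetical check, not part of EMA: loads a logging configuration and
// reports the level that ends up configured for a given logger name.
final class LoggingConfigCheck {
    private LoggingConfigCheck() {}

    // Subset of the properties above; FileHandler is omitted so no files are written.
    static final String CONFIG =
            ".level=WARNING\n"
          + "com.refinitiv.eta.valueadd.reactor.RestReactor.level=FINEST\n"
          + "handlers=java.util.logging.ConsoleHandler\n"
          + "java.util.logging.ConsoleHandler.level=WARNING\n";

    // Returns the explicitly configured level, or null if the logger inherits its level.
    static Level configuredLevel(String config, String loggerName) {
        try {
            // Replaces the current java.util.logging configuration for this JVM.
            LogManager.getLogManager().readConfiguration(
                    new ByteArrayInputStream(config.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Logger.getLogger(loggerName).getLevel();
    }

    public static void main(String[] args) {
        String name = "com.refinitiv.eta.valueadd.reactor.RestReactor";
        System.out.println(name + " level: " + configuredLevel(CONFIG, name));
    }
}
```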

  • Hi @noboru.maruyama4

    Did you manage to enable connection logging as recommended by my colleague Chavalit?

    I was on leave yesterday, but one thing occurred to me: if the customer is requesting snapshots, that would explain why you don't see the buffer overflow error, as the streams are closed once the snapshots are sent.

    So, the question is: what is the customer doing with the snapshots when they are received? Is the work being carried out on the EMA thread?

    Can the customer try pacing the snapshot requests? For example, don't request everything at once; request a subset, process it, and then request the next subset. Also, if they perform a resource-hungry operation, e.g. writing to a database, they should do it on a separate worker thread and not in the EMA thread context.

    The above are just suggestions - but the connection logging may provide some clues as to what is going wrong.
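
The worker-thread suggestion above can be sketched with a standard `ExecutorService`. This is a hedged illustration, not EMA code: `SnapshotHandler`, `onSnapshot`, and `slowWork` are hypothetical stand-ins for the application's callback class and its processing step. It also assumes the callback copies whatever it needs out of the message into plain values (here two strings) before handing off, rather than passing the API's message object to another thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the application's callback class (not an EMA type).
final class SnapshotHandler {
    // A single worker thread keeps processing off the callback thread.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final AtomicInteger processed = new AtomicInteger();

    // Imagine this being invoked on the API's callback thread for each snapshot.
    // It only queues the work and returns immediately.
    void onSnapshot(String ric, String payload) {
        worker.execute(() -> slowWork(ric, payload));
    }

    // Placeholder for the resource-hungry step, e.g. a database write.
    private void slowWork(String ric, String payload) {
        processed.incrementAndGet();
    }

    int processedCount() {
        return processed.get();
    }

    // Stops accepting work and waits for queued tasks to finish.
    boolean shutdownAndWait(long timeoutMillis) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        SnapshotHandler handler = new SnapshotHandler();
        for (int i = 0; i < 10; i++) {
            handler.onSnapshot("RIC." + i, "snapshot-data"); // placeholder values
        }
        boolean finished = handler.shutdownAndWait(5000);
        System.out.println("finished=" + finished + ", processed=" + handler.processedCount());
    }
}
```

An unbounded queue like the one behind `newSingleThreadExecutor()` can grow without limit if the worker falls far behind, so a real application may prefer a bounded queue; that choice is outside what this thread establishes.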

  • Hi @umer.nalla

    Many thanks for your suggestion.

    I have asked the customer to pace the snapshot requests. According to the customer, they made 5000 requests at once, so I asked them to see what happens if they split these into 5 batches of 1000.

    By the way, I believe 5000 requests should not be a problem. What do you think? In any case, I would like to see the outcome after the customer paces the requests.

    Thank you

    Noboru Maruyama
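
Splitting 5000 items into 5 batches of 1000, as described in this post, could look like the sketch below. `RicBatcher` and `partition` are hypothetical names and the RIC strings are generated placeholders; how each batch is then requested and how long to wait between batches is left to the application.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of EMA: splits a list into consecutive batches.
final class RicBatcher {
    private RicBatcher() {}

    // Every batch has batchSize items except possibly the last one.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the sublist so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> rics = new ArrayList<>();
        for (int i = 0; i < 5000; i++) {
            rics.add("RIC." + i); // placeholder names
        }
        List<List<String>> batches = partition(rics, 1000);
        // prints "5 batches, first batch has 1000 items"
        System.out.println(batches.size() + " batches, first batch has "
                + batches.get(0).size() + " items");
    }
}
```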

  • Hi @umer.nalla @chavalit.jintamalit

    From the ADS logs, I was advised that the session experienced an output threshold breach, which is an app-initiated disconnect.


    RSSL disconnect from "GE-xxxx" at position "10.90.30.xxx/xxxx.xxxx.com" on host "xxxx.xxx.com" using application "256" of version "etaj3.6.1.L1.all.rrg|emaj3.6.1.L1.all.rrg" on channel 60.

    Reason: rsslRead() failed with code -1 and system error 11. Text: </local/jenkins/workspace/TREP34XCore_Release/OS/OL7-64/esdk/source/esdk/Cpp-C/Eta/Impl/Transport/rsslSocketTransportImpl.c:676> Error:1002 ipcRead() failure. Connection reset by peer


    May I know whether the "Output threshold" can be increased? I realize this may cause other issues such as slowness. However, I would like to know where the output threshold is configured in EMA.


    Thank you


  • Hello @noboru.maruyama4,

    Please note, as we are experiencing a partial forum notification outage at this time, it is possible that @umer.nalla and @chavalit.jintamalit did not get a chance to read your message.

    It is very likely at this point, that the consumer app is being disconnected by ADS as a slow consumer. Please read this previous discussion thread for more insight and suggestions.

    If the app is consuming 5000 RICs, I would increase GuaranteedOutputBuffers as described in this previous discussion thread; I would try a setting of 10000.

    However, if this does not solve the problem, the consumer app code will still need to be redesigned to minimize the processing and time spent in callbacks, as discussed above.

    Hope this helps

  • @zoya.farberov

    Many thanks for your comments. Based on this information, I will ask the customer to increase the buffer size and see whether that improves things.

    Thank you very much

    Best regards

    Noboru Maruyama