How do I remove/skip collecting news headlines that are in an invalid format (such as HTML)?

azzopardic · November 2021

I am running a Python program that uses get_news_headlines(). I have been using it for a while and it has almost always collected news articles correctly. Today I ran my code to collect news created this weekend and got an error. I believe the error concerns a news article that is in HTML format and for some reason cannot be collected. The error halted the whole program, which is a problem because I eventually need to collect real-time data. Is there a way to fix this, for example by skipping news articles in an invalid format?

The error is as follows:

nick.zincone · November 2021

Hi @azzopardic

Looking at your get_news() function, I noticed it performs what appears to be application initialization, i.e. set_app_key() and reading a config file. How often are you calling get_news()? I would suggest moving these lines out of get_news(), as they may be initializing your Eikon session every time you try to get news. I don't know whether this is related to the issue reported, but it may eliminate side effects of doing this over and over unnecessarily.
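
A minimal sketch of that restructuring, assuming the session setup only needs to run once at startup. `NewsClient`, `init_session` and `fetch_headlines` are hypothetical names, not part of the Eikon library: in the real program `init_session` would wrap ek.set_app_key() and the config read, and `fetch_headlines` would be ek.get_news_headlines. They are passed in so the sketch runs without an Eikon connection.

```python
from typing import Any, Callable


class NewsClient:
    """Runs session setup once, then reuses it for every headline request."""

    def __init__(self,
                 init_session: Callable[[], None],
                 fetch_headlines: Callable[..., Any]) -> None:
        # Hypothetical hooks (assumptions, see the text above).
        init_session()  # executed exactly once, when the client is created
        self._fetch = fetch_headlines

    def get_news(self, ftr: str, date_from: str, date_to: str,
                 count: int = 100) -> Any:
        # No initialization here: only the request itself.
        return self._fetch(ftr, date_from=date_from, date_to=date_to,
                           count=count)
```

With this shape, calling get_news() in a loop over many filters no longer repeats the key and config setup.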

Gurpreet · November 2021

Hi @azzopardic,

What is the complete code snippet and filter query that you are using? The error message shown above is not enough for us to determine the cause.

zoya faberov · November 2021

Hello @azzopardic ,

I fully agree with @Gurpreet. As the error did not occur before and has only just started appearing, I would suggest printing the payload before parsing it. If you see the issue again, you will then have the specific news headline that triggered it and can paste it into this question, which may be helpful in addition to the code.
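
One way to capture that kind of information, sketched under assumptions: wrap the call so that, when it raises, the query and date range are logged together with the full traceback before the exception is re-raised. `fetch_with_context` and the logger name are hypothetical, and `fetch_headlines` stands in for ek.get_news_headlines. Printing the raw payload itself depends on what the library exposes, which this sketch does not assume.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("news_collector")  # hypothetical logger name


def fetch_with_context(fetch_headlines: Callable[..., Any], query: str,
                       date_from: str, date_to: str, count: int = 100) -> Any:
    """Call fetch_headlines; on failure, log the inputs that triggered it."""
    try:
        return fetch_headlines(query, date_from=date_from, date_to=date_to,
                               count=count)
    except Exception:
        # logger.exception records the message plus the full traceback.
        logger.exception(
            "Headline request failed: query=%r date_from=%r date_to=%r count=%d",
            query, date_from, date_to, count)
        raise  # keep the original behaviour; only the logging is added
```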

azzopardic · November 2021

Hi, thank you for looking into my query,

As I stated, this issue occurs in the rare case that a news article is in an invalid format. I cannot pinpoint the exact article causing it, because the error is raised while attempting to collect the news. I only have access to the storyID of a news article once I have the news DataFrame, which I collect through this function.

Here is the code for your perusal:

def get_news(ftr, curr, date_from, date_to):

    ek.set_app_key(xxx) 

    cfg = cp.ConfigParser()
    cfg.read('eikon.cfg')

    news_curr = pd.DataFrame()

    news_curr = ek.get_news_headlines(ftr,
                                      date_from=date_from,
                                      date_to=date_to,
                                      count=100)
...

And another part of the error:

Note:

I tried looking for the news article that is causing the problem, but when I tried the same dates and filter I could not recreate the error.

Gurpreet · November 2021

Hi @azzopardic ,

The call is a standard news headlines API call, and I have never had an issue with it. Most likely it is something in your code. You will have to provide the complete code and filter query for us to help you.

azzopardic · November 2021

Hi @Gurpreet,

I have never had a problem like this either, so I doubt it is something from the code, since running it on different dates and filters returns no errors. Just in case, below please find the variables used as the parameters of the function call.

date_from = '2021-11-19T15:18:17'
date_to = '2021-11-22 07:47:51'

all_filters = {
        'EUR/AUD':'R:EURAUD=',
        'EUR/CAD':'R:EURCAD=',
        'EUR/CHF': 'R:EURCHF=',
        'EUR/GBP': 'Topic:FRX AND R:EURGBP=',
        'EUR/JPY': 'R:EURJPY=',
        'EUR/NOK': 'R:EURNOK=',
        'EUR/NZD': 'R:EURNZD=',
        'EUR/SEK': 'R:EURSEK= AND Topic:FRX',
        'EUR/USD': 'Topic:FRX AND R:EUR=',
        'GOLD': 'R:XAU=', 
        'SILVER': 'R:XAGEUR=R',
        'OIL': 'Topic:CRU' 
    }

ftr is one of the values in all_filters

Thank you for all your help

azzopardic · November 2021

Noted with thanks.

zoya faberov · November 2021

Hello @azzopardic and @Gurpreet ,

In case it is helpful, I ran a quick test on the above, this way:

date_from = '2021-11-19T15:18:17'
date_to = '2021-11-22 07:47:51'
all_filters = {
        'EUR/AUD':'R:EURAUD=',
        'EUR/CAD':'R:EURCAD=',
        'EUR/CHF': 'R:EURCHF=',
        'EUR/GBP': 'Topic:FRX AND R:EURGBP=',
        'EUR/JPY': 'R:EURJPY=',
        'EUR/NOK': 'R:EURNOK=',
        'EUR/NZD': 'R:EURNZD=',
        'EUR/SEK': 'R:EURSEK= AND Topic:FRX',
        'EUR/USD': 'Topic:FRX AND R:EUR=',
        'GOLD': 'R:XAU=', 
        'SILVER': 'R:XAGEUR=R',
        'OIL': 'Topic:CRU' 
    }
for key in all_filters:
    print(key, '->', all_filters[key])
    df = ek.get_news_headlines(all_filters[key],
                               date_from=date_from,
                               date_to=date_to,
                               count=100)
    print(df)

and was not able to reproduce the issue that you observe.

Some of the results returned were empty and others contained up to 100 headlines, but I was not able to reproduce the error.

If this is not what you are doing, @azzopardic, please advise what is different in your code and how to reproduce the issue you are facing, so that we can try to see the same.

azzopardic · December 2021

I had another issue recently with regards to invalid formats of news articles, and the same thing happened. I could not recreate the issue using the same code as I was doing before. I believe this is a problem stemming from the articles themselves.

I've just tried it again myself, and I couldn't find the article either. Unfortunately, I do not know how to solve this problem. I don't think that this problem comes from my code, as I have been using it for a couple of months now and I have never had this or similar errors until now, and the error is raised on the line containing the call to get_news_headlines().

Thank you for your assistance.
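
For the original goal of not halting the whole program, a minimal sketch is to catch the exception per filter and carry on with the rest. This is an assumption-laden illustration, not a confirmed fix: `collect_all` is a hypothetical helper and `fetch_headlines` stands in for ek.get_news_headlines. Because the error is raised before any DataFrame is returned, this skips the whole request for the affected filter rather than the single article.

```python
from typing import Any, Callable, Dict, Tuple


def collect_all(fetch_headlines: Callable[..., Any],
                filters: Dict[str, str],
                date_from: str, date_to: str,
                count: int = 100) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Fetch headlines for every filter, skipping the ones that raise."""
    results: Dict[str, Any] = {}
    failures: Dict[str, Exception] = {}
    for name, query in filters.items():
        try:
            results[name] = fetch_headlines(query, date_from=date_from,
                                            date_to=date_to, count=count)
        except Exception as exc:
            # Remember the failure for later inspection and keep going.
            failures[name] = exc
    return results, failures
```

The `failures` dictionary keeps the exception for each skipped filter, so the affected requests can be reported or retried later instead of being silently lost.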

Gurpreet · December 2021

@azzopardic,

My inclination is that there is a bug in the code. We are unable to reproduce any errors.

Can you please run your code with DEBUG logging enabled and show us the logs the next time this exception happens?
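
As a rough sketch of the Python side only, the standard logging module can write DEBUG-level records to a file. `enable_debug_logging` and the default file name are hypothetical. Whether the Eikon package sends its diagnostics through the standard logging module, or needs its own log-level setting as well, is not assumed here and should be checked in its documentation.

```python
import logging


def enable_debug_logging(log_path: str = "news_debug.log") -> logging.Logger:
    """Send DEBUG-level records from all loggers to a file."""
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s %(levelname)s %(message)s"))
    root.addHandler(handler)
    return root
```

Calling this once at startup, before the first headline request, means the log file already exists when the exception next occurs.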

azzopardic · December 2021

I will try, thank you.
