A 'get_news_story' request into a dataframe

di.ti · September 2023

Hi, I am using ek.get_news_headlines to display a dataframe of 5 news articles for a particular company, i.e.:

df = ek.get_news_headlines('GOOG.O AND Language:LEN', date_from='2021-01-01T09:00:00', date_to='2023-06-30T23:59:59', count = 5)

The above works fine for displaying the last 5 storyIds, but I'd like to use ek.get_news_story to loop through the rows of that df and pull the article for each storyId into another dataframe. When I try the snippet below, which I found in another post, I just get an HTML dump from the first storyId only.

for idx, storyId in enumerate(headlines['storyId'].values): #for each row in our df dataframe
    newsText = ek.get_news_story(storyId) #get the news story
    time.sleep(5) # sleep for 5 seconds
    print(newsText)

I'd ideally like to see 1 new dataframe containing 5 rows (one row for each news article), one column with the news article's title, another column containing just the text from each article (no HTML tags!), and then another column of the URL.

Any help would be greatly appreciated.

Thank you!

Jirapongse · September 2023

@di.ti

Thank you for reaching out to us.

To get the story text without HTML tags, you need to use the Refinitiv Data Library for Python. The example code is available on GitHub.

The code looks like this:

import time
import pandas as pd
import refinitiv.data as rd

rd.open_session()  # assumes a configured Refinitiv Data Library session

headlines = rd.news.get_headlines('GOOG.O AND Language:LEN',
                                  start='2021-01-01T09:00:00',
                                  end='2023-06-30T23:59:59',
                                  count=5)
rows = []  # DataFrame.append was removed in pandas 2.0, so collect rows in a list
for index, row in headlines.iterrows():
    newsText = rd.news.get_story(row['storyId'], format=rd.news.Format.TEXT)  # get the news story as plain text
    rows.append({'headline': row['headline'], 'story': newsText, 'storyid': row['storyId']})
    time.sleep(5)  # pause between requests

df = pd.DataFrame(rows, columns=['headline', 'story', 'storyid'])
df

The output is a dataframe with one row per story and the columns headline, story, and storyid.
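If switching libraries is not possible and you only have the HTML that ek.get_news_story returns, one option is to strip the tags yourself. Below is a minimal sketch using only the Python standard library. The name html_to_text is a hypothetical helper, not part of the Eikon or Refinitiv libraries, and the handling of line breaks is an assumption about how the story HTML is laid out.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags, scripts, and styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in ("br", "p", "div"):
            # Assumption: these tags mark a line break in the story body.
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse repeated whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Inside the loop this would be used as html_to_text(ek.get_news_story(storyId)), assuming the call returns an HTML string as described in the question.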

di.ti · September 2023

Thank you, this worked. Any idea of how I can include a column for the timestamp of each article too?

Jirapongse · September 2023

Please try this one:

import time
import pandas as pd
import refinitiv.data as rd

rd.open_session()  # assumes a configured Refinitiv Data Library session

headlines = rd.news.get_headlines('GOOG.O AND Language:LEN',
                                  start='2021-01-01T09:00:00',
                                  end='2023-06-30T23:59:59',
                                  count=5)
headlines = headlines.reset_index()  # move the versionCreated index into a regular column
rows = []  # DataFrame.append was removed in pandas 2.0, so collect rows in a list
for index, row in headlines.iterrows():
    newsText = rd.news.get_story(row['storyId'], format=rd.news.Format.TEXT)  # get the news story as plain text
    rows.append({'timestamp': row['versionCreated'], 'headline': row['headline'], 'story': newsText, 'storyid': row['storyId']})
    time.sleep(5)  # pause between requests

df = pd.DataFrame(rows, columns=['timestamp', 'headline', 'story', 'storyid'])
df
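If one story request fails, a loop like the one above stops partway through. The sketch below is one way to keep going: it separates fetching from table building and records an error per row instead of raising. The names collect_stories, fetch_story, pause_seconds, and the error field are hypothetical and not part of the Refinitiv Data Library. The function only assumes each headline is a dict with the storyId, headline, and versionCreated keys used in this thread.

```python
import time


def collect_stories(headlines, fetch_story, pause_seconds=5.0):
    """Return one dict per headline with the story text attached.

    headlines   -- iterable of dicts with 'storyId', 'headline', 'versionCreated'
    fetch_story -- callable that takes a story ID and returns the story text
    """
    rows = []
    for i, item in enumerate(headlines):
        if i:  # pause between requests, but not before the first one
            time.sleep(pause_seconds)
        try:
            story = fetch_story(item["storyId"])
            error = None
        except Exception as exc:  # record the failure and continue with the next story
            story = None
            error = str(exc)
        rows.append({
            "timestamp": item.get("versionCreated"),
            "headline": item.get("headline"),
            "story": story,
            "storyid": item["storyId"],
            "error": error,
        })
    return rows


# Possible usage with an open session (not run here):
#   records = headlines.reset_index().to_dict("records")
#   rows = collect_stories(
#       records,
#       lambda sid: rd.news.get_story(sid, format=rd.news.Format.TEXT),
#   )
#   df = pd.DataFrame(rows)
```

Passing the fetch function in as an argument also makes it possible to try the loop offline with a stand-in function before spending real requests.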

di.ti · September 2023

Thank you @Jirapongse, this was exactly what I was looking for!

One last question on this topic, please:

Is it possible to do a free-text search as part of this news query? For example, if I wanted to pull news articles into a dataframe where "Elon Musk SpaceX" was my search term?

Thank you!

Jirapongse · September 2023

@di.ti

Yes, you can use the free text search.

df = ek.get_news_headlines(query='"Elon Musk SpaceX"', count=100)
df
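The queries in this thread are plain strings whose parts are joined with AND (for example 'GOOG.O AND Language:LEN'). A small helper can assemble them, as sketched below. build_news_query is a hypothetical name. It is also an assumption that the news service accepts a quoted phrase combined with other filters; the thread only shows each form on its own.

```python
def build_news_query(*terms, phrase=None):
    """Join query terms with AND, optionally leading with a quoted phrase."""
    parts = []
    if phrase:
        # Wrap the phrase in double quotes; drop any quotes already inside it.
        parts.append('"' + phrase.replace('"', "") + '"')
    # Skip empty or whitespace-only terms.
    parts.extend(t.strip() for t in terms if t and t.strip())
    return " AND ".join(parts)


query = build_news_query("Language:LEN", phrase="Elon Musk SpaceX")
# query is the string: "Elon Musk SpaceX" AND Language:LEN
```

The resulting string would then be passed as the query argument, in the same way as the hand-written strings above.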

di.ti · October 2023

Hi @Jirapongse, another question please: how would I run the same query using the company's PermID instead of the "TSLA.O" code? Some of the companies in my search are not publicly traded. Thank you!

di.ti · October 2023

It's OK @Jirapongse, I worked it out:

get_headlines('4297089638 AND SIG AND Language:LEN',
