How to get the full news story in a CSV as plain text

I have fetched news stories into a dataframe and saved it to a CSV. The stories contain HTML tags. Is there a way I can get each story as plain text?


import eikon as ek
import pandas as pd

# An Eikon app key must already be set with ek.set_app_key(...) before these calls.
syntax1 = "(Hospital OR Health Center OR Medical center OR health system OR university hospital OR Emergency Department OR Inpatient OR Rehabilitat OR ICU ) AND ( build OR reopen OR construct OR expansion OR upgrade OR develop OR repurpose OR modern )"
df = ek.get_news_headlines(syntax1, count=100, date_from="2021-03-25T00:00:00", date_to="2021-04-10T00:00:00")
# Fetch the full story (returned as HTML) for each headline; the headline index is used as DATE.
rows = []
for index, headline_row in df.iterrows():
    story = ek.get_news_story(headline_row['storyId'])
    rows.append({'DATE': index, 'STORY': story})
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows instead.
stories = pd.DataFrame(rows, columns=['DATE', 'STORY']).set_index('DATE')
result = pd.concat([df, stories], axis=1)
result.to_csv("news.csv")

The result dataframe looks like this. I want to get rid of the HTML tags.

[image: the result dataframe]

Best Answer

  • @alankar.gupta Our news stories are delivered as HTML, so you can use a package such as Beautiful Soup (BS4) to strip the HTML tags, hyperlinks, etc. and keep only the text. Please see this article on how to do it. I hope this helps.
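
    For reference, here is a minimal sketch of the Beautiful Soup approach. It is not taken from the linked article; the helper name html_to_text and the sample HTML string are made up for illustration, and it assumes the beautifulsoup4 package is installed.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the visible text of an HTML fragment as a single string."""
    # html_to_text is a hypothetical helper name, not part of any library.
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose content is not meant to be read as text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the text nodes with single spaces, trimming whitespace around each.
    return soup.get_text(separator=" ", strip=True)


# Made-up sample; real story markup will be longer and may differ.
sample = '<div class="storyContent"><p>Hospital to <b>expand</b> ICU.</p><p>Work starts in May.</p></div>'
plain = html_to_text(sample)
print(plain)
```

    With the dataframe from the question, the same helper could be applied to the story column before saving, for example result['STORY'] = result['STORY'].apply(html_to_text) followed by result.to_csv("news.csv").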

Answers

  • A simple solution is the html2text library, which converts HTML to plain (Markdown-style) text:

    import html2text
    ...

    result = pd.concat([df, stories], axis=1)
    result['STORY'] = result['STORY'].apply(html2text.html2text)
    result.to_csv("news.csv")