DF Document tagging changes

On the LinkQ project we need to display the following tags for each article:

Company relevance (for example, Nike 80%), Topics, and Events.

For Events we plan to use "detectedEvents_attr", and for Topics "DocCat_attr".

The open question is Organizations. We have them in "calais_relevance_*_attr", which is an array of strings mixing Locations, Industries, Organizations, and so on.

To tell which of these strings is an organization, we currently need to make an additional request for each string.

The DF request entity/search returns the following relevance info:

"calais_relevance_80_attr":[
"Application Software", - Industry
"Cleveland", - Location
"National Basketball Association", - Company
],

It looks like, to pick out the organizations, we need to read "Organization_attr" and then find each of those companies in "calais_relevance_*_attr". The other option is to make a request for each string and find out which ones are organizations.
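The first option can be done entirely on the client. Below is a minimal sketch, assuming that "Organization_attr" is a list of organization names whose spelling matches the labels in the relevance arrays exactly, and that the number in each "calais_relevance_*_attr" key is the relevance in percent. The function name and the sample document are hypothetical.

```python
import re

# Matches keys such as "calais_relevance_80_attr" and captures the number.
RELEVANCE_KEY = re.compile(r"^calais_relevance_(\d+)_attr$")

def organization_relevance(doc):
    """Map each organization name in doc to its highest relevance bucket."""
    # Assumption: Organization_attr is a flat list of organization labels.
    orgs = set(doc.get("Organization_attr", []))
    result = {}
    for key, labels in doc.items():
        match = RELEVANCE_KEY.match(key)
        if not match:
            continue
        score = int(match.group(1))
        for label in labels:
            # Keep only labels that are known organizations.
            if label in orgs:
                result[label] = max(score, result.get(label, 0))
    return result

# Hypothetical document shaped like the entity/search fragment above.
doc = {
    "Organization_attr": ["National Basketball Association"],
    "calais_relevance_80_attr": [
        "Application Software",
        "Cleveland",
        "National Basketball Association",
    ],
}
print(organization_relevance(doc))
```

This avoids the per-string requests, but it still does not yield an organization id, which is why the structured format below is proposed.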

We wonder whether "calais_relevance_*_attr" could carry a type description, like this:

"calais_relevance_80_attr":[
{"_type": "Industry",
"label": "Application Software"},
{"_type": "Location",
"label": "Cleveland"},
{"_type": "Organization",
"OrganizationId or Uri": "..........",
"label": "National Basketball Association"}
]

Here is an example of a permid.org tagging response for an organization:

{
"_typeGroup":"entities",
"_type":"Organization",
"relevance":0.8,
...}


Best Answer

  • Excellent question. If that's all you're doing, and you don't need further integration of the TRIT output into the Thomson Reuters Knowledge Graph, then your use case would be better served by consuming TRIT output directly and skipping Data Fusion processing altogether.

Answers

  • Why not query for the connected organizations (analyze/search) and filter by score on the client?

  • We are looking into analyze/search. We noticed that some news items are returned without tagging (although permid.org tags them), and the earliest news we receive is from (now - 2.5 hours). The earliest news also has no tagging. Does tagging take some time to appear?

  • Currently we have to make two requests:

    The first (entity/search) retrieves the news, which carries each connected company's name as a string; the second (analyze/search) retrieves the connected company's id. This means extra requests and therefore extra time, which becomes significant with thousands of news items.

    That is why we propose keeping everything in one place: for a document type, the entity/search response would return a structured entry with the company's type and id:

    "calais_relevance_80_attr":[
    {"_type": "Organization",
    "OrganizationId or Uri": "..........",
    "label": "National Basketball Association"}
    .... ]

  • Ingest, tagging, and indexing are independent processes. Our objective is to have the tagged news visible no later than 12 hours after ingest.

    The following query is used to test for it:

    GET /datafusion/api/entity/search?sort=related_uri_count&dir=desc&limit=10&extraFields=id_attr,lastModified_attr_dt&searchString=lastModified_attr_dt:[NOW-12HOURS TO NOW]
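A rough sketch of how this check could be scripted. It builds the query path shown above with proper URL encoding and applies the "id_attr" criterion stated in this answer. The function names are hypothetical, and it assumes each returned document is available as a dict with "id_attr" as a string; the actual response layout is not shown in this thread.

```python
from urllib.parse import urlencode

# Prefix that marks a document as tagged, per this answer.
TAGGED_PREFIX = "http://id.opencalais.com/"

def recent_documents_path(hours=12, limit=10):
    """Build the entity/search path for documents modified in the last N hours."""
    params = {
        "sort": "related_uri_count",
        "dir": "desc",
        "limit": limit,
        "extraFields": "id_attr,lastModified_attr_dt",
        "searchString": f"lastModified_attr_dt:[NOW-{hours}HOURS TO NOW]",
    }
    # urlencode percent-encodes the brackets, colon, and comma and turns spaces into "+".
    return "/datafusion/api/entity/search?" + urlencode(params)

def is_tagged(doc):
    """Return True if the document's id_attr starts with the Open Calais id prefix."""
    return str(doc.get("id_attr", "")).startswith(TAGGED_PREFIX)

print(recent_documents_path())
```

Passing a smaller window, for example recent_documents_path(hours=3), is one way to see how long tagging takes to appear for fresh news.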

    A document is considered tagged if it contains the following field:

    "id_attr": "http://id.opencalais.com/
  • That would defeat the purpose of having a graph database, wouldn't it?

    If you can deal with the output in bulk, you can tokenize the search (entity/search/tokenize) and then plug the token into a second search call that you can page through. The following query will give you the orgs connected to the original search list through the OrganizationToDocument predicate:

    GET /datafusion/api/entity/search?limit=10&parentUris=039e81f32c70ab168c8a1c8cf49cabfb&parentPredicateFilters=120|||http://s.opencalais.com/1/pred/OrganizationToDocument

    See also:

    https://community.developers.refinitiv.com/questions/7229/how-do-you-limit-the-number-of-levels-returned.html

    https://community.developers.refinitiv.com/questions/10447/filter-entities-by-relationship-type.html
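For illustration, the second call could be assembled as below. This is only a sketch: it assumes the value returned by entity/search/tokenize is what goes into parentUris, and it copies the "120" filter prefix verbatim from the query above without knowing its meaning. The function name is hypothetical, and the tokenize response format and paging parameters are not covered because they are not described in this thread.

```python
from urllib.parse import urlencode

# Predicate URI taken from the example query in this answer.
ORG_TO_DOC = "http://s.opencalais.com/1/pred/OrganizationToDocument"

def connected_orgs_path(token, limit=10, filter_prefix="120"):
    """Build the entity/search path for orgs linked to a tokenized search."""
    params = {
        "limit": limit,
        # Assumption: the token from entity/search/tokenize is passed as parentUris.
        "parentUris": token,
        # filter_prefix is copied from the example; its meaning is not documented here.
        "parentPredicateFilters": f"{filter_prefix}|||{ORG_TO_DOC}",
    }
    # urlencode percent-encodes the "|" separators and the predicate URI.
    return "/datafusion/api/entity/search?" + urlencode(params)

print(connected_orgs_path("039e81f32c70ab168c8a1c8cf49cabfb"))
```

With this approach the number of requests no longer grows with the number of news items: one tokenize call plus the paged follow-up searches.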

  • Separately, we're working on improving the performance of parallel queries and of queries in general. ETA Q2.

  • Thank you, Tomasz.

    I understand the purpose of the analyze/search request. However, if we already have all the necessary information (the organization name and its relevance), why not include the org id in the entity/search response?

    In our case we have to make the second request just to retrieve the org id.

  • Excellent question. If that's all you're doing, your use case would be better served by consuming TRIT output directly and skipping DF altogether.