How to sort RDP tick history files by size
Below is a sample response from pulling a file-set via the RDP CFS API. Can you advise how to use 'jq' to sort the response by the value of 'size'? Thanks.
{
"value": [
{
"id": "4a8e-1abb-6fd8c40c-94f4-352733f33ebc",
"name": "salesforce-1-XXX_normalised_89568BBAF27C4E2EAC292CC28D80E875",
"bucketName": "TICKHISTORY_CUSTOM_ARCHIVE",
"packageId": "4a13-95e9-c37783d6-bba0-4851594758d0",
"attributes": [
{
"name": "fileName",
"value": "hmh tas 2013"
},
{
"name": "size",
"value": "1858351473"
},
{
"name": "view",
"value": "normalised"
},
{
"name": "message_id",
"value": "e9d7e3cb-e0ff-4f4c-8150-d327a0f9e51b"
},
{
"name": "md5",
"value": "1de22361eebb6993217e5da10e9a4798"
},
{
"name": "releaseTime",
"value": "2022-08-17T10:45:28Z"
}
],
"files": [
"4478-d443-7f611230-bdf4-86c64a1deb05",
"45d2-4891-60c79c56-a3bd-71ddb0ba2425",
"4761-13f8-a64f6b32-ae82-250cc77c6b99"
],
"numFiles": 3,
"availableFrom": "2022-08-17T12:07:02Z",
"availableTo": "2023-01-13T18:26:30Z",
"status": "READY",
"created": "2022-08-17T12:07:03Z",
"modified": "2022-08-17T12:07:04Z"
},
{
"id": "4aa5-3b50-d2924fcb-8633-032e977e0041",
"name": "salesforce-1-XXX_normalised_37316F8A011E432AAA540E5868132330",
"bucketName": "TICKHISTORY_CUSTOM_ARCHIVE",
"packageId": "4a13-95e9-c37783d6-bba0-4851594758d0",
"attributes": [
{
"name": "md5",
"value": "e55e6731f5b008d28e1c070436bc5ebb"
},
{
"name": "view",
"value": "normalised"
},
{
"name": "message_id",
"value": "b6a715b3-e1f5-4b1f-bdb4-3aded9173939"
},
{
"name": "releaseTime",
"value": "2022-08-17T10:45:26Z"
},
{
"name": "fileName",
"value": "XXX index tas 2007"
},
{
"name": "size",
"value": "1197364640"
}
],
"files": [
"41c3-80e1-f008c970-b117-72dfa0d856c3",
"4565-eed3-1caf7c98-9872-5dd0fe328f9f",
"45be-0607-b8b66964-8f94-2369ce109094"
],
"numFiles": 3,
"availableFrom": "2022-08-17T11:03:06Z",
"availableTo": "2023-01-13T16:34:39Z",
"status": "READY",
"created": "2022-08-17T11:03:08Z",
"modified": "2022-08-17T11:03:10Z"
},
{
"id": "4ab0-7d5e-fefa8fd5-98e2-6a34b035cfff",
"name": "salesforce-1-XXX_normalised_BB8B299D6C634AF580CE9DAC7C5F28A8",
"bucketName": "TICKHISTORY_CUSTOM_ARCHIVE",
"packageId": "4a13-95e9-c37783d6-bba0-4851594758d0",
"attributes": [
{
"name": "md5",
"value": "9552ea275e7eb1987ee1627f845a3a4b"
},
{
"name": "releaseTime",
"value": "2022-08-17T10:45:56Z"
},
{
"name": "message_id",
"value": "9e719be7-0dde-4a01-b954-7f4ab9348d16"
},
{
"name": "size",
"value": "2850978035"
},
{
"name": "view",
"value": "normalised"
},
{
"name": "fileName",
"value": "hcus tas 2021"
}
],
"files": [
"4708-70f8-41d7d449-b522-2c3694b62e79",
"4c5c-d187-3e002a0b-a6a5-6c957685a0a8",
"4f18-c3cb-7d8daf29-acff-1d1b01b18030"
],
"numFiles": 3,
"availableFrom": "2022-08-17T12:09:12Z",
"availableTo": "2023-01-18T09:48:47Z",
"status": "READY",
"created": "2022-08-17T12:09:13Z",
"modified": "2022-08-17T12:09:14Z"
}
]
}
Best Answer
Please try this query:
jq-win64.exe ".value |= sort_by(.attributes | from_entries | .size | tonumber)"
To reverse the order, please use this query.
jq-win64.exe ".value |= sort_by(.attributes | from_entries | .size | tonumber) | .value |= reverse"
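For readers newer to jq: `from_entries` turns the `[{"name": ..., "value": ...}]` attribute array into a plain object, and `tonumber` makes the comparison numeric rather than lexicographic (string sorting would put "9" after "1000000"). The same logic, sketched in Python (the function name is illustrative):

```python
def sort_filesets_by_size(response, reverse=False):
    """Mirror of: .value |= sort_by(.attributes | from_entries | .size | tonumber).

    Sorts response['value'] by the numeric 'size' attribute.
    """
    def numeric_size(fileset):
        # from_entries equivalent: collapse [{"name": ..., "value": ...}] to a dict.
        attrs = {a['name']: a['value'] for a in fileset['attributes']}
        return int(attrs['size'])  # tonumber equivalent

    response['value'] = sorted(response['value'], key=numeric_size, reverse=reverse)
    return response
```

Pass `reverse=True` for largest-first, matching the `| .value |= reverse` variant of the query.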
Answers
Hello @Tony.Zhu_R ,
I think you are requesting FileSets:
https://{{RDP_HOST}}/file-store/{{RDP_VERSION}}/file-sets?bucket={{RDP_TH_Bucket}}&attributes=venue:{{CFS_VBD_VENUE}},view:normalised
which makes sorting on file size challenging.
Try requesting File details by Fileset id:
https://{{RDP_HOST}}/file-store/{{RDP_VERSION}}/files?filesetId={{fileSetId}}
The output should look like:
{
"value": [
{
"id": "4709-44ae-ae48bed4-8651-6ca30681bec8",
"filename": "LSE-2018-06-01-NORMALIZEDMP-Report-1-of-1.csv.gz",
"filesetId": "4000-ee57-9e402927-8e3e-a967e20d26f9",
"fileType": "File",
"description": "Merged data file",
"storageLocation": {
"url": "https://s3-us-east-1.amazonaws.com/a205143-use1-prod-results-vbd/normalised/LSE/2018-06-01/data/merged/LSE-2018-06-01-NORMALIZEDMP-Report-1-of-1.csv.gz",
"rolearn": "arn:aws:iam::259431915815:role/a205143-prod-push-mechanism-EdsCfsS3Access",
"@type": "s3"
},
"created": "2022-06-25T02:33:09Z",
"modified": "2022-06-25T02:33:09Z",
"href": "https://api.refinitiv.com/file-store/v1/files/4709-44ae-ae48bed4-8651-6ca30681bec8/stream",
"fileSizeInBytes": 904974,
"md5": "1d9d418dd96e7659c055b857c2bfc0a9"
},
{
"id": "497f-2e71-5796a24e-8162-917fef100c75",
"filename": "LSE-2018-06-01-NORMALIZEDMP-Data-1-of-1.csv.gz",
"filesetId": "4000-ee57-9e402927-8e3e-a967e20d26f9",
"fileType": "File",
"description": "Merged data file",
"storageLocation": {
"url": "https://s3-us-east-1.amazonaws.com/a205143-use1-prod-results-vbd/normalised/LSE/2018-06-01/data/merged/LSE-2018-06-01-NORMALIZEDMP-Data-1-of-1.csv.gz",
"rolearn": "arn:aws:iam::259431915815:role/a205143-prod-push-mechanism-EdsCfsS3Access",
"@type": "s3"
},
"created": "2022-06-25T02:33:09Z",
"modified": "2022-06-25T02:33:09Z",
"href": "https://api.refinitiv.com/file-store/v1/files/497f-2e71-5796a24e-8162-917fef100c75/stream",
"fileSizeInBytes": 1067179413,
"md5": "591c5c30619f26e046cfc8ea1857fa5a"
}
]
}
This output enables you to sort on fileSizeInBytes.
Will this work for your use case, or are you looking for something different?
The JSON I posted above was retrieved from
https://api.refinitiv.com/file-store/v1/file-sets?bucket=TICKHISTORY_CUSTOM_ARCHIVE
Normally it returns 25 filesetIds per response, but up to 100 when pageSize=100 is specified.
What I asked is how to use jq to sort those filesetIds by their 'size' attribute.
{
"name": "size",
"value": "2850978035"
},
For instance, I need to filter out the filesetIds with sizes of more than 100 GBytes. What filter should I use with jq if the total number of filesetIds runs into the hundreds?
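A hedged aside: the same from_entries/tonumber pipeline combined with jq's select should handle the 100 GB cut-off directly, e.g. ".value | map(select((.attributes | from_entries | .size | tonumber) > 100000000000))" (untested against the live service here). The equivalent logic, sketched in Python with an illustrative function name:

```python
def filesets_larger_than(response, threshold_bytes):
    """Return the filesets whose 'size' attribute exceeds threshold_bytes."""
    def numeric_size(fileset):
        # Collapse the [{"name": ..., "value": ...}] attribute list to a dict.
        attrs = {a['name']: a['value'] for a in fileset['attributes']}
        return int(attrs['size'])

    return [fs for fs in response['value'] if numeric_size(fs) > threshold_bytes]
```

This works the same way whether the response holds three filesets or several hundred, since the filter runs per element.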
Hello @Tony.Zhu_R ,
Let me try to explain what I meant, hoping the explanation will help clarify. I believe that the size you obtain from this call is not the size of a file, but rather the size of a fileset. For example:
https://{{RDP_HOST}}/file-store/{{RDP_VERSION}}/file-sets?bucket={{RDP_TH_Bucket}}&attributes=venue:{{CFS_VBD_VENUE}},view:normalised
Results in:
{
"value": [
{
"id": "4000-0fef-8c14bb0e-ac4a-9d9356699d2f",
"name": "LSE_normalised_2007-05-28",
"bucketName": "TICKHISTORY_VBD_UNLIMITED",
"packageId": "40eb-1240-f887f89f-b6e2-ff08ec2e2603",
"attributes": [
{
"name": "message_id",
"value": "f546fa27-7d1f-477a-89a6-ba2f3ca325e2"
},
{
"name": "view",
"value": "normalised"
},
{
"name": "depth",
"value": "UNLIMITED"
},
{
"name": "size",
"value": "23001"
},
{
"name": "releaseTime",
"value": "2007-05-29T04:00:00.000000000Z"
},
{
"name": "venue",
"value": "LSE"
},
{
"name": "md5",
"value": "565390e25c255a712e3d42ef790e07f0"
}
],
"files": [
"4414-fdfd-35d0e262-be9a-664b4ea09787",
"44d8-267c-5a5af2c9-adbe-9abe81c1c060"
],
"numFiles": 2,
Whereas:
https://{{RDP_HOST}}/file-store/{{RDP_VERSION}}/files?filesetId=4000-0fef-8c14bb0e-ac4a-9d9356699d2f
Results in:
"value": [
{
"id": "4414-fdfd-35d0e262-be9a-664b4ea09787",
"filename": "LSE-2007-05-28-NORMALIZEDMP-Report-1-of-1.csv.gz",
"filesetId": "4000-0fef-8c14bb0e-ac4a-9d9356699d2f",
"fileType": "File",
"description": "Merged data file",
"storageLocation": {
"url": "https://s3-us-east-1.amazonaws.com/a205143-use1-prod-results-vbd/normalised/LSE/2007-05-28/data/merged/LSE-2007-05-28-NORMALIZEDMP-Report-1-of-1.csv.gz",
"rolearn": "arn:aws:iam::259431915815:role/a205143-prod-push-mechanism-EdsCfsS3Access",
"@type": "s3"
},
"created": "2021-04-22T15:33:18Z",
"modified": "2021-04-22T15:33:18Z",
"href": "https://api.refinitiv.com/file-store/v1/files/4414-fdfd-35d0e262-be9a-664b4ea09787/stream",
"fileSizeInBytes": 22560,
"md5": "fda504b3a6efa56a34db998e2128b7f0"
},
{
"id": "44d8-267c-5a5af2c9-adbe-9abe81c1c060",
"filename": "LSE-2007-05-28-NORMALIZEDMP-Data-1-of-1.csv.gz",
"filesetId": "4000-0fef-8c14bb0e-ac4a-9d9356699d2f",
"fileType": "File",
"description": "Merged data file",
"storageLocation": {
"url": "https://s3-us-east-1.amazonaws.com/a205143-use1-prod-results-vbd/normalised/LSE/2007-05-28/data/merged/LSE-2007-05-28-NORMALIZEDMP-Data-1-of-1.csv.gz",
"rolearn": "arn:aws:iam::259431915815:role/a205143-prod-push-mechanism-EdsCfsS3Access",
"@type": "s3"
},
"created": "2021-04-22T15:33:18Z",
"modified": "2021-04-22T15:33:18Z",
"href": "https://api.refinitiv.com/file-store/v1/files/44d8-267c-5a5af2c9-adbe-9abe81c1c060/stream",
"fileSizeInBytes": 441,
"md5": "4da4c501efd0f042bb7d1c1b5130306c"
}
]
Note that 23001 = 22560 + 441.
Before starting to parse this: are you looking for sizes of files, or sizes of FileSets, which can potentially contain multiple files?
If you are looking for sizes of FileSets, you can paginate over your results, concatenate them into a dataframe, and derive fileset sizes.
If you are looking for sizes of files, you will need an extra step: paginate through the FileSets results from the bucket, then request File details either by FilesetID or by FileID. Both will contain the sizes of the discrete files. You could concatenate these results into a dataframe and obtain all file sizes from it.
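To make the relationship concrete: the fileset-level 'size' attribute should equal the sum of fileSizeInBytes over the fileset's files, as the 23001 = 22560 + 441 example above shows. A minimal sketch (the helper name is illustrative; the input mirrors the /files response format):

```python
def fileset_size_from_files(files_response):
    """Sum fileSizeInBytes across the file records returned for one fileset
    by /file-store/v1/files?filesetId=..."""
    return sum(f['fileSizeInBytes'] for f in files_response['value'])
```

For the LSE 2007-05-28 fileset shown earlier, the two files (22560 and 441 bytes) sum to the fileset's reported size of 23001.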
At this time, I am not seeing any files in TICKHISTORY_CUSTOM_ARCHIVE, but this is likely because I lack the permissions to access this bucket.
Does this explanation make sense, and is this what you observe on your side?
Hi @Tony.Zhu_R ,
I would like to mention another approach that you may find of interest: the RD Library for Python.
Please see example:
Section: Browse Tick History buckets.
I believe this will fulfill your requirement, except for pagination.
Additionally see example:
Section: Get FileSets with pagination
Thanks for sharing the above, @zoya faberov.
It's not what I asked for, though.
Given that the number of filesets in the RTH HMC bucket varies, I'll take 3 as an example with the following response.
{
"value": [
{
"id": "4030-fd1e-db11637d-b277-19bbd95c73ac",
"name": "salesforce-1-XXX_normalised_3C4C0EC0D2A643188BB5E9C1F8C329C0",
"bucketName": "TICKHISTORY_CUSTOM_ARCHIVE",
"packageId": "4a13-95e9-c37783d6-bba0-4851594758d0",
"attributes": [
{
"name": "size",
"value": "3634152545"
},
{
"name": "md5",
"value": "612d8ee7b77605bc342f148c16db2929"
},
{
"name": "message_id",
"value": "a5c136ba-4821-4898-a511-18bfa607c86f"
},
{
"name": "releaseTime",
"value": "2022-08-17T10:45:56Z"
},
{
"name": "fileName",
"value": "hmh tas 2015"
},
{
"name": "view",
"value": "normalised"
}
],
"files": [
"4745-b059-944935e3-a2c1-1f47875d39a1",
"4b21-037a-58a832d2-b486-ff831a2e223e",
"4bb1-dc27-81b6d485-8cbc-9854df8c651b"
],
"numFiles": 3,
"availableFrom": "2022-08-17T12:07:46Z",
"availableTo": "2023-01-18T08:13:10Z",
"status": "READY",
"created": "2022-08-17T12:07:46Z",
"modified": "2022-08-17T12:07:47Z"
},
{
"id": "4039-98c8-91ff43cc-95c8-646c7f51f25c",
"name": "salesforce-1-XXX_normalised_1D80CBA8BE694370B21F08080015732C",
"bucketName": "TICKHISTORY_CUSTOM_ARCHIVE",
"packageId": "4a13-95e9-c37783d6-bba0-4851594758d0",
"attributes": [
{
"name": "message_id",
"value": "94344e73-7d97-4c8a-a773-5246edfeaac3"
},
{
"name": "md5",
"value": "51c1b3242c42307a6b51383e7d9e2f65"
},
{
"name": "releaseTime",
"value": "2022-08-17T10:45:27Z"
},
{
"name": "fileName",
"value": "hsi tas 2017"
},
{
"name": "view",
"value": "normalised"
},
{
"name": "size",
"value": "11812238421"
}
],
"files": [
"42b7-420b-b2e9cb81-9340-15f12d2e1b52",
"43d9-8883-8c8649e2-a96c-eaab9b70bed3",
"4e8e-c1a1-60ce42ad-bcf2-9cfa55b8578a"
],
"numFiles": 3,
"availableFrom": "2022-08-17T12:05:46Z",
"availableTo": "2023-01-13T17:37:03Z",
"status": "READY",
"created": "2022-08-17T12:05:46Z",
"modified": "2022-08-17T12:05:47Z"
},
{
"id": "4061-5f65-2a3a0753-aab7-9f2abaec0016",
"name": "salesforce-1-XXX_normalised_88D37A2541B44CEEBE715A1300C4F175",
"bucketName": "TICKHISTORY_CUSTOM_ARCHIVE",
"packageId": "4a13-95e9-c37783d6-bba0-4851594758d0",
"attributes": [
{
"name": "releaseTime",
"value": "2022-08-17T10:45:56Z"
},
{
"name": "message_id",
"value": "b94f669b-a925-4cca-ba7c-911d2811223a"
},
{
"name": "size",
"value": "17227956"
},
{
"name": "view",
"value": "normalised"
},
{
"name": "md5",
"value": "c8db431e72af8ae0599bf9eead6a2a48"
},
{
"name": "fileName",
"value": "hmce tas 2005"
}
],
"files": [
"4842-db68-deb53105-ad1b-48c4126d38d4",
"4cd5-e0e2-543e2fd3-b8c8-0738f180eb24",
"4ed5-1054-3b985648-8301-682f0c61e3ba"
],
"numFiles": 3,
"availableFrom": "2022-08-17T12:08:20Z",
"availableTo": "2023-01-18T08:33:00Z",
"status": "READY",
"created": "2022-08-17T12:08:21Z",
"modified": "2022-08-17T12:08:21Z"
}
]
}
It has nothing to do with exact file sizes, whether of the fileset or of its constituent files. The question is what kind of filter to use with 'jq' to sort the above response by the attribute below.
{
"name": "size",
"value": "17227956"
},
For instance, I'd like to see all the filesets in order of size, largest to smallest, and vice versa.
So, any idea what kind of 'jq' filter can achieve it?
Hi @Tony.Zhu_R ,
A quick disclaimer: I am not permissioned for the same content in the custom archive bucket. I am testing with TICKHISTORY_VBD_UNLIMITED, which I believe is structured the same as the custom bucket, but if you see any discrepancies in the format on your side, please let me know.
What I would like to discuss is not the only approach to sorting this result, but possibly the simplest: putting the result into a dataframe. An added benefit is that any further transformations you need can then be done on the dataframe as well, which keeps them simple.
df = pd.json_normalize(jsonFullResp['value'])
Next, I create a new column 'size' containing the size value, converted to an integer so that the column sorts numerically rather than as strings.
for index, row in df.iterrows():
    size = [sub['value'] for sub in df['attributes'][index] if sub['name'] == 'size'][0]
    df.loc[index, 'size'] = int(size)
My result looks like this (screenshot omitted):
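A hedged aside: the same column can be built without the row-by-row loop by using `apply` on the 'attributes' column. A sketch, with illustrative inline data standing in for the real response:

```python
import pandas as pd

# Two illustrative filesets in the same shape as the API response.
data = [
    {"id": "a", "attributes": [{"name": "size", "value": "3634152545"}]},
    {"id": "b", "attributes": [{"name": "size", "value": "17227956"}]},
]
df = pd.json_normalize(data)

def size_attr(attrs):
    # Pull the 'size' entry out of the name/value attribute list.
    return int(next(a['value'] for a in attrs if a['name'] == 'size'))

df['size'] = df['attributes'].apply(size_attr)
df2 = df.sort_values(by=['size'])
print(df2[['id', 'size']])
```

The result is the same sorted frame; `apply` simply avoids mutating the frame while iterating it.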
And now I can sort it:
df2 = df.sort_values(by=['size'])
df2
Getting the sorted dataframe (screenshot omitted).
Please let us know if this is what you are looking to do?
Hi @Tony.Zhu_R ,
I cannot test with your exact result at the moment, but I would expect larger result sets not to be available in a single request, instead requiring iterative pagination via a skip token.
If that is the case, we can concatenate each new page of results to the total results dataframe, and sort by integer size only once the complete result has been obtained.
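To sketch that loop: accumulate each page's 'value' entries until no continuation token remains, then sort once at the end. `fetch_page` below is a hypothetical stand-in for the real RDP request, since the exact skip-token mechanics depend on the API response:

```python
def collect_all_filesets(fetch_page):
    """Gather 'value' entries across pages.

    fetch_page(token) is a hypothetical stand-in for the real RDP call;
    it must return (page_dict, next_token), with next_token None on the
    final page.
    """
    filesets, token = [], None
    while True:
        page, token = fetch_page(token)
        filesets.extend(page['value'])
        if token is None:
            return filesets
```

Once complete, the combined list can be sorted a single time using a numeric-size key.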
Let us know if this is something that you observe to be required, and would like to discuss more?
As raised initially, I'm looking for a filter to run with jq to sort the filesets by 'size'.
Can you use the above sample of 3 filesets and sort them by size with jq instead of pandas?