AWS File Extraction not working properly

I am trying to get the extracted file for my scheduled task as a stream and read it into Python through the REST API, using the following headers:

headers = {'Authorization': 'Token %s' % auth, 'Prefer': 'respond-async', 'Accept-Charset': 'UTF-8', 'Content-Type': 'application/json',"X-Direct-Download":"true"}

When I do not include X-Direct-Download, I get only part of the data compared to the CSV file I download directly from the web GUI. When I add it, however, I get output like the following:

'\x1f�\x08\x00\x00\x00\x00\x00\x00\x00�][s\x1c�n~?��u�\x1aw��\x17��\x12m1�(\x15I%�_T�M�RŖO,�R��\x01fv�;3��Pܥ\x1d\x17�ds���h4\x1a@��/���V�\x7f��������\x7f�yr�\u19db�\u05ef�'

which is clearly not correct. Is there any way to get the correct, full file as a stream in the AWS environment?

full code:

result = simple_get("/Extractions/ExtractedFiles('%s')/$value" % file_id, auth)

where:

import requests

def simple_get(endpoint, auth):
    # url_base (the API base URL) is defined elsewhere in the script
    headers = {'Authorization': 'Token %s' % auth, 'Prefer': 'respond-async', 'Accept-Charset': 'UTF-8', 'Content-Type': 'application/json',"X-Direct-Download":"true"}
    r = requests.get(url_base + endpoint, headers=headers)
    return r

Best Answer

  • @kf2449, in addition to the response by Jirapongse:

    What you observe is actually gzipped data:

    '\x1f�\x08\x00\x00\x00\x00\x00\x00\x00�][s\x1c�n~?��u�\x1aw��\x17��\x12m1�(\x15I%�_T�M�RŖO,�R��\x01fv�;3��Pܥ\x1d\x17�ds���h4\x1a@��/���V�\x7f��������\x7f�yr�\u19db�\u05ef�'

    When handling the data, you are getting different behaviors due to the response headers, which differ between TRTH and AWS downloads.

    TRTH returns the following content-related headers:

    Content-Encoding:gzip
    Content-Type:text/plain

    AWS returns different headers:

    Content-Disposition:attachment; filename=_OnD_0x05de99857e5b3036.csv.gz
    Content-Type:application/gzip

    Your application reacts differently to these headers: many HTTP libraries decide automatically whether to decompress the data based on the response headers, which is why you see different data formats. It is best to set the appropriate headers and parameters explicitly to avoid this behavior. If you follow the links sent by Jirapongse, you should be able to download the entire file in the correct format.

    You can also look at the set of Python code samples available under the downloads tab and explained in this document. One of the samples (TRTH_Get_Latest_Schedule_Files) should be of help.
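
As a rough illustration of the point about gzipped data, here is a minimal sketch that decodes a response body whether or not the HTTP library has already decompressed it. It checks for the gzip magic bytes (0x1f 0x8b, the same "\x1f" that starts the output in the question) and decompresses only when they are present. The helper name decode_extraction_body and the sample CSV are made up for this example and are not part of the TRTH API or its samples.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_extraction_body(raw_bytes, encoding="utf-8"):
    """Return text from a body that may or may not still be gzipped.

    Hypothetical helper for illustration only.
    """
    if raw_bytes[:2] == GZIP_MAGIC:
        # Body is still compressed (e.g. served as application/gzip)
        raw_bytes = gzip.decompress(raw_bytes)
    return raw_bytes.decode(encoding)


# Self-contained check with made-up sample data instead of a real download
sample_csv = "RIC,Price\nABC.N,1.23\n"
gzipped = gzip.compress(sample_csv.encode("utf-8"))

assert decode_extraction_body(gzipped) == sample_csv
assert decode_extraction_body(sample_csv.encode("utf-8")) == sample_csv
```

With the code from the question, this would presumably be called as decode_extraction_body(r.content). r.content holds the response bytes, whereas r.text has already tried to interpret them as text, which is what produces the unreadable string above. This loads the whole file into memory, so for large extractions the streaming approach in the linked samples is likely the better fit.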

Answers