Get "Max retries exceeded" error or "NoneType" error when doing parallel parsing of bundle data

[Screenshots attached: 截圖-2023-07-12-141631.png, 截圖-2023-07-12-141808.png]

As shown in the title and the attached screenshots, I'm retrieving the latest data for several fields (related to price, dividends, and splits) for a number of tickers (the symbol list is a dataframe with datastream_id and tickers, containing about 6,000 symbols).

I'm using get_bundle_data in DatastreamPy, together with ProcessPoolExecutor to fetch data in parallel. Unfortunately, I get "Max retries exceeded with url" and "NoneType" errors.


Best Answer

  • Jirapongse
    Answer ✓

    @hsheng

    Thank you for reaching out to us.

    It looks like a connection or SSL (Secure Sockets Layer) issue.

    1689145276827.png

    Please also check the limitations of the GetDataBundle request in the DSWS user stats and limits document.

    1689147172397.png

    Did the problem happen when using simple requests or non-parallel requests?
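For intermittent "Max retries exceeded" errors, one common mitigation is to retry the failing call with exponential backoff instead of failing the whole batch. This is a minimal sketch, not part of DatastreamPy itself; the lambda in the usage comment stands in for whichever call raises the error:

```python
import time
import random

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on an exception, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            # sleep roughly base_delay * 2^attempt, with random jitter
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# usage (hypothetical): result = with_retries(lambda: ds.get_bundle_data([request]))
```

Each worker process can wrap its request this way, so a transient connection drop costs one short pause rather than a failed run.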



Answers

  • Thank you for replying.

    The problem didn't happen when using non-parallel requests, but it did happen when using simple requests (with parallel parsing) at the beginning.

    I'm very curious about this because the problem happened only "sometimes", not every time.

  • Moreover, I think the way I get bundle data with parallelization is similar to the attached sample, which I received from your team in the past.

    DSWS Python Sample.html.zip

  • @hsheng

    I can run the code properly with 3000 instruments.

    import tqdm
    import numpy
    import pandas as pd
    import DatastreamPy

    from concurrent.futures import ProcessPoolExecutor


    # The instrument lists are HTML tables saved with an .xls extension,
    # so they are read with read_html and concatenated into one dataframe
    df1 = pd.read_html('list.xls')
    df2 = pd.read_html('list1.xls')
    df3 = pd.read_html('list2.xls')
    df4 = pd.read_html('list3.xls')
    df = pd.concat([df1[0], df2[0], df3[0], df4[0]], ignore_index=True)


    ds = DatastreamPy.Datastream(username="username", password="password")


    # Per-request cap on data points (instruments x datatypes)
    max_data_points = 100


    request_fields = ["UPO", "UPH", "UPL", "UP", "X(UVO)*1000",
                      "AF", "PO", "PH", "PL", "P", "UDD", "DD", "DPS",
                      "DY", "AND", "PYD", "XDD", "SPLDTE", "SPLFCT",
                      "DT", "IBPDTE"]
    run_symbols = df["Symbol"].tolist()

    # Split the symbols into batches small enough that
    # len(batch) * len(request_fields) stays within max_data_points
    instruments_per_batch = numpy.floor(max_data_points / len(request_fields))
    batches = numpy.array_split(run_symbols, int(numpy.ceil(len(run_symbols) / instruments_per_batch)))


    total_requests = [ds.post_user_request(','.join(c), list(request_fields), kind=0,
                                           start='2023-07-11', end='2023-07-11')
                      for c in batches]

    # Fetch one bundle, drop duplicate (Instrument, Datatype) rows,
    # then pivot to one row per instrument and one column per datatype
    def get_ds_data_pivot(per_request):
        return (ds.get_bundle_data([per_request])[0]
                  .drop_duplicates(subset=['Instrument', 'Datatype'], keep='first')
                  .pivot(index='Instrument', columns='Datatype', values='Value'))


    def get_symbol_list_and_check_time(request):
        return get_ds_data_pivot(request)


    def main():
        with ProcessPoolExecutor(max_workers=5) as executor:
            result = list(tqdm.tqdm(executor.map(get_symbol_list_and_check_time, total_requests),
                                    desc='getting latest data', total=len(total_requests)))


    if __name__ == '__main__':
        main()

    I can't run the code in Jupyter Notebook, so I ran it in the console.
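Process pools commonly fail inside Jupyter because functions defined in a notebook cannot be pickled for the child processes. Since these requests are I/O-bound, a ThreadPoolExecutor is a reasonable alternative that does work in a notebook; this sketch uses a hypothetical fetch function in place of the real get_ds_data_pivot:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(request):
    # hypothetical stand-in for the real per-request bundle fetch
    return request * 2

requests = [1, 2, 3, 4]

# Threads share the interpreter, so no pickling of functions is needed
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fetch, requests))
```

The map call keeps results in the same order as the submitted requests, matching the process-pool version above.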

    1689238047725.png

    Please check the version of DatastreamPy that you are using. I am using 1.0.12.
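For reference, the batch size in the code above follows directly from the data-point limit: with max_data_points = 100 and 21 request fields, floor(100 / 21) = 4 instruments fit in each request, so roughly 6,000 symbols need ceil(6000 / 4) = 1500 requests. A worked example of just that arithmetic (the symbol names are placeholders):

```python
import math
import numpy

max_data_points = 100   # per-request cap on instruments x datatypes
n_fields = 21           # number of entries in request_fields above
n_symbols = 6000        # approximate size of the symbol list

# how many instruments fit in one request, and how many requests result
instruments_per_batch = max_data_points // n_fields       # 4
n_batches = math.ceil(n_symbols / instruments_per_batch)  # 1500

symbols = [f"SYM{i}" for i in range(n_symbols)]
batches = numpy.array_split(symbols, n_batches)

print(instruments_per_batch, n_batches, len(batches[0]))
```

numpy.array_split also tolerates a symbol count that does not divide evenly, which is why the original code uses it rather than slicing by hand.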