How to import json files of metadata of Research reports?

My customer who contracted for both Research API and Research bulk feed(= historical data of real-time research reports) is now trying to import a metadata file of the Research bulk feed to python using the following scripts, but can't import it successfully.

Can you please advise on how to import the metadata file to python successfully?


Scripts:

import pandas as pd

df = pd.read_json("ctb4614_050_metadata.json", lines=True)


Returen:

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-87067cc29fae> in <module>
1 # df = pd.read_json("ctb4614_050_metadata.json", orient="split")
----> 2 df = pd.read_json("ctb4614_050_metadata.json", lines=True)
3 # df = pd.read_json("ctb4614_050_metadata.json", lines=True, orient="values")
4 # df = pd.read_json("ctb4614_050_metadata.zip", compression='infer')

/opt/anaconda3/lib/python3.7/site-packages/pandas/io/json/_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression)
590 return json_reader
591
--> 592 result = json_reader.read()
593 if should_close:
594 try:

/opt/anaconda3/lib/python3.7/site-packages/pandas/io/json/_json.py in read(self)
713 elif self.lines:
714 data = ensure_str(self.data)
--> 715 obj = self._get_object_parser(self._combine_lines(data.split("\n")))
716 else:
717 obj = self._get_object_parser(self.data)

/opt/anaconda3/lib/python3.7/site-packages/pandas/io/json/_json.py in _get_object_parser(self, json)
737 obj = None
738 if typ == "frame":
--> 739 obj = FrameParser(json, **kwargs).parse()
740
741 if typ == "series" or obj is None:

/opt/anaconda3/lib/python3.7/site-packages/pandas/io/json/_json.py in parse(self)
847
848 else:
--> 849 self._parse_no_numpy()
850
851 if self.obj is None:

/opt/anaconda3/lib/python3.7/site-packages/pandas/io/json/_json.py in _parse_no_numpy(self)
1091 if orient == "columns":
1092 self.obj = DataFrame(
-> 1093 loads(json, precise_float=self.precise_float), dtype=None
1094 )
1095 elif orient == "split":

ValueError: Expected object or value

Best Answer

  • wasin.w
    wasin.w admin
    Answer ✓

    Hello @hiroko.yamaguchi

    The ctb4614_050_metadata.json file is not a valid JSON message format file. It contains a lot of "docID" nodes without array element (example {"docID": 1}, {"docID": 2}, {"docID":3}).

    The application needs to add "[" and "]" to set all "docID" nodes in the same array element

     [{"docID": 1}, {"docID": 2}, {"docID":3}]

    I have tried to add "[" at the first of the file and "]" at the end of file, then load with pd.read_json("ctb4614_050_metadata.json") and it works fine.

    image

Answers

  • I could replicate it! Thank you!

    On the other hand, they're provided with lots of such a file of metadata for all brokers they're entitled to, therefore, is there any way to add the "[" and "]" programmatically?

  • Hi @hiroko.yamaguchi
    I think the first step is to verify if those metadata json files are the valid JSON message or not , then fix the file/content based on how errors or invalid format.


    The Python application can use json.load() function to validate JSON file, but the file needs to be manual fix if it is invalid format.