Reading large JSON files in Python

For one of my Python-based projects, I have been using a really large JSON file as a hacky, slow, really-really-bad-but-works-for-me-hence-whatever database. Before the script executes, it reads all the data into a single dict:

import json

with open('hacky_database.json', 'r') as f:
    hacky_database = json.load(f)

After it finishes, it writes the data back:

with open('hacky_database.json', 'w') as f:
    json.dump(hacky_database, f, indent=2, sort_keys=True)

It worked okay until the file grew to a bit over 2 GB; then it started raising this exception on my laptop:

Traceback (most recent call last):
  File "my-serious-script.py", line 39, in <module>
    hacky_database = json.load(f)
  File "/usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 296, in load
    return loads(fp.read(),
OSError: [Errno 22] Invalid argument

It turned out that, for some reason, the standard json module chokes on big files: as the traceback shows, json.load() pulls in the whole file with a single fp.read() call, and that call is what fails. Not having any time or desire to rewrite the script to use a proper database, I decided to hack a drop-in solution. Quick googling revealed a simple solution using the beautiful ijson library. The minimal change to the reading code looked like this:

import ijson

...

with open('hacky_database.json', 'r') as f:
    hacky_database = next(ijson.items(f, ''))

It just worked (although super slowly; I should probably experiment with different ijson backends later).
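If the slowness matters, here is a minimal sketch of another workaround that stays with the standard json module. It assumes the failure really comes from the single huge fp.read() call (some platforms, reportedly macOS with older Python 3 builds, reject one read of more than 2 GB), so it reads the file in smaller pieces instead. The helper name load_json_chunked and the 1 GiB chunk size are made up for illustration and are not part of the original script:

```python
import json


# Hypothetical helper (not from the original script): read the file in
# pieces so that no single read() call asks for more than chunk_size bytes.
def load_json_chunked(path, chunk_size=1 << 30):
    chunks = []
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            chunks.append(chunk)
    # json.loads accepts bytes (UTF-8, UTF-16 or UTF-32) since Python 3.6
    return json.loads(b''.join(chunks))
```

Usage would be hacky_database = load_json_chunked('hacky_database.json'). The whole file still ends up in memory, just as with json.load(), so this only sidesteps the read-size limit and does nothing for memory use.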

However, it turned out that the old database-saving code stopped working and failed with the following exception:

Traceback (most recent call last):
  File "my-serious-script.py", line 72, in <module>
    json.dump(hacky_database, f, indent=2, sort_keys=True)
...
TypeError: Object of type 'Decimal' is not JSON serializable

The ijson package uses Decimal instead of the floats used by the standard json library, so I can’t easily dump() what it gives me! With the help of this issue, I was able to change my DB-reading code into a real drop-in replacement that worked:

with open('hacky_database.json', 'r') as f:
    # Convert the value of every 'number' event to float before building the object
    json_events = map(
        lambda e: (e[0], e[1], float(e[2])) if e[1] == 'number' else e,
        ijson.parse(f)
    )
    hacky_database = next(ijson.common.items(json_events, ''))
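An alternative sketch, in case touching the reading code is undesirable: leave the plain ijson.items() call alone and teach json.dump() about Decimal through its default= hook, which the standard json module calls for any object it cannot serialize. The function name decimal_default and the example dict are made up for illustration:

```python
import json
from decimal import Decimal


# Hypothetical hook (the name is an assumption): json calls it only for
# objects it does not know how to serialize, such as Decimal.
def decimal_default(obj):
    if isinstance(obj, Decimal):
        return float(obj)
    raise TypeError(
        f"Object of type {type(obj).__name__} is not JSON serializable"
    )


# Made-up stand-in for data that came back from ijson.
example = {"price": Decimal("1.5"), "count": 3}
text = json.dumps(example, indent=2, sort_keys=True, default=decimal_default)
```

With this approach integers stay integers, whereas calling float() on every 'number' event also turns integer values into floats if they arrive under that event type. In the saving code it would amount to passing default=decimal_default to the existing json.dump() call.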

Laziness preserved 🙂
