Transforming and ingesting complex JSON data with Python by Attila Tóth

Transform and insert complex JSON files into a relational database - without Pandas

Description

Python is a very popular language for writing data pipelines, and for processing data in general. In this talk, I will show you practical solutions to common data processing challenges when working with complex data files in Python. I will cover lessons learned from transforming and ingesting large, deeply nested JSON files into a relational database like PostgreSQL. Even better: using only Python dictionaries and a database connector library (Psycopg2).
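To make that concrete, here is a minimal sketch of the dictionaries-plus-Psycopg2 approach. The file name, connection string, table, and column names are placeholders I made up for illustration, not the schema from the talk:

```python
import json

import psycopg2

# Load the JSON file into plain Python dictionaries -- no Pandas needed.
# (Hypothetical file name and structure: a list of flat records.)
with open("records.json") as f:
    records = json.load(f)

conn = psycopg2.connect("dbname=example user=example")
with conn, conn.cursor() as cur:  # `with conn` commits on success
    for record in records:
        # Parameterized query: psycopg2 handles quoting and escaping.
        cur.execute(
            "INSERT INTO measurements (id, value) VALUES (%s, %s)",
            (record["id"], record["value"]),
        )
conn.close()
```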

During my work, I often need to download and ingest data from different sources. Routinely, I use Python to download the data, transform it, then insert it into a database. But sometimes I come across challenges that make me scratch my head. I’d like to explain how to efficiently organize, transform, and eventually ingest complex JSON files into a PostgreSQL (or any other) relational database. I’m going to use real-world examples and make it easy for everyone to replicate the same solutions in other projects.

Making sense of complex JSON files can be difficult, especially if you are dealing with loosely structured data or data that you don’t know much about. In this session, you will see a blueprint for downloading, transforming, and (batch) ingesting data from an API with Python. You will see how to tackle some of the challenges of working with deeply nested data structures and a database schema with many foreign keys, all without relying on large external dependencies like Pandas.
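As a taste of the flattening challenge, here is one minimal way to collapse a nested dictionary into a flat one whose keys can map onto table columns. The function name, separator, and sample data are illustrative assumptions, not the talk's code:

```python
def flatten(d, parent_key="", sep="_"):
    """Recursively flatten a nested dict into a single-level dict.

    {"a": {"b": 1}} becomes {"a_b": 1}, which maps cleanly to columns.
    (A hypothetical helper; real data usually also needs handling
    for lists and type cleanup.)
    """
    flat = {}
    for key, value in d.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep=sep))
        else:
            flat[new_key] = value
    return flat


print(flatten({"sensor": {"location": {"lat": 47.5, "lon": 19.0}}, "value": 3}))
# {'sensor_location_lat': 47.5, 'sensor_location_lon': 19.0, 'value': 3}
```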

Outline:

1. Introduction & quick agenda
   - Talk briefly about the project that inspired this talk
2. Going through an example challenge
   - API returns large and complex JSON files
   - We need to process them with Python and store them in PostgreSQL
   - Breaking down the steps
3. Showing an example of complex, messy data (JSON)
   - What makes a file like this hard to process?
4. Showing what the end result should look like
   - Cleaned data (expected format, etc.)
   - Database schema
   - Relationships between data fields
5. Process the data with Python & overcome challenges (a code sketch follows this outline)
   - Cleaning individual data fields
   - Flattening deeply nested dictionaries
   - Inserting tightly related data into multiple relational tables with foreign keys using Psycopg2
   - Batch inserting
   - Creating temporary tables to make it easier
   - What to do if you are not a domain expert? (e.g. you don’t know what some of the fields mean in the JSON but still need to process them)
6. Next steps
   - Where you can find me
   - Check out the source code which was used for this presentation
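For the batch-inserting, temporary-table, and foreign-key items in step 5, here is a hedged sketch of how those pieces can fit together with Psycopg2. The tables (sources, readings), columns, and sample records are hypothetical; the techniques are standard patterns: RETURNING to obtain the parent row's primary key, psycopg2.extras.execute_values to batch rows, and a temp table as a staging area:

```python
import psycopg2
from psycopg2.extras import execute_values

# Hypothetical cleaned records -- in the real pipeline these come
# out of the cleaning and flattening steps above.
cleaned_records = [
    {"station": "BUD", "temp": 21.4},
    {"station": "VIE", "temp": 19.8},
]

conn = psycopg2.connect("dbname=example user=example")
with conn, conn.cursor() as cur:
    # Parent row first: RETURNING hands back the generated primary key
    # that every child row needs as its foreign key.
    cur.execute(
        "INSERT INTO sources (name) VALUES (%s) RETURNING id",
        ("weather-api",),
    )
    source_id = cur.fetchone()[0]

    # Stage the child rows in a temporary table...
    cur.execute(
        "CREATE TEMP TABLE readings_stage (station text, temp real) "
        "ON COMMIT DROP"
    )
    execute_values(
        cur,
        "INSERT INTO readings_stage (station, temp) VALUES %s",
        [(r["station"], r["temp"]) for r in cleaned_records],
    )

    # ...then move them into the real table in one statement,
    # attaching the foreign key as we go.
    cur.execute(
        "INSERT INTO readings (source_id, station, temp) "
        "SELECT %s, station, temp FROM readings_stage",
        (source_id,),
    )
conn.close()
```

execute_values folds many rows into a single INSERT statement, avoiding one network round trip per row, and ON COMMIT DROP cleans up the staging table automatically when the transaction commits.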