Yelp

Parsing Yelp Dataset in PostgreSQL #

The dataset contains information about businesses registered at Yelp.com, users, reviews, tips, and check-ins.

The data is serialized in JSON, but not in the form you might expect: each line of each file is a separate JSON object (the JSON Lines layout). As a result, loading a whole file with a standard JSON decoder in Python fails. One way around this is to read the file into memory, decode each line separately with a JSON decoder, append each decoded object to a Python list, and then load the data from the list into the database. The code for this would be as follows.

import psycopg2
import json
 
# Open the JSON file (the loop below reads the whole file into memory)
f = open('yelp_academic_dataset_tip.json', 'r')

# Initialize an empty list for temporarily holding JSON objects as list
data = []

# Append each line/JSON object in a list
for line in f:
    data.append(json.loads(line))

# Open a connection to the database
db_conn = psycopg2.connect("dbname='database_name' host='server_address' user='username' password='password'")

# Insert the data from the list into the database, committing once at the end
cur = db_conn.cursor()
for row in data:
    cur.execute("insert into Tip (user_id, business_id, tip_text, tip_date, likes, type) values (%s, %s, %s, %s, %s, %s)",
                (row['user_id'], row['business_id'], row['text'], row['date'], row['likes'], row['type']))
db_conn.commit()

# Close database connection
db_conn.close()

# Close JSON file
f.close()

The above approach, however, has a major shortcoming: it loads the entire JSON file into memory and builds a list from it before anything reaches the database. For some files in the dataset (several are more than 1 GB in size), this consumes a large amount of memory. A better way to get around this is to read the JSON file line by line, decoding and inserting each record as we go. This can be achieved with a simple change in the parser script.

# Open a JSON file from the dataset and read it line by line
with open('yelp_academic_dataset_review.json') as fileobject:
    for line in fileobject:
        data = json.loads(line)
        # ... insert `data` into the database here, one record at a time

The above approach also doesn’t require explicitly closing the file; the with statement takes care of that.
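To see the line-by-line pattern end to end, here is a self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for Postgres so it can run anywhere (sqlite3 uses `?` placeholders where psycopg2 uses `%s`); the records and column names are illustrative, not the real dataset.

```python
import io
import json
import sqlite3

# Two tip records in the Yelp one-JSON-object-per-line layout
raw = "\n".join(json.dumps(r) for r in [
    {"user_id": "u1", "business_id": "b1", "text": "Great coffee", "likes": 2},
    {"user_id": "u2", "business_id": "b1", "text": "Try the bagels", "likes": 0},
])

# In-memory SQLite database standing in for the Postgres connection
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table Tip (user_id text, business_id text, tip_text text, likes integer)")

# Stream the records line by line: only one decoded object is held at a time
for line in io.StringIO(raw):
    data = json.loads(line)
    cur.execute("insert into Tip values (?, ?, ?, ?)",
                (data['user_id'], data['business_id'], data['text'], data['likes']))
conn.commit()

print(cur.execute("select count(*) from Tip").fetchone()[0])  # prints 2
```

With psycopg2 the loop body is identical apart from the placeholder style and the connection setup.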

The parser code can be found on GitHub.

Setting Up PostgreSQL and Python on Ubuntu 16.04 LTS #

I have set up PostgreSQL and the Python modules for parsing the dataset on both local and remote machines. In both cases, the base system was the same.

OS: Ubuntu 16.04 LTS
Database: PostgreSQL 9.5
Language: Python 3.5
Database management tool: pgAdmin 3 or 4

Before starting, I am assuming you have installed Ubuntu as a virtual machine, as a standalone local machine, or as a remote server. Once you’re done, you can move on to installing PostgreSQL on Ubuntu.

First, refresh the apt software package list.

$sudo apt-get update

Now we can install the PostgreSQL package, along with the -contrib package. Since we are installing from a pre-packaged distribution of PostgreSQL, the -contrib package is needed to install additional PostgreSQL modules.

$sudo apt-get install postgresql postgresql-contrib

If the installation process is successful, it will create a system user account named postgres. We can use the postgres account to test the installation. For that, we need to switch to the user postgres from the terminal.

$sudo su postgres

Now use the following command to access PostgreSQL from the terminal.

$psql

You’ll see the psql prompt on your terminal, something like postgres=#.

We could keep using the account postgres to access and work on the database. However, as a rule of thumb, it is better to create a separate superuser and access the database with that account. To create a superuser account, we need to get out of the Postgres front-end and back to the system account postgres. Use the following command to quit the front-end.

#\q

Now that we are back in the postgres system account, use the following command to create a superuser in interactive mode. When prompted, enter yelpguru as the role name and answer yes to the superuser question.

$createuser --interactive

We have created a superuser account yelpguru. Now we can set a password for the account. Enter the Postgres front-end again with psql and run the following.

#ALTER USER yelpguru PASSWORD 'your_password';
#\q

Switch to your system user account from the postgres account.

$exit

If we want to access this database remotely, we need to set up Postgres for remote access using pgAdmin or any other database management tool. We need to change two files to achieve this - postgresql.conf and pg_hba.conf - to declare the allowed IP address range. For this case, I’ll allow all IP addresses to access Postgres remotely.

First, open the file postgresql.conf using the following command.

$sudo nano /etc/postgresql/9.5/main/postgresql.conf

Look for the line #listen_addresses = 'localhost' and change it to listen_addresses = '*'.

Save the changes. Next, open the file pg_hba.conf.

$sudo nano /etc/postgresql/9.5/main/pg_hba.conf

Add the line host all all 0.0.0.0/0 md5 in the appropriate block.
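For reference, the IPv4 connections section of pg_hba.conf would then look something like this (the 127.0.0.1/32 line is Ubuntu's default local entry; only the 0.0.0.0/0 line is new):

```
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    all             all             127.0.0.1/32            md5
host    all             all             0.0.0.0/0               md5
```

Allowing 0.0.0.0/0 opens the server to every IPv4 address, which is fine for a throwaway test machine but should be narrowed to your own address range on anything reachable from the internet.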

Save the changes and exit. Restart Postgres from terminal for the changes to take effect.

$sudo service postgresql restart

Now install pgAdmin on your system. If you’re using Postgres locally, you can install it from the terminal.

$sudo apt-get install pgadmin3

You can now log in to Postgres through pgAdmin using the superuser credentials.

The next step is to set up the Python modules for parsing the dataset. I am using Python 3 for the project, along with the psycopg2 module to connect to Postgres. But before we can install psycopg2 for Python 3, we need to install pip3 and libpq-dev.

$sudo apt-get install python3-pip
$sudo apt-get install libpq-dev

Now we can install psycopg2 for Python 3.

$sudo pip3 install psycopg2

Access python3 from the terminal, and import psycopg2 to check that it has been installed properly. I have run into permission problems when installing psycopg2 on a local machine using sudo. If you run into such problems, switch to root using sudo su and run pip3 install psycopg2.

If you have been able to set up all these tools, you’re ready to parse the data into a Postgres database using the parser I shared on GitHub.

© 2024 Manoj Pravakar