MongoDB and data parsing in Python

A new challenge: connect to MongoDB in Python and parse a result column containing a list of dictionaries.

The first part was easy: the “pymongo” module provides a MongoClient connector.

So this is what you need to do to connect to Mongo in Python:

My system’s setup:

  • macOS Sierra version 10.12.1
  • python 2.7.12
  • pymongo 3.3.0
  • pandas 0.18.1

The process to connect and query MongoDB using pymongo:

# We import pymongo to connect to MongoDB and pandas for data processing
from pymongo import MongoClient
import pandas as pd


# Create a connector to MongoDB
def connectorMongo(host, port, username, password, db):
    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        con = MongoClient(mongo_uri)
    else:
        # Note: when host and port are passed directly, port must be an integer
        con = MongoClient(host, port)
    return con[db]


# Query a collection and load the result into a pandas DataFrame
def queryMongo(host, port, username, password, db, collection, query, no_id=True):
    db = connectorMongo(host=host, port=port, username=username, password=password, db=db)
    cursor = db[collection].find(query)
    # Now parse the return from the query to be used later using pandas
    df = pd.DataFrame(list(cursor))
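A side note: the no_id flag is accepted by queryMongo but never used. Presumably it is meant to drop MongoDB’s internal “_id” column from the dataframe; a minimal sketch of that behavior (my assumption, not part of the original code) would be:

# Sketch (assumption): drop MongoDB's internal '_id' column when no_id=True
if no_id and '_id' in df.columns:
    del df['_id']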

Here is a sample of my df; the data I am interested in resides under the column “pinkData”, which is a list of dictionaries (quick refresher on dictionaries and lists here):

[Screenshot: sample of df showing the “pinkData” column]

A closer look at the first element of ‘pinkData’:

[Screenshots: first element of “pinkData”]

So how do we clean this up? After a few Google searches, I found this post by Wouter Overmeire to be closest to what I wanted to achieve. So I iterate over each row element of ‘pinkData’, clean the element, and then combine everything into a new dataframe that I label “df_pinkDataParsed”:

df_pinkDataParsed = []
row_index = []
frames = []
for index, row in df['pinkData'].iteritems():
    row_index.append(index)
    frames.append(pd.DataFrame(row))
df_pinkDataParsed = pd.concat(frames, keys=row_index)
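As an aside, the same result can likely be obtained a bit more compactly by handing pd.concat a dictionary of frames keyed by the original index (a sketch, assuming every ‘pinkData’ entry is a list of dictionaries):

# Sketch: one DataFrame per row, concatenated and keyed by the original index
# (note: pandas sorts dict keys, which matches the default integer index order)
df_pinkDataParsed = pd.concat(
    {index: pd.DataFrame(row) for index, row in df['pinkData'].iteritems()}
)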

# This is what I get:

[Screenshot: the resulting df_pinkDataParsed]


# My queryMongo function now looks like:


def queryMongo(host, port, username, password, db, collection, query, no_id=True):
    db = connectorMongo(host=host, port=port, username=username, password=password, db=db)
    cursor = db[collection].find(query)
    # Now parse the return from the query to be used later using pandas
    df = pd.DataFrame(list(cursor))

    # Iterate over each 'pinkData' element, expand it, and concatenate the results
    df_pinkDataParsed = []
    row_index = []
    frames = []
    for index, row in df['pinkData'].iteritems():
        row_index.append(index)
        frames.append(pd.DataFrame(row))
    df_pinkDataParsed = pd.concat(frames, keys=row_index)
    return df_pinkDataParsed

# Now just create a connection using your data and confirm that it works.
db = 'pinkintello'
collection = 'Nonpinkstuff'
host = 'localhost'
port = 'someNumber'
username = 'myusername'
password = 'mypassword'

# Hypothetically, pull all records where "somepoint" equals "somePinkNumber"
query = {'somepoint':'somePinkNumber'}

mytest = queryMongo(host, port, username, password, db, collection, query, no_id=True)
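To confirm that it works, a quick inspection of the returned dataframe (just a generic sanity check, not specific to this dataset) could look like:

# Sanity check: dimensions and first few rows of the parsed result
print(mytest.shape)
print(mytest.head())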

Closing notes


In this example, my focus is to connect to Mongo and parse the list of dictionaries under a specific column, “pinkData”. Ideally, you would want to merge the parsed dataframe “df_pinkDataParsed” back into the original dataframe “df”. In that case, you will need to be careful that you have matching indices. In my case, I chose to use “pinkTime” as the new index while joining my dataframes.

[Screenshots: dataframes re-indexed on “pinkTime” before merging]

# Finally, your merged dataframe will be:

df_clean = df.append(df_pinkDataParsed)
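Since the goal stated above is to join on matching indices using “pinkTime”, an index-based merge might look like the following sketch (assuming both dataframes carry a “pinkTime” column, which is not shown in the original code):

# Sketch (assumption): re-index both dataframes on 'pinkTime', then join them
df_clean = df.set_index('pinkTime').join(df_pinkDataParsed.set_index('pinkTime'), rsuffix='_parsed')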

This code works well for my purpose, but I am wondering if there is a way to optimize it. Any suggestions?

 
