A new challenge: connect to MongoDB in Python, and parse a resulting column whose data is a list of dictionaries.
The first part was easy: the “pymongo” module provides a MongoClient connector.
So this is what you need to do to connect to Mongo in Python:
My system’s setup:
- macOS Sierra version 10.12.1
- python 2.7.12
- pymongo 3.3.0
- pandas 0.18.1
The process to connect and query MongoDB using pymongo:
# We import pymongo to connect to MongoDB and pandas for data processing
from pymongo import MongoClient
import pandas as pd

# Create a connector to MongoDB
def connectorMongo(host, port, username, password, db):
    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        con = MongoClient(mongo_uri)
    else:
        con = MongoClient(host, port)
    return con[db]

def queryMongo(host, port, username, password, db, collection, query, no_id=True):
    db = connectorMongo(host=host, port=port, username=username, password=password, db=db)
    cursor = db[collection].find(query)
    # Now parse the query result into a pandas dataframe to be used later
    df = pd.DataFrame(list(cursor))
    # Drop MongoDB's internal '_id' column if requested
    if no_id:
        del df['_id']
    return df
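Before querying anything, it helps to confirm the connection works. A minimal sketch, assuming a local, unauthenticated MongoDB and the 'pinkintello' database used later:

# Quick connectivity check: list the collections in the target database
db = connectorMongo(host='localhost', port=27017, username=None, password=None, db='pinkintello')
print(db.collection_names())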
Here is a sample of my df. The data I am interested in resides under the column “pinkData”, which is a list of dictionaries (quick refresher on dictionaries and lists here):
A closer look at the first element of ‘pinkData’:
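Since I can't share the actual records, here is a hypothetical sketch of that first element (the 'pinkTime' and 'value' field names are made up for illustration):

# Hypothetical first element of 'pinkData': a list of dictionaries
df['pinkData'][0]
# [{'pinkTime': '2016-10-01', 'value': 1},
#  {'pinkTime': '2016-10-02', 'value': 2}]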
So how do we clean this up? After a few Google searches, I found this post by Wouter Overmeire to be closest to what I wanted to achieve. So, I iterate over each row of ‘pinkData’, turn each element into a dataframe, then concatenate them all into a new dataframe that I label “df_pinkDataParsed”:
row_index = []
frames = []
for index, row in df['pinkData'].iteritems():
    row_index.append(index)
    frames.append(pd.DataFrame(row))
df_pinkDataParsed = pd.concat(frames, keys=row_index)
This is what I get:
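In short, a dataframe with a two-level index: the outer level is the original row index of df, and the inner level numbers the dictionaries within each list. A tiny, self-contained sketch with the made-up fields from above shows the same shape:

import pandas as pd

# Two rows, each holding a list of dictionaries (made-up data)
s = pd.Series([[{'pinkTime': 't1', 'value': 1}, {'pinkTime': 't2', 'value': 2}],
               [{'pinkTime': 't3', 'value': 5}]])
frames = [pd.DataFrame(row) for row in s]
parsed = pd.concat(frames, keys=s.index)
print(parsed)
#     pinkTime  value
# 0 0       t1      1
#   1       t2      2
# 1 0       t3      5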
My queryMongo function now looks like:
def queryMongo(host, port, username, password, db, collection, query, no_id=True):
    db = connectorMongo(host=host, port=port, username=username, password=password, db=db)
    cursor = db[collection].find(query)
    # Now parse the query result into a pandas dataframe
    df = pd.DataFrame(list(cursor))
    # Drop MongoDB's internal '_id' column if requested
    if no_id:
        del df['_id']
    # Expand the list of dictionaries in 'pinkData' into its own dataframe
    row_index = []
    frames = []
    for index, row in df['pinkData'].iteritems():
        row_index.append(index)
        frames.append(pd.DataFrame(row))
    df_pinkDataParsed = pd.concat(frames, keys=row_index)
    return df_pinkDataParsed
# Now just create a connection using your data and confirm that it works.
db = 'pinkintello'
collection = 'Nonpinkstuff'
host = 'localhost'
port = 27017  # MongoDB's default port; MongoClient expects an integer
username = 'myusername'
password = 'mypassword'
# Hypothetically, pull all records where 'somepoint' equals 'somePinkNumber'
query = {'somepoint': 'somePinkNumber'}
mytest = queryMongo(host, port, username, password, db, collection, query, no_id=True)
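A quick look at what comes back (the contents are hypothetical, following the made-up fields from earlier):

# Inspect the first rows and the two-level index of the parsed dataframe
print(mytest.head())
print(mytest.index[:5])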
Closing notes
In this example, my focus is to connect to Mongo and parse the list of dictionaries under a specific column, “pinkData”. Ideally, you would want to merge the parsed dataframe “df_pinkDataParsed” back into the original dataframe “df”. In that case, you need to be careful that you have matching indices. In my case, I chose to use “pinkTime” as the new index while joining my dataframes.
# Finally, merge the two dataframes. df.append() would only stack rows, so
# instead join on the shared 'pinkTime' index (assuming both frames carry it)
df_clean = df.set_index('pinkTime').join(df_pinkDataParsed.set_index('pinkTime'))
This code works well for my purpose, but I am wondering if there is a way to optimize it. Any suggestions?