Skip to content
Deniz edited this page Dec 18, 2013 · 30 revisions

MongoDB Notes

##Prelude

###Installing MongoDB

brew install mongodb

All the executables are in the bin folder.

We are going to focus on mongod execture, which is the server, and mongo, which is the shell.

To start the server, we need to create a dbpath first.

sudo tcsh
mkdir /data/
mkdir /data/db
chmod 777 /data/db

We should be able to run the server now.

./mongod

Similarly, we start the shell

./mongo

###Install PyMongo

[sudo] pip install pymongo

##What is MongoDB and JSON?

MongoDB is a non-relational database that stores JSON documents. The JSON format looks much closer to the data types in your programs. It is schemaless i.e. two documents might have different schemas.

How is this different than SQL? In SQL world your database schema is agnostic to your application's access pattern. The SQL schema should reflect the third normal form. In NoSQL world, by contrast, the most important aspect is to keep to data in a way that is conducive to the application using the data.

http://docs.mongodb.org/v2.2/reference/sql-comparison/

Two basic datatypes in JSON: arrays and dictionaries. Documents can be created in many arbitrary ways.

{'course':'Data Science', 'students':['Charlie Brown','Sally Brown','Schroeder']}

http://www.json.org/

###Hello World, pymongo. Inserting a document.

getting_started.py

##CRUD

grades.js

mongoimport --db test --collection grades < grades.js

mongo_crud.py

CRUD (Create, Read, Update, Delete) is a common acronym for describing the kinds of manipulations on database records. CRUD operations exist as methods or functions in programming language APIs. MongoDB does not have its own query language.

###Update -Update method needs to return a match. With no match, update does have no effect.

Two methods to update a document.

  1. Whosale Replacement.

  2. Partial Update with '$set' operator

In contrast, you can pass the upsert method that will create a new document even if there is no match. A simple upsert inserts the condition on the right. If you use the $set operator, it will insert the first comment as well.

If there are multiple documents matching the query, mongo picks a single one. To update multiple documents, you need to specify 'multi' operator explicitly.

//db.grades.find({'student_id':0})
db.grades.update({'student_id':0},{'$set':{'SAT':600}},{multi: true})
//db.grades.update({},{'$unset':{'SAT':1}},multi=True)

##Performance

mongodb_index.py

create_foo.js

Database performance is driven by pretty much the same thing every database system is driven by, whether an index is used to perform the task or not.

Index

###Explain

create_scores.js

Explain method tells you exactly how the index is used.

http://docs.mongodb.org/manual/reference/method/cursor.explain/

//No index
db.students.find({student_id:1}).explain()
//Put an index
db.students.ensureIndex({student_id:1})
//With index
db.students.find({student_id:1}).explain()

Parameters:

Cursor: Gives the index used.
isMultiKey: Array index or not. 
n: number of documents returned
nscanned: Number of scanned objects
nscannedobjects: Number of scanned objects and indices.

You want n and nscanned as close as possible!

indexBounds: Shows the bounds to look up in the index.
indexOnly: If you need to go to the actual collection for the answer, including '_id'

If you only the field you want is in the index, this will be True (called covered index)

###Multi-Key Index

MongoDB allows you to do an index on an array. But you can only have a single array within an index.

Unique Indexing

You can force the index to be unique.

db.peanuts.drop()
db.peanuts.insert({'character':'Charlie Brown'})
db.peanuts.insert({'character':'Charlie Brown'})
db.peanuts.insert({'character':'Snoopy})
db.peanuts.ensureIndex({'stuff':1},{unique=true})

What if you want to create an index and remove the duplicate keys? There is no way to control which document will be removed, however.

db.peanuts.ensureIndex({'character':1},{unique=true},{dropDups=true}))

Note that indices take up space and it is important to have them in memory for performance reasons.

db.students.stats()
db.students.getIndices()
db.students.totalIndexsize()

###Geospatial Indexes

theater.json

Indexes will allow you to find locations.

db.location.ensureIndex({'location': '2d' })
//Find the 10 arts organizations closest to us. 
db.location.find({'location':{'$near':[40.736877,-73.991557]}}).limit(10)

You can use geospatial index that considers the curvature of the earth.

//Collection needs to specified explicitly. 
//Specify location and radiance
db.runCommand({'geoNear':'location','near':[40.736877,-73.991557],'spherical':true, maxDistance: 0.05/3959})
db.location.find( { location: { $geoWithin: { $centerSphere: [ [ 40.736877,-73.991557]  , 0.05 / 3959 ] } } } )
//No Distance
db.runCommand({'geoNear':'location','near':[40.736877,-73.991557],'spherical':true, distanceMultiplier:3959})

##Aggregation

Aggregation framework in MongoDB is analogous to 'GROUP BY' statement in SQL and uses a pipeline concept similar to UNIX. It is a powerful tool to summarize the data. The limitation is that the document returned cant be larger than 16MB.

products.js

simple_aggregation.js

###Aggregation Pipeline Note that each function can be used multiple times at different stages.

$project - select which keys you are interested in. 1-1. If it sees 10 docs, produces 10.
$match - filter early on in the pipeline. n:1
$group - aggregation function. n:1
$sort - sort 1-1
$skip - skip forward only make sense after sort. n:1
$limit - limit. 
$unwind - produce a document for every sub document. normalizes the data.

Aggregation Expressions include $sum, $avg, $min, $max, $push, $addToset, $first, $last


$sum operator

db.products.aggregate([{$group:{_id: {'manufacturer':'$manufacturer'}, 'sum_prices':{$sum:'$price'}}}])

$push operator

db.products.aggregate([{$group:{_id: {'manufacturer':'$manufacturer'}, 'items':{$addToSet:'$name'}}}])

zip.json

$project is used to reshape documents. You can use simple functions like $toUpper, $toLower, $add, $multiply

db.zip.aggregate([{$project:{_id:0, city:{$toLower:"$city"}, pop:1, state:1, zip:"$_id"}}])

$match performs a filter similar to 'find'.

db.zip.aggregate([
{$match: {state: 'NY'}},
{$group: {_id:'$city',
		'population': {$sum: '$pop'},
		'zip_codes': {$addToSet: '$_id'}
		}},
{$project: {_id:0,
	'city': '$_id',
	'population':1,
	'zip_codes':1
}}
])

$sort could go before or after group phase. The method could be a real memory-hog and

db.zip.aggregate([
{$match: {state: 'NY'}},

{$group: {_id:'$city',
		'population': {$sum: '$pop'},
		'zip_codes': {$addToSet: '$_id'}
		}},

{$project: {_id:0,
	'city': '$_id',
	'population':1,
}},
{$sort: {population:-1}},
{$limit: 5}
])

$unwind flattens the arrays.

db.inventory.aggregate([
	{$unwind:'$sizes'},
	{$unwind: '$colors'},
	{$group:{ 
		'_id': {'size': '$sizes', 'color': '$colors'},
		'count': {$sum : 1}
	}
}])

Please calculate the average population of cities in California and New York taken together with populations over 25,000. For this problem, assume that a city name that appears in more than one state represents two separate cities.

db.zip.aggregate([ 
	{$match:{state:{$in:['NY', 'CA']}}}, 
	{$group:{_id : {"state" : "$state", "city" : "$city"}, 
		"population":{$sum : "$pop"}}}, 
	{$match : {population : {$gt:25000}}}, 
	{$group:{_id:null, avg:{$avg:'$population'}}} ])

mongo_aggregate.py

dropbox links: twitter user and mention collections