-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathindex.html
339 lines (212 loc) · 16.4 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
<!DOCTYPE html>
<!--[if IEMobile 7 ]><html class="no-js iem7"><![endif]-->
<!--[if lt IE 9]><html class="no-js lte-ie8"><![endif]-->
<!--[if (gt IE 8)|(gt IEMobile 7)|!(IEMobile)|!(IE)]><!--><html class="no-js" lang="en"><!--<![endif]-->
<head>
<meta charset="utf-8">
<title>Michael Moss</title>
<meta name="author" content="Michael Moss">
<meta name="description" content="">
<meta name="keywords" content="">
<!-- http://t.co/dKP3o1e -->
<meta name="HandheldFriendly" content="True">
<meta name="MobileOptimized" content="320">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="canonical" href="http://michaelmoss.github.io">
<link href="/favicon.png" rel="icon">
<link href='http://fonts.googleapis.com/css?family=Quicksand:300,400' rel='stylesheet' type='text/css'>
<link href='http://fonts.googleapis.com/css?family=Open+Sans:400,300' rel='stylesheet' type='text/css'>
<link href="/stylesheets/screen.css" media="screen, projection" rel="stylesheet" type="text/css">
<link href="/atom.xml" rel="alternate" title="Michael Moss" type="application/atom+xml">
<script src="/js/jquery.js"></script>
<script src="/js/bootstrap-collapse.js"></script>
<script src="/js/modernizr-2.0.js"></script>
<script src="/js/octopress.js" type="text/javascript"></script>
<script src="/js/application.js"></script>
<!--Fonts from Google"s Web font directory at http://google.com/webfonts -->
<link href="http://fonts.googleapis.com/css?family=PT+Serif:regular,italic,bold,bolditalic" rel="stylesheet" type="text/css">
<link href="http://fonts.googleapis.com/css?family=PT+Sans:regular,italic,bold,bolditalic" rel="stylesheet" type="text/css">
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-38629380-1']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
</head>
<body >
<div class="navbar navbar-inverse navbar-static-top">
<div class="navbar-inner">
<div class="container">
<a class="btn btn-navbar" data-toggle="collapse" data-target=".navbar-responsive-collapse">
<span class="fui-menu-24"></span>
</a>
<div class="nav-collapse collapse navbar-responsive-collapse" style="height:0;">
<ul class="nav">
<li class="active"><a href="/index.html">Home</a></li>
<li ><a href="/books.html">Books</a></li>
</ul>
<ul class="nav pull-right">
<li><a href="http://github.com/michaelmoss" title="Github Profile"><i class="icon-github-sign social-navbar"></i></a></li>
<li><a href="http://linkedin.com/in/michaelrmoss" title="Linkedin Profile"><i class="icon-linkedin-sign social-navbar"></i></a></li>
<li><a href="http://twitter.com/michaelmoss" title="Twitter Profile"><i class="icon-twitter-sign social-navbar"></i></a></li>
<li><a href="http://plus.google.com/102882958270067938384" title="Google+ Profile"><i class="icon-google-plus-sign social-navbar"></i></a></li>
<li><a href="mailto:[email protected]" title="Email"><i class="icon-envelope-alt social-navbar"></i></a></li>
</ul>
</div>
</div>
</div>
</div>
<div class="container" id="main">
<div class="row-fluid">
<div id="content">
<div class="jumbotron">
<div class="container">
Michael Moss
<h3 class="tagline">
Startups, Software, Data, Leadership
</h3>
</div>
</div>
<div class="blog-index">
<article>
<div class="row-fluid">
<div class="span2 post-meta">
<h5 class="date-time">
<i class="icon-calendar-empty"></i> <time datetime="2011-07-10T22:47:00-04:00" pubdate data-updated="true">Jul 10<span>th</span>, 2011</time></h5>
<div class="row-fluid">
</div>
<div class="row-fluid">
<a href="/blog/categories/nosql/"><span class="badge">NoSQL</span></a>
<a href="/blog/categories/scalability/"><span class="badge">Scalability</span></a>
</div>
</div>
<div class="span10 post-container">
<h1 class="link"><a href="/blog/2011/07/10/user-history-data-cassandra-vs-mongodb/">User History Data: Cassandra vs. MongoDB</a></h1>
<p>Hello, there. Today I want to write about a new project I am working on, and why I picked Cassandra over MongoDB. This post assumes you have a basic understanding of each technology. I also want to preface this post by saying that while I really like Cassandra for this use case, we have been happily and successfully using MongoDB since I transitioned most of our user data storage over from Postgres (and Hibernate hell), but this is a topic for another post.</p>
<p>This new project is about storing music ‘listen’ history across our web app and mobile clients. That is, whenever one of our users listens to a song, we want to track who played the song, what song was played, the context of where the song was played from (a playlist, a radio station, an album), and when the song was played. In Java, the object might look like this:</p>
<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
</pre></td><td class='code'><pre><code class='java'><span class='line'><span class="kd">public</span> <span class="kd">class</span> <span class="nc">ListenHistoryEvent</span>
</span><span class='line'><span class="o">{</span>
</span><span class='line'> <span class="kt">int</span> <span class="n">userId</span><span class="o">;</span>
</span><span class='line'> <span class="kt">int</span> <span class="n">songId</span><span class="o">;</span>
</span><span class='line'> <span class="n">Enum</span> <span class="n">contextType</span><span class="o">;</span>
</span><span class='line'> <span class="n">String</span> <span class="n">contextId</span><span class="o">;</span>
</span><span class='line'> <span class="n">Date</span> <span class="n">datePlayed</span><span class="o">;</span>
</span><span class='line'><span class="o">}</span>
</span></code></pre></td></tr></table></div></figure>
<p>Users need to be able to retrieve their own song history, and should be able to see the song history of their friends, but only 10-20 items at a time, in order. There will never be a case, where we need to pull multiple users’ history from storage at once (range scans).</p>
<p>Considering we are a music service, it should not be surprising that our #1 API call is for streaming music, so whatever storage engine we choose will have to sustain a constant, high write rate. As for the reads, I expect a much lower rate since I would assume that users won’t be checking their listen history after each listen (or perhaps even after every 10th or 100th listen). With that, I’m going into this project designing for a 1:10 read/write ratio.</p>
<p>Let’s see how Cassandra and MongoDB compare for implementing this project and requirements.</p>
<h3>CAP and Overall Design</h3>
<p>I think a great place to start is with the CAP theorem. Cassandra is classified as an AP system, where MongoDB is classified as a CP system. For this project, I need a system that is always available to sustain a high volume of writes, and I would rather gain overall performance even it if means potentially sacrificing consistency on reads. Let’s face it, 99% of users will likely be checking their listen history every 10-100 songs, if they do at all, but in the unlikely event that they are glued to their listen history page, and refreshing it maniacally as they listen to music, I would be OK if sometimes they didn’t see the latest data (think Facebook news feed) as long as the storage tier is able to scale all of those writes.Cassandra and MongoDB are both ‘P’ systems because they scale by partitioning and distributing load across multiple physical servers. MongoDB is ‘C’ leaning, because assuming you do shard a DB/collection, there is always a specific and designated master node for each _id (document) being written. If that master (shard) goes down, and there is no replica to replace it, you’ve got nowhere to write that data.</p>
<p>Cassandra is slightly different, and hence it’s ‘A’ leanings. While there are also designated nodes for each key (row) that is being written, none of these nodes need to actually be up for the cluster to accept writes for it. With hinted handoff, the cluster will always be available to accept writes.</p>
<p>This is what I am looking for.</p>
<h3>Persistence</h3>
<p>Not only does Cassandra offer better write availability, it also shines at write performance, which is critical for such a write-heavy project. As of MongoDB version 1.8, it still has that global write lock which would impact performance not only on the collection being written to, but other collections as well across the cluster. Cassandra’s BigTable inspired data model which writes in an append-only fashion really shines here with much (much) lower contention.</p>
<h3>Storage Format and Memory Management</h3>
<p>Let’s transition a bit and talk about schema for a moment. If I were to structure this in MongoDB, I’d have two obvious choices:</p>
<ul>
<li>Each listen event is a document
<ul>
<li>Pros
<ul>
<li>Simple to program to</li>
</ul>
</li>
<li>Cons
<ul>
<li>Paginated reading of listen events would require a compound index with userId and datePlayed.</li>
<li>That is a lot of overhead for data that is not often read.</li>
</ul>
</li>
</ul>
</li>
<li>Each user is a document and has embedded listen events
<ul>
<li>Pros
<ul>
<li>Organized – I prefer this structure as it keeps things together by user and leverages the benefits of the document format nicely.</li>
<li>Simple $slice operator can be used to handle pagination</li>
</ul>
</li>
<li>Cons
<ul>
<li>16MB doc limit could create an issue for users who listen to a lot of music. Assuming 100 bytes/embedded listen event, that’s about 160K songs/user.</li>
<li>Managing this array will create some code complexity until the 10Gen folks implement pushing to fixed sized arrays. I also wouldn’t want to transfer 16MB of document over the wire to the app server to implement this logic.</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>So, while it’s definitely possible to implement this with Mongo in several ways each with their pros and cons, I think Cassandra’s model is a better fit for this project. I’ve chosen what some call the ‘wide row’ model where each user is a row, and each song listen event is a column in a single Column Family with the column key as the listening time and the column value is the event payload. It would look as follows in JSON format:</p>
<figure class='code'><figcaption><span></span></figcaption><div class="highlight"><table><tr><td class="gutter"><pre class="line-numbers"><span class='line-number'>1</span>
<span class='line-number'>2</span>
<span class='line-number'>3</span>
<span class='line-number'>4</span>
<span class='line-number'>5</span>
<span class='line-number'>6</span>
<span class='line-number'>7</span>
<span class='line-number'>8</span>
</pre></td><td class='code'><pre><code class='javascript'><span class='line'><span class="nx">SongHistory</span> <span class="o">=</span>
</span><span class='line'><span class="p">{</span> <span class="c1">// ColumnFamily</span>
</span><span class='line'><span class="nx">userId1</span><span class="o">:</span><span class="p">{</span> <span class="c1">// Row in the SongHistory CF, using RandomPartitioner</span>
</span><span class='line'> <span class="nx">TimeUUID1</span><span class="o">:</span> <span class="p">[</span><span class="nx">Song</span> <span class="nx">History</span> <span class="nx">Data</span><span class="p">],</span> <span class="c1">//Each column has TimeUUID as key, and are sorted by key</span>
</span><span class='line'> <span class="nx">TimeUUID2</span><span class="o">:</span> <span class="p">[</span><span class="nx">Song</span> <span class="nx">History</span> <span class="nx">Data</span><span class="p">]</span> <span class="c1">//Each column value is a song listen },</span>
</span><span class='line'><span class="nx">userId2</span><span class="o">:</span> <span class="p">{</span><span class="c1">// Another row</span>
</span><span class='line'> <span class="nx">TimeUUID1</span><span class="o">:</span> <span class="p">[</span><span class="nx">Song</span> <span class="nx">History</span> <span class="nx">Data</span><span class="p">]</span> <span class="p">}</span>
</span><span class='line'><span class="p">,}</span>
</span></code></pre></td></tr></table></div></figure>
<p>Rather than having MongoDB’s 16MB document limit, Cassandra has a liberal 2GB column limit per row, which gives me quite a bit more headroom. Using Cassandra’s Random Partitioner will also give me great dispersion on my writes across my users (using MD5 on the row key, which is the userId). The column keys will be sorted using the TimeUUIDType comparator, which will keep my columns in order. Getting this level of dispersion using MongoDB sharding would likely require a separate special field on the document for this.</p>
<p>The big win here is that with this schema, Cassandra need not deserialize the entire column to read or write song events which will save a lot of server resources across millions of users.</p>
<p>Another point to quickly mention on this topic is how MongoDB and Cassandra each handle caching, another area where I think Cassandra shines for this project.</p>
<p>Because MongoDB uses mmap files, it will be caching all of our active listening users song history in memory. Since most of our users probably won’t be reading this data often, if at all, it would likely be a wasted resource, and if there is more important data on the same MongoDB cluster, but which gets written to less often, it would get pushed out from all of these writes. This is speculation and would have to be measured. From what I’ve read, Cassandra has a more granular caching structure which would cache individual rows and individual columns on an LRU basis. Given that users will first be reading their most recent writes, I think this is a good model for this project. As I load test this and do read simulations, I’ll be very interested in learning more how to tune my caching strategy on Cassandra.</p>
<h3>Conclusion</h3>
<p>Well, I hope this post was helpful to you. As I get further along in the project, I’ll post some updates. If you have any questions, I’d love to hear from you.</p>
</div>
</div>
</article>
<div class="pagination">
</div>
</div>
</div>
</div>
<div class="row-fluid">
<footer class="footer-page" role="contentinfo">
<p>
Copyright © 2014 - Michael Moss -
<span class="credit">Powered by <a href="http://octopress.org">Octopress</a></span> - Theme by <a href="http://alexgaribay.com">Alex Garibay</a>
</p>
</footer>
</div>
</div>
<script type="text/javascript">
var disqus_shortname = 'michaelmossgh';
var disqus_script = 'count.js';
(function () {
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
dsq.src = 'http://' + disqus_shortname + '.disqus.com/' + disqus_script;
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
}());
</script>
<script type="text/javascript">
(function(){
var twitterWidgets = document.createElement('script');
twitterWidgets.type = 'text/javascript';
twitterWidgets.async = true;
twitterWidgets.src = 'http://platform.twitter.com/widgets.js';
document.getElementsByTagName('head')[0].appendChild(twitterWidgets);
})();
</script>
</body>
</html>