-
Notifications
You must be signed in to change notification settings - Fork 162
Home
These notes are for the master branch. Notes for the (legacy, but currently better supported) 1.0 branch are here
Note: current nanocube server implementation only works on 64-bit architectures
Suppose we want to create a nanocube for the following table in csv format:
latitude,longitude,time,device 44.00124456,-73.74023438,2013-02-21T12:49,android 42.0090331,-74.79492188,2013-04-11T13:58,iphone 45.68366959,-94.04296875,2013-02-28T17:33,iphone 37.97076194,-85.69335938,2013-04-17T05:04,android ...
Imagine for instance that this table stores check-in reports of a social website like foursquare or the defunct brightkite. In addition to location and time we have a categorical variable indicating which device was used to report the check-in. The goal we have in mind is to create a nanocube back-end for this dataset that will respond (at interactive rates) the queries needed on a front-end that has pannable and zoomable map to visualize a heatmap of counts of reports in the different regions and different times. We also want the user to filter these count reports by device.
To load this csv file into a nanocube we first need to convert it into
a file that is compatible with the nanocube tool called ncdmp
. We call
these files "dump" files and they use the extension ".dmp". A dump
file has always a text header followed by the end-of-header marker
(i.e. two line feeds: 0x0a, 0x0a) followed by data records. When we
have a nanocube ready .dmp file this is how we start a nanocube:
cat datafile.dmp | ncserve
Unfortunately not all .dmp files are ready to be streamed into a
nanocube. Only a subset of the possible .dmp files are compatible with
the nanocube startup program ncserve
. The tool ncdmp
is used to filter a general .dmp file
into one that is compatible with ncserve
. So the more common
way of starting a nanocube takes the general form of:
cat datafile.dmp | ncdmp (config parameters) | ncserve
In the ncdmp
command above is where we commit to the final aspect of
the nanocube we want to build. For example, latitude and longitude
floating-point values can be stored into a .dmp file, but are not
ready to be streamed into ncserve
. We need to define a binning
structure for space in, for example, the form of a k-levels
quadtree. In the (config parameters)
above is where we can
declare such binning scheme and k.
So, let's see how to load a nanocube from the .csv file above. We first generate a general .dmp file by executing
python csv2dmp.py
using the python script below:
import time
import struct
# dictionary to map device names to numeric values
# (<255 means we can fit it into a 1-byte categorical variable)
device_map = { "iphone":0, "android":1, "windows":2 }
# write header
output_file = open("example.dmp","w")
output_file.write("name: example\n")
output_file.write("encoding: binary\n")
output_file.write("field: latitude,float\n")
output_file.write("field: longitude,float\n")
output_file.write("field: checkin_time,uint64\n")
output_file.write("field: checkin_device,uint8\n")
for (k,v) in device_map.iteritems():
output_file.write("valname: checkin_device," + str(v) + "," + str(k) +"\n")
output_file.write("\n")
# write records based on the csv lines
input_file = open("example.csv","r")
line_no = 0
while True:
line_no += 1
line = input_file.readline().strip()
if not line:
break
if line_no == 1:
continue
tokens = line.split(",")
latitude = float(tokens[0])
longitude = float(tokens[1])
checkin_time = int(time.mktime(time.strptime(tokens[2],"%Y-%m-%dT%H:%M")))
checkin_device = int(device_map[tokens[3]])
# print latitude, longitude, checkin_time, checkin_device
# little endian: 4 bytes float, 4 bytes float, 8 bytes unix time, 1 byte device
data = struct.pack("<ffQB", latitude, longitude, checkin_time, checkin_device)
output_file.write(data)
input_file.close()
output_file.close()
Now we have a binary .dmp file called example.dmp. Note again that this .dmp file is not ncserve
compatible.
To start a nanocube with 25-levels quadtree with hourly resolution to count
the number of checkins we run the command.
cat example.dmp | \
ncdmp --encoding=b \
dim-dmq=pos,latitude,longitude,25 \
dim-cat=device,checkin_device \
dim-tbin=time,checkin_time,2013_1h,2 \
var-one=count,4 | ncserve
Note that for this command to work there needs to be a program named
nc_q25_c1_u2_u4
in the folder pointed by the environment variable
NANOCUBE_BIN
. This folder is where all the specific nanocube
programs are stored. If such a file doesn't exist one can set the
environment variable and run the script ncbuild
export NANOCUBE_BIN=$HOME/local/bin
ncbuild q25 c1 u2 u4
Once the nanocube start to get new data it also starts listening to some port which should be reported on the console. Through that port we can query the nanocube right away. For example if you open a browser and put the url (for port 29512)
http://localhost:29512/query
you get a json answer
{ "levels":[ ], "root":{ "addr":"0", "value":50.000000 } }
and the number 50.0 corresponds to the total number of records we have in the nanocube. If we want to drill down in device we can query
http://localhost:29512/query/@device=255+1
and get a json answer
{ "levels":[ "device" ], "root":{ "addr":"0", "children":[ { "addr":"0", "value":26.000000 }, { "addr":"1", "value":22.000000 }, { "addr":"2", "value":2.000000 } ] } }
The numbers 26, 22 and 2 are respectively the number of iphone,
android and windows checkins. Note the 255+1
in the query is an ugly
syntax for now, but the idea of this traversal rule is to indicate a
dimension address plus a number of levels down we want. The symbol @
indicates that we want to drilldown in that dimension (every result is
going to have a device
component in this case). The number 255 is
the address of the root node of a flat categorical tree of 1-byte or
c1
dimension. If we had a c2
dimension the root would be 65535. We
can use the same syntax of (address)+(offset)
to query a 256x256
pixels tile for the whole world. Here is the query
http://localhost:29512/query/@device=0+8
and here is the json result
{ "levels":[ "pos" ], "root":{ "addr":"0", "children":[ { "addr":"200000130000002d", "value":1.000000 }, { "addr":"20000013e000004b", "value":1.000000 }, { "addr":"200000136000002d", "value":1.000000 }, { "addr":"200000144000004b", "value":1.000000 }, { "addr":"20000013e000004c", "value":1.000000 }, { "addr":"2000001600000026", "value":1.000000 }, { "addr":"2000001380000029", "value":2.000000 }, { "addr":"20000010a0000051", "value":1.000000 }, { "addr":"2000001280000036", "value":1.000000 }, { "addr":"20000012e000003c", "value":1.000000 }, { "addr":"20000012a0000032", "value":1.000000 }, { "addr":"20000013a0000037", "value":1.000000 }, { "addr":"20000012e0000031", "value":1.000000 }, { "addr":"200000132000003e", "value":1.000000 }, { "addr":"20000013e000003b", "value":1.000000 }, { "addr":"2000001440000034", "value":1.000000 }, { "addr":"20000014c0000031", "value":1.000000 }, { "addr":"200000144000002c", "value":1.000000 }, { "addr":"200000140000004d", "value":1.000000 }, { "addr":"200000142000003d", "value":1.000000 }, { "addr":"200000142000003e", "value":1.000000 }, { "addr":"200000148000003d", "value":1.000000 }, { "addr":"200000140000004a", "value":1.000000 }, { "addr":"2000001380000032", "value":1.000000 }, { "addr":"2000001200000036", "value":1.000000 }, { "addr":"200000140000004c", "value":1.000000 }, { "addr":"20000013a0000028", "value":1.000000 }, { "addr":"200000146000004f", "value":1.000000 }, { "addr":"20000014c0000056", "value":1.000000 }, { "addr":"2000001580000054", "value":1.000000 }, { "addr":"2000001540000040", "value":1.000000 }, { "addr":"20000014c000002d", "value":1.000000 }, { "addr":"2000001300000044", "value":1.000000 }, { "addr":"200000150000004b", "value":1.000000 }, { "addr":"2000001300000036", "value":1.000000 }, { "addr":"20000013a000003f", "value":1.000000 }, { "addr":"2000001460000036", "value":1.000000 }, { "addr":"20000013a0000043", "value":1.000000 }, { "addr":"2000001340000047", "value":1.000000 }, { "addr":"2000001180000050", "value":1.000000 }, { "addr":"200000130000002c", "value":1.000000 }, { "addr":"20000014c0000058", "value":2.000000 }, { "addr":"20000011c0000049", "value":1.000000 }, { "addr":"20000015c000003d", "value":1.000000 }, { "addr":"2000001520000028", "value":1.000000 }, { "addr":"20000015c0000031", "value":1.000000 }, { "addr":"200000136000002a", "value":1.000000 }, { "addr":"2000001180000039", "value":1.000000 } ] } }
The addresses are 64-bit numbers written in hex that can be decoded into a quadtree node address with the following code snippet
x = address & 0x1FFFFFFF y = (address >> 29) & 0x1FFFFFFF level = (address >> 58)
In other words the quadtrees are limited to 29 levels (which is enough
for google maps like pan-and-zoom maps). The zero on 0+8
is the
address of the root node in any quadtree q1
... q29
dimension
(note that it is different from the root of the categorical trees c1
... c8
).
If we want only an image of the top-left quadrant of the world we can write
http://localhost:29512/query/@pos=qaddr(0,1,1)+8/device=0+0
and here is the result (all these values should sum up to 26)
{ "levels":[ "pos" ], "root":{ "addr":"0", "children":[ { "addr":"240000242000006d", "value":1.000000 }, { "addr":"24000025e0000062", "value":1.000000 }, { "addr":"240000264000007d", "value":1.000000 }, { "addr":"24000028a0000068", "value":1.000000 }, { "addr":"240000260000006d", "value":1.000000 }, { "addr":"2400002560000064", "value":1.000000 }, { "addr":"2400002820000095", "value":1.000000 }, { "addr":"240000276000006e", "value":1.000000 }, { "addr":"24000028a0000059", "value":1.000000 }, { "addr":"24000028e000006d", "value":1.000000 }, { "addr":"240000292000007a", "value":1.000000 }, { "addr":"240000284000007a", "value":1.000000 }, { "addr":"24000023200000a1", "value":1.000000 }, { "addr":"24000026c000005b", "value":1.000000 }, { "addr":"240000284000007d", "value":1.000000 }, { "addr":"24000029a00000ad", "value":1.000000 }, { "addr":"240000282000009a", "value":1.000000 }, { "addr":"2400002a80000080", "value":1.000000 }, { "addr":"24000028e000009f", "value":1.000000 }, { "addr":"2400002620000089", "value":1.000000 }, { "addr":"240000262000005a", "value":1.000000 }, { "addr":"24000025c0000078", "value":1.000000 }, { "addr":"2400002b80000063", "value":1.000000 }, { "addr":"24000021600000a3", "value":1.000000 }, { "addr":"2400002620000059", "value":1.000000 }, { "addr":"2400002320000073", "value":1.000000 } ] } }
The constraints in time are special. To for a daily timeseries for the first 20 days of 2013 we query
http://localhost:29512/query/@time=0:24:20
and the result is
{ "levels":[ "time" ], "root":{ "addr":"0", "children":[ { "addr":"d8000000f0", "value":0.000000 }, { "addr":"c0000000d8", "value":1.000000 }, { "addr":"a8000000c0", "value":1.000000 }, { "addr":"90000000a8", "value":0.000000 }, { "addr":"7800000090", "value":0.000000 }, { "addr":"6000000078", "value":0.000000 }, { "addr":"4800000060", "value":0.000000 }, { "addr":"3000000048", "value":0.000000 }, { "addr":"1800000030", "value":0.000000 }, { "addr":"18", "value":0.000000 }, { "addr":"f000000108", "value":0.000000 }, { "addr":"10800000120", "value":1.000000 }, { "addr":"12000000138", "value":0.000000 }, { "addr":"13800000150", "value":0.000000 }, { "addr":"15000000168", "value":0.000000 }, { "addr":"16800000180", "value":0.000000 }, { "addr":"18000000198", "value":1.000000 }, { "addr":"198000001b0", "value":0.000000 }, { "addr":"1b0000001c8", "value":0.000000 }, { "addr":"1c8000001e0", "value":0.000000 } ] } }
The 64-bit time addresses are decoded as an interval [a,b)
intervals
the first 32 bits is for b
and the most significal bits are for a
.
Here is a code snippet in python for the first bucket that shows up in
the result with address "d8000000f0" (note these buckets don't come in
order).
>>> addr = 0xd8000000f0
>>> b = addr & 0xFFFFFFFF
>>> a = (addr >> 32) & 0xFFFFFFFF
>>> a
216
>>> b
240
So it is the bucket that counts all events from hour 216 of year 2013 to hour 239 of 2013.
Nanocube building times are much much faster if the records are sorted in time (which we haven't done in this small example).