#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2016 Pascal Jürgens and Andreas Jungherr
# See License.txt
"""
Database Access
---------------
Defines the database structure and provides access
Requirements:
- libraries peewee (database), dateutil (parsing date strings) and pytz (timezone handling)
Using SQL-based databases
=========================
This module implements a simple database that can be used to store structured twitter data. There are many other approaches (see below) that might be more appropriate for your use case. SQL-based databases impose strict restrictions on the data that can be stored. In particular, we need to define the structure of the data in advance by designing data *models*. These models guarantee that anything that is stored in the database conforms to our expectations. It is not possible to store incomplete (for example a Tweet missing its ID), duplicate (two Tweets with the same ID) or wrong data (such as writing the Tweet text into the ID field). By explicitly failing, the database makes errors visible that might have severe consequences for research but might otherwise go unnoticed.
The library used in this module (peewee) is one of many so-called "ORMs" (object-relational mappers). It facilitates working with the database by abstracting many common tasks into methods that are easier to use and more consistent with the programming language. Other popular ORMs are SQLAlchemy and the Django ORM. Thanks to ORMs, you don't need to learn SQL (a programming language of its own) just to work with your data. Still, there are a few key concepts that you should be comfortable with:
- Queries: We interact with databases through *queries*. Queries can fetch specific or all records, insert new data or change the way the database is structured.
- Tables: Just like an Excel sheet or an R data frame, database tables represent a set of columns and rows. After defining a *model*, we create its corresponding table.
- Foreign / Many to Many Relationships: Most of the time, our models have some sort of links between them. For example, below we define a model named Tweet and a model named User. Each Tweet must have exactly one author, who in turn is a User object. Because Users and Tweets are stored in different tables, the database manages them via a feature called foreign keys. Instead of storing the entire User in the table that holds Tweets (which would duplicate the user data over and over again and make it hard to keep track of and search for), it just stores a link. If Tweet number 1 had User number 1 with username "alpha" as its author, the database representation of the Tweet's "User" field would contain the number 1. Many to Many Relationships are similar but allow storing a list of things, such as a Tweet with multiple Hashtags.
- Joins: The database stores models in different tables. If we are interested in pieces of data on multiple tables, it needs to look at those tables in conjunction. A typical case would be searching for all Tweets from one username: first, we'd look up the user with the given username, and then filter the Tweets table by this ID (see the sketch after this list). In SQL, such an operation is called a JOIN. [REFERENCE: http://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins]
- Transactions: We stated that SQL databases only store valid data. They do so by checking any input first and only accepting it once it passes the test. The process of storing data is called a *transaction*: If it succeeds, the data is guaranteed to be present. If it fails, transactions are *rolled back* and the database returns to its previous state.
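As a rough illustration of such a JOIN (a sketch using the Tweet and User models defined further down in this module; "alpha" is just a placeholder username), the lookup could be written in peewee like this:
# Find all Tweets whose author has the username "alpha".
# peewee turns this into a single SQL query that JOINs the Tweet and User tables.
tweets_by_alpha = (Tweet
                   .select()
                   .join(User, on=(Tweet.user == User.id))
                   .where(User.username == "alpha"))
for tweet in tweets_by_alpha:
    print(tweet.text)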
Peewee works with the most popular SQL-based databases - among them PostgreSQL (our favorite for large projects) and MySQL, which run as stand-alone server applications. In an attempt to keep this tutorial simple and lightweight, we chose to instead use SQLite, an *in-process database*. SQLite does not need an external application and is transparent and intuitive as to where and how it stores its data. SQLite databases are simply files that can be stored in any directory you want. A program simply declares which database file it wants to use and the ORM handles all the details of storage. This makes the datasets portable, easy to handle and reason about. At the same time, the database allows us to search for specific content without needing to iterate over all the items. There is one caveat: As with raw files (see below), we strongly recommend against accessing the database from multiple programs at the same time. SQLite offers better data consistency guarantees than raw files, but you should still separate data storage and analysis operations.
One final benefit of peewee: Should you ever feel the need for larger, faster storage, you can just download PostgreSQL and swap it in for SQLite. Your code will continue to run without any major changes (apart from the database configuration, of course).
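For example (a sketch; the database name, user and password are placeholders for your own PostgreSQL setup), the only line that would need to change is the database definition near the top of this module:
# Point peewee at a PostgreSQL server instead of the SQLite file used below.
db = peewee.PostgresqlDatabase('tweets', user='myuser', password='secret', host='localhost')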
Handling Usernames and IDs
==========================
Twitter's data schemata have some implications that are not readily apparent at first glance. Before modeling the data structures used to store tweets, it pays to have a closer look at what the official documentation says about the uniqueness of objects. Tweets and users both have an ID field which contains a large number that is guaranteed to be unique. Ergo, there is no other user sharing that number and no other Tweet sharing the same ID. However, the same does not apply to other attributes, notably usernames (screen_name). Although Twitter does not allow users to pick a name that is already chosen, usernames become available once the previous account has been deleted. In practice, this happens surprisingly often. The larger a data collection and the longer its timeframe, the more likely it is that usernames collide.
The correct solution to this is to always use user IDs, not usernames, as identifiers for users. The database schema below does this by defining the user model with a unique ID while allowing any (non-unique) username. The example dataset provided with the package actually contains seven cases where usernames collide. You can find them by inspecting the list of usernames:
# Create a list of usernames from all Users:
usernames = [user.username for user in User.select(User.username)]
# Create a counting object from the list
from collections import Counter
namecount = Counter(usernames)
# Show the ten usernames that occur the most often across all users. The number following the string corresponds to the number of unique users with this name
print(namecount.most_common(10))
Out[1]: [('halseycupcakes', 2), ('bootsntwinks', 2), ('kilIdrake', 2), ('junhoestan', 2), ('StephenWolfUNC', 2), ('JackofKent', 2), ('oooh_sebastian', 2), ('neumantj', 1), ('JohnHollings', 1), ('Guidotoons', 1)]
A Special Note on Date and Time
===============================
Date and time are crucial to almost all kinds of data analysis, but at the same time they are notoriously difficult to handle. There are two major sources of confusion and errors, which correspond to the layers of modifications that are performed to get from a universal reference time to local time: (a) a timezone offset is added to a reference time to get local time and (b) if applicable, a daylight saving offset yields the seasonally correct time. (This presentation on date and time may be helpful: [https://www.youtube.com/watch?v=keUbVvMJeKY]).
When working with Twitter data, it is crucial to keep two things in mind and strictly separate: The Twitter API always returns date/time in UTC, the universal reference time (no timezone offset, no daylight saving time!). However, depending on your object of study, users will see and use their local time. In many cases, we need to convert the native UTC dates into some other format. Some best practices for handling this are:
- Always store and/or explicitly declare the timezone information of your data
- Convert at the latest possible opportunity
- Store data in the most generic format possible (this usually means UTC)
Within this module, we have to compromise on date handling since SQLite does not support timezones. One option is storing the datetime as a string, but that prevents efficient date range queries. Instead, we chose to store UTC datetime without timezone information. So please keep in mind that you might need to convert both your queries and the resulting data. For example, if you're interested in tweets from New York City on January 1st of 2015, midnight to noon, you would first convert the range to UTC times:
from pytz import timezone
from pytz import utc
from datetime import datetime
# Hint: While it's possible to use timezone abbreviations such as "MST", some of those (such as CST) are ambiguous!
# While the pytz package will prevent you from using them, readers and users of your code might be fooled.
ny = timezone("America/New_York")
# Note: pytz timezones must be attached via localize() - passing tzinfo= directly
# would silently yield a slightly wrong (LMT-based) offset.
start_date = ny.localize(datetime(2015, 1, 1, 0))
stop_date = ny.localize(datetime(2015, 1, 1, 12))
start_date_utc = utc.normalize(start_date)
stop_date_utc = utc.normalize(stop_date)
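Conversely, datetimes read back from this database are naive UTC values. A sketch (some_tweet stands for any Tweet object fetched from the database defined below) for displaying such a value in New York local time:
# Attach the UTC timezone to the naive stored value, then convert for display
local_date = utc.localize(some_tweet.date).astimezone(ny)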
Alternate ways of storing data
==============================
Caveat emptor: A comprehensive description of all data storage approaches is far beyond the scope of this tutorial. That being said, there are three options we wanted to highlight.
1. Store tweets as json (in files). The most trivial approach to storing data is to just put it into a file. In Python, one can choose among a handful of serialization formats such as Pickle, json and yaml. Since Tweets are almost always received as json, it is often a good idea to take the raw data as received and store it in a file. The usual approach (that GNIP uses as well) is to write one tweet per line. That allows for easy parsing and handling of the files. Compression can be used to make this kind of storage more space efficient. A typical way of creating such a file would look like this:
with open("my-tweets.json", "w") as f:
# MY_TWEETS would be the variable or generator function that holds your data
for tweet in MY_TWEETS:
f.write(json.dumps(tweet) + "\n")
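Reading such a file back follows the same one-tweet-per-line convention (a sketch, assuming the file written above and reusing the json import):
with open("my-tweets.json") as f:
    tweets = [json.loads(line) for line in f]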
The advantage of this format is definitely its ease of use and its portability. If you plan on using this for archiving your data, please make sure to safeguard against data corruption and data loss. Store the data in multiple places, use an error-tolerant filesystem such as ZFS or BTRFS or create recovery files with the PAR utility.
The major disadvantage of this format is that for each iteration of your analysis, you need to ingest the whole file. That is a process that can be time-consuming and inefficient, especially as your dataset grows. In addition, multiple programs accessing a single file at the same time are a recipe for disaster.
2. Schemaless databases (sometimes called NoSQL databases). There is an abundance of databases that allow data storage without the strict schema requirements imposed by SQL databases. The most prominent example is MongoDB, which stores arbitrary JSON documents. This ease of use can be a downside, as schemaless storage means that broken, wrong and partial data can be saved without anyone noticing. Researchers must take special care to ensure that their datasets are consistent.
3. Distributed databases. Both SQL and schemaless databases may allow running co-ordinated clusters across multiple computers. Using such a large system is usually much too complex, expensive and error-prone to be useful for twitter research. We only recommend evaluating distributed storage if your dataset is larger than 1 TB and/or if you have access to existing infrastructure.
"""
import logging
import datetime
from dateutil import parser
from pytz import utc, timezone
MST = timezone("MST")
import peewee
from playhouse.fields import ManyToManyField
# The name for the database is hardcoded here. Go ahead and change it if you want to.
# If you need to alter the filename programmatically, either load it from some external
# place such as a file or define these models in your own code.
# Set up database
# Attention! Before using the database, we need to create its tables -
# see code at the end of the file
db = peewee.SqliteDatabase('tweets.db', threadlocals=True)
db.connect()
#
# Database Models: Define the structure of our database
#
class BaseModel(peewee.Model):
"""
Base model for setting the database to use
"""
class Meta:
database = db
class Hashtag(BaseModel):
"""
Hashtag model.
"""
tag = peewee.CharField(unique=True, primary_key=True)
class Language(BaseModel):
"""
Language model.
"""
language = peewee.CharField(unique=True, primary_key=True)
class URL(BaseModel):
"""
URL model.
"""
url = peewee.CharField(unique=True, primary_key=True)
class User(BaseModel):
"""
Twitter user model.
Stores the user's unique ID as a primary key along with the username.
"""
id = peewee.BigIntegerField(unique=True, primary_key=True)
username = peewee.CharField(null=True)
def last_tweet(self):
return Tweet.select().where(Tweet.user == self).order_by(Tweet.id.desc())[0]
def first_tweet(self):
return Tweet.select().where(Tweet.user == self).order_by(Tweet.id.asc())[0]
class Tweet(BaseModel):
"""
Tweet model.
Stores the tweet's unique ID as a primary key along with the user, text and date.
"""
id = peewee.BigIntegerField(unique=True, primary_key=True)
user = peewee.ForeignKeyField(User, related_name='tweets', index=True)
text = peewee.TextField()
date = peewee.DateTimeField(index=True)
tags = ManyToManyField(Hashtag)
urls = ManyToManyField(URL)
language = peewee.ForeignKeyField(Language, null=True)
mentions = ManyToManyField(User)
reply_to_user = peewee.ForeignKeyField(
User, null=True, index=True, related_name='replies')
reply_to_tweet = peewee.BigIntegerField(null=True, index=True)
retweet = peewee.ForeignKeyField(
'self', null=True, index=True, related_name='retweets')
#
# Helper functions for loading data into the database
#
def deduplicate_lowercase(l):
"""
Helper function that performs two things:
- Converts everything in the list to lower case
- Deduplicates the list by converting it into a set and back to a list
"""
    valid = list(filter(None, l))
    if len(l) != len(valid):
        logging.warning("The input contained {0} empty entries, skipping those. Please verify that it is complete and valid.".format(
            len(l) - len(valid)))
    lowercase = [e.lower() for e in valid]
    deduplicated = list(set(lowercase))
    return deduplicated
def create_user_from_tweet(tweet):
"""
Function for creating a database entry for
one user using the information contained within a tweet
:param tweet:
:type tweet: dictionary from a parsed tweet
:returns: database user object
"""
user, created = User.get_or_create(
id=tweet['user']['id'],
defaults={'username': tweet['user']['screen_name']},
)
return user
def create_hashtags_from_entities(entities):
"""
Attention: Casts tags into lower case!
Function for creating database entries for
hashtags using the information contained within entities
:param entities:
:type entities: dictionary from a parsed tweet's "entities" key
:returns: list of database hashtag objects
"""
tags = [h["text"] for h in entities["hashtags"]]
# Deduplicate tags since they may be used multiple times per tweet
tags = deduplicate_lowercase(tags)
db_tags = []
for h in tags:
tag, created = Hashtag.get_or_create(tag=h)
db_tags.append(tag)
return db_tags
def create_language_from_tweet(tweet):
"""
"""
language, created = Language.get_or_create(language=tweet["lang"])
return language
def create_urls_from_entities(entities):
"""
Attention: Casts urls into lower case!
Function for creating database entries for
urls using the information contained within entities
:param entities:
:type entities: dictionary from a parsed tweet's "entities" key
:returns: list of database url objects
"""
urls = [u["expanded_url"] for u in entities["urls"]]
urls = deduplicate_lowercase(urls)
db_urls = []
for u in urls:
url, created = URL.get_or_create(url=u)
db_urls.append(url)
return db_urls
def create_users_from_entities(entities):
"""
Function for creating database entries for
users using the information contained within entities
:param entities:
:type entities: dictionary from a parsed tweet's "entities" key
:returns: list of database user objects
"""
users = [(u["id"], u["screen_name"]) for u in entities["user_mentions"]]
users = list(set(users))
db_users = []
for id, name in users:
user, created = User.get_or_create(
id=id,
defaults={'username': name},
)
db_users.append(user)
return db_users
def create_tweet_from_dict(tweet, user=None):
"""
Function for creating a tweet and all related information as database entries
from a dictionary (that's the result of parsed json)
This does not do any deduplication, i.e. there is no check whether the tweet is
    already present in the database. If it is, there will be a UNIQUE CONSTRAINT exception.
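    Typical usage (a sketch; assumes a file with one json-encoded tweet per
    line, as described in the module docstring):
        import json
        with open("my-tweets.json") as f:
            for line in f:
                create_tweet_from_dict(json.loads(line))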
:param tweet:
:type tweet: dictionary from a parsed tweet
    :returns: the Tweet database object on success, False on failure
    """
"""
# If the user isn't stored in the database yet, we
# need to create it now so that tweets can reference her/him
try:
if not user:
user = create_user_from_tweet(tweet)
tags = create_hashtags_from_entities(tweet["entities"])
urls = create_urls_from_entities(tweet["entities"])
language = create_language_from_tweet(tweet)
mentions = create_users_from_entities(tweet["entities"])
# Create new database entry for this tweet
t, created = Tweet.get_or_create(
id=tweet['id'],
user=user,
text=tweet['text'],
# We are parsing Twitter's date format using a "magic" parser from the python-dateutil package
# The resulting datetime object has timezone information attached.
# However, since SQLite cannot store timezones, that information is stripped away.
# If you use PostgreSQL instead, please refer to the DateTimeTZField in peewee
# and remove the "strftime" call here
date=parser.parse(tweet['created_at']).strftime(
"%Y-%m-%d %H:%M:%S"),
)
if tags:
t.tags = tags
if urls:
t.urls = urls
if mentions:
t.mentions = mentions
if language:
t.language = language
if tweet["in_reply_to_user_id"]:
# Create a mock user dict so we can re-use create_user_from_tweet
reply_to_user_dict = {"user":
{'id': tweet['in_reply_to_user_id'],
'screen_name': tweet['in_reply_to_screen_name'],
}}
reply_to_user = create_user_from_tweet(reply_to_user_dict)
t.reply_to_user = reply_to_user
t.reply_to_tweet = tweet['in_reply_to_status_id']
if 'retweeted_status' in tweet:
retweet = create_tweet_from_dict(tweet['retweeted_status'])
t.retweet = retweet
t.save()
return t
except peewee.IntegrityError as exc:
logging.error(exc)
return False
#
# Helper functions to get summary statistics over the given database
#
def database_counts():
"""
Generate counts for objects in the database.
:returns: dictionary with counts
"""
return {
"tweets": Tweet.select().count(),
"hashtags": Hashtag.select().count(),
"urls": URL.select().count(),
"users": User.select().count(),
}
def mention_counts(start_date, stop_date):
"""
Perform an SQL query that returns users sorted by mention count.
Users are returned as database objects in decreasing order.
The mention count is available as ".count" attribute.
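    Example usage (a sketch; start and stop are placeholders for timezone-aware
    datetime objects):
        for user in mention_counts(start, stop).limit(10):
            print(user.username, user.count)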
"""
# First we get the Table that sits between Tweets and Users
mentions = Tweet.mentions.get_through_model()
# The query - note the Count statement creating our count variable
users = (User.select(User, peewee.fn.Count(mentions.id).alias('count'))
# join in the intermediary table
.join(mentions)
# join in the tweets
.join(Tweet, on=(mentions.tweet == Tweet.id))
# filter by date
.where(Tweet.date >= to_utc(start_date), Tweet.date < to_utc(stop_date))
# group by user to eliminate duplicates
.group_by(User)
# sort by tweetcount
.order_by(
peewee.fn.Count(mentions.tweet).desc())
)
return users
def url_counts(start_date, stop_date):
"""
Perform an SQL query that returns URLs sorted by mention count.
URLs are returned as database objects in decreasing order.
The mention count is available as ".count" attribute.
"""
urlmentions = Tweet.urls.get_through_model()
urls = (URL.select(URL, peewee.fn.Count(urlmentions.id).alias('count'))
.join(urlmentions)
.join(Tweet, on=(urlmentions.tweet == Tweet.id))
.where(Tweet.date >= to_utc(start_date), Tweet.date < to_utc(stop_date))
.group_by(URL)
.order_by(peewee.fn.Count(urlmentions.tweet).desc())
)
return urls
def hashtag_counts(start_date, stop_date):
"""
Perform an SQL query that returns hashtags sorted by mention count.
Hashtags are returned as database objects in decreasing order.
The mention count is available as ".count" attribute.
"""
hashtagmentions = Tweet.tags.get_through_model()
hashtags = (Hashtag.select(Hashtag, peewee.fn.Count(hashtagmentions.id).alias('count'))
.join(hashtagmentions)
.join(Tweet, on=(hashtagmentions.tweet == Tweet.id))
.where(Tweet.date >= to_utc(start_date), Tweet.date < to_utc(stop_date))
.group_by(Hashtag)
.order_by(peewee.fn.Count(hashtagmentions.tweet).desc())
)
return hashtags
def retweet_counts(start_date, stop_date, n=50):
"""
Find most retweeted users.
Instead of performing a rather complex SQL query, we do this in more
readable python code. It may take a few minutes to complete.
    The date range is limited via the start_date and stop_date arguments
    (converted to UTC via to_utc), and n controls how many of the
    most-retweeted users are returned.
"""
from collections import Counter
rt = Tweet.alias()
rtu = User.alias()
baseline = (Tweet.select().
where(Tweet.date >= to_utc(start_date),
Tweet.date < to_utc(stop_date))
)
query = (baseline.select(Tweet.id, rt.id, rtu.id)
.join(rt, on=(Tweet.retweet == rt.id))
.join(rtu, on=(rt.user == rtu.id))
)
c = Counter(
(tweet.retweet.user.id for tweet in query)
)
# We use an ordered dict for the results so that the top results
# appear first
from collections import OrderedDict
results = OrderedDict()
for k, v in c.most_common(n):
results[User.get(id=k).username] = v
return results
#
# Helper functions for querying data
#
def tweetcount_per_user():
"""
This function executes a query that:
- joins the User and Tweet tables so we can reason about their Relationship
- groups by User so we have one result per User
- counts tweets per user and stores the result in the name "count"
- orders users by descending tweet count
The users objects are accessible when looping over the query, such as:
for user in query:
print("{0}: {1}".format(user.username, user.count))
Note that as always, this query can be augmented by appending further operations.
"""
tweet_ct = peewee.fn.Count(Tweet.id)
query = (User
.select(User, tweet_ct.alias('count'))
.join(Tweet)
.group_by(User)
.order_by(tweet_ct.desc(), User.username))
return query
def first_tweet():
"""
Find the first Tweet by date
"""
return Tweet.select().order_by(Tweet.date.asc()).first()
def last_tweet():
"""
Find the last Tweet by date
"""
return Tweet.select().order_by(Tweet.date.desc()).first()
def to_utc(dt):
"""
Helper function to return UTC-based datetime objects.
If the input datetime object has any timezone information, it
is converted to UTC. Otherwise, the datetime is taken as-is and
only the timezone information UTC is added.
"""
if dt.tzinfo is None:
logging.warning(
"You supplied a naive date/time object. Timezone cannot be guessed and is assumed to be UTC. If your date/time is NOT UTC, you will get wrong results!")
return utc.localize(dt)
else:
return utc.normalize(dt)
def objects_by_interval(Obj, date_attr_name="date", interval="day", start_date=None, stop_date=None):
"""
General helper function that returns objects by date intervals, mainly useful for counting.
WARNING: If used as-is with SQLite, all date/times in data and queries are UTC-based!
If you want to use local time for queries, take note that it will be converted correctly
ONLY if you supply the correct timezone information. In general, as long as you only
use timezone-aware objects you should be safe.
    :param Obj:
    :type Obj: database model class
    :param date_attr_name:
    :type date_attr_name: the name of the field containing date/time information on the model Obj, as a string
:param interval:
:type interval: day (default), hour, minute as string.
:param start_date:
    :type start_date: date to start from as datetime object, defaults to the start of the example dataset's collection period (2015-10-27, MST).
    :param stop_date:
    :type stop_date: date to stop on as datetime object, defaults to the end of the example dataset's collection period (2015-11-02, MST).
    :returns: generator yielding ((interval_start, interval_stop), query) tuples, one per interval
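    Example usage (a sketch): count tweets per day over the default date range:
        for (interval_start, interval_stop), query in objects_by_interval(Tweet):
            print(interval_start.date(), query.count())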
"""
# define intervals, then select the one given as a function argument
interval = {
"minute": datetime.timedelta(minutes=1),
"hour": datetime.timedelta(hours=1),
"day": datetime.timedelta(days=1),
}.get(interval)
date_field = getattr(Obj, date_attr_name)
    # Fall back to hardcoded default dates (the example dataset's collection period) if none are given
# Todo: Maybe prettify this humongous expression.
start_date = start_date or MST.localize(datetime.datetime(2015, 10, 27, 0))
stop_date = stop_date or MST.localize(
datetime.datetime(2015, 11, 2, 23, 59))
# If we wanted to use the first and last element instead of the pre-determined dates,
# this would be the way to do it:
# getattr(Obj.select().order_by(date_field.asc()).first(), date_attr_name)
# getattr(Obj.select().order_by(date_field.desc()).first(), date_attr_name)
# Ensure UTC times
start_date = to_utc(start_date)
stop_date = to_utc(stop_date)
# This iteration code could be shorter, but using two dedicated variables
# makes it more intuitive.
interval_start = start_date
interval_stop = start_date + interval
# Notice that this loop stops BEFORE the interval extends beyond stop_date
# This way, we may get intervals that do not reach stop_date, but on the other hand
# we never get intervals that are not covered by the data.
while interval_stop <= stop_date:
query = Obj.select().where(
date_field >= interval_start, date_field < interval_stop)
# First yield the results, then step to the next interval
yield ((interval_start, interval_stop), query)
interval_start += interval
interval_stop += interval
#
# Onetime setup
#
# Set up database tables. This needs to run at least once before using the db.
try:
db.create_tables([Hashtag, URL, User, Language, Tweet, Tweet.tags.get_through_model(
), Tweet.urls.get_through_model(), Tweet.mentions.get_through_model(), ])
except Exception as exc:
logging.debug(
"Database setup failed, probably already present: {0}".format(exc))