Storing raw message data #166

chylex · 2022-03-05T11:47:06Z

chylex
Mar 5, 2022
Maintainer

Opening an exploration of storing raw message data in the database. This could eventually make it possible to resolve some long-standing feature requests that depend on data DHT does not currently store.

The obvious issue is that the database size would expand, possibly by an order of magnitude, so this would have to be an optional feature. .NET has native support for Brotli compression, which should help reduce the size.

I will create a proof of concept and determine what database size can realistically be expected with this feature enabled.

chylex · 2022-03-05T14:44:51Z

chylex
Mar 5, 2022
Maintainer Author

On a test set of ~2300 messages, the results are:

Method (Brotli quality level)	Database size (relative to Disabled)
Disabled	648 kB
Uncompressed	4288 kB (6.6x)
Brotli (1)	2668 kB (4.1x)
Brotli (4)	2528 kB (3.9x)
Brotli (5)	2448 kB (3.8x)
Brotli (10)	2300 kB (3.6x)
Brotli (11)	2248 kB (3.5x)

Note that with Brotli 11, storing messages took about 10x longer, which caused a noticeable slowdown in tracking speed. With Brotli 5, the difference in speed was negligible, so that seems to be the way to go. Quality levels 6-9 did not improve on the size achieved at level 5.
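For anyone who wants to reproduce this kind of size/speed comparison, here is a minimal sketch of the measurement. It is not the code used for the numbers above: it uses Python's standard-library `zlib` (DEFLATE, levels 1-9) as a stand-in for Brotli, and the message fields are made-up placeholders, so only the shape of the experiment carries over.

```python
import json
import time
import zlib


def make_messages(count):
    # Synthetic stand-ins for raw message JSON; the field names are assumptions.
    return [
        json.dumps({
            "id": str(10**17 + i),
            "channel_id": "123456789012345678",
            "author": {"id": str(1000 + i % 50), "username": f"user{i % 50}", "bot": False},
            "content": f"message number {i}",
            "edited_timestamp": None,
            "attachments": [],
            "embeds": [],
        }).encode("utf-8")
        for i in range(count)
    ]


def measure(messages, level):
    # Compress every message separately, as it would be stored one row at a time.
    start = time.perf_counter()
    total = sum(len(zlib.compress(message, level)) for message in messages)
    return total, time.perf_counter() - start


messages = make_messages(2300)
raw_size = sum(len(message) for message in messages)
results = {level: measure(messages, level) for level in (1, 5, 9)}

for level, (size, seconds) in results.items():
    print(f"level {level}: {size} bytes ({size / raw_size:.2f}x of raw), {seconds * 1000:.1f} ms")
```

Compressing per message rather than per batch matters here: short inputs give the compressor little history to work with, which is one reason higher quality levels may yield only small gains.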

Unfortunately, there is no support for custom dictionaries yet, at least in the .NET APIs. That could be interesting to experiment with, since the message JSON format is fairly rigid.
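To illustrate why a custom dictionary is attractive for rigid JSON, here is a sketch using the preset dictionary feature of DEFLATE through Python's `zlib` (the `zdict` parameter). This is a swapped-in technique, not Brotli, and the dictionary contents and message fields are hypothetical; a real dictionary would be built from samples of actual messages.

```python
import json
import zlib

# Hypothetical dictionary: a skeleton of the keys expected in every message.
PRESET_DICT = json.dumps({
    "id": "", "channel_id": "",
    "author": {"id": "", "username": "", "bot": False},
    "content": "", "edited_timestamp": None, "attachments": [], "embeds": [],
}).encode("utf-8")


def compress(data, zdict=None):
    # zlib does not accept zdict=None, so only pass it when a dictionary is given.
    if zdict is None:
        compressor = zlib.compressobj(level=6)
    else:
        compressor = zlib.compressobj(level=6, zdict=zdict)
    return compressor.compress(data) + compressor.flush()


def decompress(blob, zdict=None):
    # The same dictionary must be supplied again to decompress.
    if zdict is None:
        decompressor = zlib.decompressobj()
    else:
        decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()


message = json.dumps({
    "id": "100000000000000042", "channel_id": "123456789012345678",
    "author": {"id": "1007", "username": "user7", "bot": False},
    "content": "hello", "edited_timestamp": None, "attachments": [], "embeds": [],
}).encode("utf-8")

without_dict = compress(message)
with_dict = compress(message, PRESET_DICT)
print(len(message), len(without_dict), len(with_dict))
```

The trade-off is that the dictionary becomes part of the storage format: it has to ship with the app and can never change for existing data unless a dictionary version is stored alongside each blob.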

0 replies

chylex · 2022-03-05T15:11:27Z

chylex
Mar 5, 2022
Maintainer Author

As a side note, I might split the main messages table. Most of the time, the columns with the edit timestamp and reply id have no value, but they still take a lot of space. In the test set, moving those two columns to separate tables reduced the size from 648 kB to 396 kB. I will have to make sure there isn't a noticeable degradation in speed in large databases, but it should be a major improvement.
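A rough sketch of what such a split could look like, using Python's `sqlite3` module. The table and column names are assumptions made for illustration and are not taken from the actual DHT schema. The rarely populated values move into side tables keyed by message id, and a `LEFT JOIN` brings them back when loading.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: only columns that every message has stay here.
    CREATE TABLE messages (
        message_id INTEGER PRIMARY KEY,
        sender_id  INTEGER NOT NULL,
        channel_id INTEGER NOT NULL,
        text       TEXT NOT NULL,
        timestamp  INTEGER NOT NULL
    );
    -- Rows exist only for messages that were edited.
    CREATE TABLE message_edit_timestamps (
        message_id     INTEGER PRIMARY KEY,
        edit_timestamp INTEGER NOT NULL
    );
    -- Rows exist only for messages that are replies.
    CREATE TABLE message_replied_to (
        message_id    INTEGER PRIMARY KEY,
        replied_to_id INTEGER NOT NULL
    );
""")

conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?, ?)", [
    (1, 10, 100, "hello", 1646480000),
    (2, 11, 100, "hi there", 1646480060),
    (3, 10, 100, "edited reply", 1646480120),
])
conn.execute("INSERT INTO message_edit_timestamps VALUES (3, 1646480180)")
conn.execute("INSERT INTO message_replied_to VALUES (3, 2)")


def load_messages(connection):
    # Missing side-table rows come back as NULL (None in Python).
    return connection.execute("""
        SELECT m.message_id, m.text, e.edit_timestamp, r.replied_to_id
        FROM messages m
        LEFT JOIN message_edit_timestamps e ON e.message_id = m.message_id
        LEFT JOIN message_replied_to r ON r.message_id = m.message_id
        ORDER BY m.message_id
    """).fetchall()


rows = load_messages(conn)
for row in rows:
    print(row)
```

The two extra joins are the cost of this layout when loading, so load time on a large database is the thing to measure.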

3 replies

chylex Mar 5, 2022
Maintainer Author

On a 260K message database, the migration took about 3 seconds and reduced the size from 42.3 MB to 30.4 MB (-28%). That is with 2.5K edited messages and 100 replies, which isn't entirely representative because people on most servers use the reply feature much more often, but any database where the majority of messages are neither edited nor replies should benefit. Performance testing TBD.

TheTechRobo Mar 5, 2022

That'll especially help with scraping old Discord servers!

chylex Mar 5, 2022
Maintainer Author

Loading messages took almost exactly 1 s on average on the migrated database, and about 950 ms on the old one. That is when creating a viewer with all 260K messages.

TheTechRobo · 2022-03-05T16:20:46Z

TheTechRobo
Mar 5, 2022

Does .NET support zstd? That's a pretty amazing compressor, and it's really fast. It compressed a 700MB file in about 1 second for me! (https://gist.github.com/TheTechRobo/091acf60f80d007557ad35821fdb3a6d)

1 reply

chylex Mar 5, 2022
Maintainer Author

Not natively. There are third-party libraries for it, but I don't know whether they ship builds of zstd for all the different platforms supported by .NET, and if they do, I suspect it would massively increase the size of the portable build.

chylex · 2022-10-22T01:24:43Z

chylex
Oct 22, 2022
Maintainer Author

I have experimented a bit with LiteDB, which is an embedded JSON database. I have concerns about its stability compared to SQLite, but it would be a more natural way of storing Discord's data.

By default, LiteDB databases are quite large compared to SQLite. The stored data also includes a lot of redundancy, so there are opportunities to reduce the size at the expense of time, and sometimes at the expense of the ability to fully reconstruct the raw data, but I think that could be configurable. Most users probably don't need the truly raw data, which includes null values and empty arrays.
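As an illustration of the "no empty values" idea, here is a small sketch that recursively drops null values and empty arrays from a parsed JSON document before it is stored. The function name and the sample fields are hypothetical. The transformation is lossy by design: afterwards it is no longer possible to tell a key that was null from one that was absent.

```python
def strip_empty(value):
    """Recursively drop dict entries whose value is None or an empty list."""
    if isinstance(value, dict):
        cleaned = {key: strip_empty(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items()
                if item is not None and item != []}
    if isinstance(value, list):
        return [strip_empty(item) for item in value]
    # Scalars (including 0, False and empty strings) are kept as they are.
    return value


raw = {
    "id": "1",
    "content": "",
    "pinned": False,
    "edited_timestamp": None,
    "attachments": [],
    "embeds": [{"title": "x", "fields": [], "url": None}],
}

print(strip_empty(raw))
# {'id': '1', 'content': '', 'pinned': False, 'embeds': [{'title': 'x'}]}
```

Values such as `False`, `0` and empty strings are deliberately kept, since dropping them would change the meaning of the message rather than only its verbosity.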

There is also further redundancy that could be compressed using knowledge of Discord's JSON formats and common values. This would not allow for simple extraction of raw data using external tools, but the app could support an export to JSON that performs the decompression.

Tests on 25K messages:

Type	Size
LiteDB (Raw)	35 MB
LiteDB (No Empty Values)	29 MB
LiteDB (No Empty Values and Author)	14 MB
SQLite	4.8 MB

Note: "No Empty Values and Author" strips the message author information, which is usually duplicated across messages. It would be possible to store author information separately, possibly even with support for bots/webhooks that might customize author information per message. The final size would then be larger than 14 MB, but likely not by much, so most of the roughly 50% savings should remain. I have not tested any custom compression yet.
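One way the separate author storage could work is sketched below, under assumptions. Each message keeps only an author id, the first author object seen for that id is stored once, and a message whose author object differs (for example a webhook customizing its name) keeps its own full copy as an override. The names `split_authors`, `restore_author`, `author_id` and `author_override` are hypothetical.

```python
def split_authors(messages):
    """Return (authors by id, messages with the author object factored out)."""
    authors = {}
    stored = []
    for message in messages:
        message = dict(message)  # shallow copy so the input is not modified
        author = message.pop("author")
        known = authors.setdefault(author["id"], author)
        if known != author:
            # Per-message customization: keep this message's full author object.
            message["author_override"] = author
        message["author_id"] = author["id"]
        stored.append(message)
    return authors, stored


def restore_author(authors, message):
    """Rebuild the original message from its stored form."""
    message = dict(message)
    shared = authors[message.pop("author_id")]
    message["author"] = message.pop("author_override", shared)
    return message


originals = [
    {"id": "1", "content": "a", "author": {"id": "7", "username": "alice"}},
    {"id": "2", "content": "b", "author": {"id": "7", "username": "alice"}},
    {"id": "3", "content": "c", "author": {"id": "9", "username": "hook"}},
    {"id": "4", "content": "d", "author": {"id": "9", "username": "hook (custom)"}},
]

authors, stored = split_authors(originals)
restored = [restore_author(authors, message) for message in stored]
print(len(authors), sum("author_override" in message for message in stored))
```

Storing the whole object as the override keeps reconstruction exact; storing only the differing fields would be smaller but needs extra care for fields that are removed rather than changed.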

0 replies
