How to keep media files for dictionary (audio, images, SVGs, video) #6
Comments
I'm not sure that this is a good idea. Many dictionary formats explicitly separate the main content from the (media) resources (StarDict, DSL+zip in GoldenDict, MDX/MDD in MDict). One of the main reasons is that resources can be huge, taking gigabytes of space. On mobile devices, where size is critical, users are free not to copy the media resources, which still gives them a working dictionary, albeit without the media. Also, some users might decide not to download the huge media files at all and take only the main content. Personally, I'm also inclined to keep the text content and the binary data in separate files; that would give better flexibility for everybody. Just imagine editing an XML file that is 4 GB in size! :)
This is definitely very reasonable: I was inclined to the same opinion. But what do you think about storing the icon and the cover image of the dictionary in the main file? They cannot be as big as you describe, so this wouldn't take up much space. Also, do you think that optionally storing meta_data in a separate file might be a good idea?
Hmm, actually I think that storing both the metadata and the icon inside the main content file is the proper behavior. There is no need to keep many different files; this confuses users. Two files (the main content file and the additional media resources file) seem to be the most appropriate approach. When all the main data is in a single file, it is easier to parse and transfer dictionaries. As for external files with metadata, this can be done outside of the specification. For example, in GoldenDict we are considering introducing such format-independent files so that users can adjust the dictionary name, provide custom icons, etc., without modifying the original file.
I am very grateful for your feedback and opinion, and I agree with you. But do you think it is possible to involve other GoldenDict community members in deciding these important details, so that the best solution is reached? Speaking further on the topic, I was wondering whether packing both the dictionary and the media files in a simple *.zip archive would be good practice. Some dictionary files (uncompressed articles) may take up dozens of megabytes, and that is with no media files involved. So reading the descriptions from a series of dictionaries would imply unpacking gigabytes of data. So do you think we need to store dicts in archives or not? UPD. This unpacking might also matter because when GoldenDict indexes the dictionary files it remembers the exact file offset of each word-article, and I'm not sure that is possible with compressed files.
You could always summon them via @goldendict/developers, if that works from external repository, probably not....
Nope, it won't work. We need offset-based access to the main content, and zip doesn't provide that. For that we use dictzip, which does. In short, the main content (the dictionary itself) can/should be compressed with dictzip, and the media resources (images, audio, video) can/should be compressed with regular zip (but one needs to be careful about the file name encoding in such a zip file).
With dictzip this is not a problem. GoldenDict already handles, e.g., dsl.dz (dictzip-compressed DSL files) with no issues.
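To illustrate why chunked compression permits offset-based access while a single compressed stream does not, here is a minimal sketch. It is not the dictzip format itself (dictzip keeps its chunk table inside a gzip header); it only shows the underlying idea of compressing fixed-size chunks separately and keeping an index of them. The names `compress_chunked`, `read_range`, and the chunk size are assumptions made for this illustration.

```python
import zlib

CHUNK_SIZE = 64 * 1024  # illustrative chunk size, not taken from any spec


def compress_chunked(data, chunk_size=CHUNK_SIZE):
    """Compress data as independent chunks; return (blob, index).

    index[i] is the (offset, length) of compressed chunk i inside blob.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), chunk_size):
        packed = zlib.compress(data[start:start + chunk_size])
        index.append((len(blob), len(packed)))
        blob += packed
    return bytes(blob), index


def read_range(blob, index, offset, size, chunk_size=CHUNK_SIZE):
    """Return data[offset:offset + size], decompressing only the chunks needed."""
    if size <= 0:
        return b""
    first = offset // chunk_size
    last = (offset + size - 1) // chunk_size
    parts = []
    for i in range(first, min(last, len(index) - 1) + 1):
        pos, length = index[i]
        parts.append(zlib.decompress(blob[pos:pos + length]))
    joined = b"".join(parts)
    skip = offset - first * chunk_size  # position of offset inside the first chunk
    return joined[skip:skip + size]
```

An indexer can then keep storing uncompressed article offsets, as it does for plain files, and a lookup touches only one or two chunks instead of the whole dictionary.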
I'd say that the ability to compress an XDXF dictionary (via dictzip) is a matter external to the XDXF specification. Some tools might decide to handle such compressed dictionaries; others might prefer other means or only handle the uncompressed data.
Wow, thank you so much for telling me about this dictzip software: I was wondering why those files have the *.dz extension. PS. Please let's also discuss the tables/grammar issue #5
I was also thinking: since some people would want to use xdxf files without *.gz, we would need to have our file CRC32-checksummed, but we cannot checksum the file if the checksum itself must be in the meta_info section before we start computing it. Haha =) Maybe *.gz should be made obligatory?
@soshial, I'm not sure that putting the CRC32/MD5 checksums inside the XDXF file is a good idea. You see, every time a user modifies such an XDXF file, he/she must somehow recalculate the checksum, which is annoying and inconvenient and requires external tools. Personally, I don't think we need any checksumming at all for plain old text files. There are special tools to calculate and verify checksums; there is no need to put them into the dictionary file directly. Mandatory compression is also not as flexible as I'd like. Many users do modify their dictionaries from time to time, correcting typos, adding new entries, etc. Extracting the dictionary, modifying it, and re-compressing it is just too much extra work for such use cases. Consider GoldenDict. Even while it is open, a user can open an XDXF file, modify it, and then press Ctrl+F5 to rescan the dictionaries, which propagates the changes immediately. No need to compress/decompress anything, no need to calculate checksums: a very fast and convenient way. In fact, users could even provide an editor command line and start editing the files right from the context menu, in simple text editors.
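As a sketch of the "checksums live outside the file" approach, the snippet below computes a CRC32 with Python's standard `zlib` module and stores it in a sidecar file next to the dictionary. The `.crc32` sidecar naming and the helper names are assumptions made for this example; nothing in XDXF or GoldenDict defines them.

```python
import zlib
from pathlib import Path


def crc32_of_file(path, block_size=1 << 20):
    """Return the CRC32 of a file as an 8-digit hex string, reading in blocks."""
    crc = 0
    with open(path, "rb") as fh:
        while block := fh.read(block_size):
            crc = zlib.crc32(block, crc)
    return f"{crc:08x}"


def write_sidecar(path):
    """Store the checksum next to the dictionary (hypothetical '.crc32' sidecar)."""
    sidecar = Path(str(path) + ".crc32")
    sidecar.write_text(crc32_of_file(path) + "\n")
    return sidecar


def verify(path):
    """Compare the file's current checksum against the sidecar file."""
    sidecar = Path(str(path) + ".crc32")
    return sidecar.read_text().strip() == crc32_of_file(path)
```

Because the checksum is not part of the checksummed file, there is no chicken-and-egg problem, and a user who edits the dictionary by hand can simply ignore or regenerate the sidecar.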
Very good point, thank you.
I like treating XDXF artifacts just like Open/LibreOffice documents or Java JAR/WAR files, i.e. conceptually they're "a directory tree containing at least an .xdxf file, with one or more media files". Whether these are:
is a "deployment detail"; tools should be able to access any of these uniformly (just like in Java, where the program doesn't care where or how you put a dependency class, as long as it's available on the classpath).
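A minimal sketch of such uniform access, assuming just two possible deployments chosen here for illustration (a plain directory and a zip archive). The function name `read_resource` and the file names used with it are hypothetical.

```python
import zipfile
from pathlib import Path


def read_resource(container, name):
    """Read one resource from a dictionary 'tree', whatever its packaging.

    container may be a directory or a zip archive; name is the
    resource's path relative to the root of the tree.
    """
    container = Path(container)
    if container.is_dir():
        return (container / name).read_bytes()
    with zipfile.ZipFile(container) as zf:
        return zf.read(name)
```

Calling code asks for `read_resource(path, "dict.xdxf")` or `read_resource(path, "media/page1.png")` and never needs to know how the dictionary was shipped.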
@Tvangeste isn't it reasonable to put media into dictzip as well, since most images and sounds can fit into one or several 64 KB blocks? This way media files will be randomly accessible too.
Real dictionary observation: I have a dictionary made of page scans and keys referencing these pages. One page normally contains 3-5 articles, so one image is referenced by several articles. The Slob format saves images directly into the dictionary file and allows referencing images as external files: The same dictionary encoded into StarDict format with images embedded (base64) into the articles, as Embedding images directly makes the file 3.7 times bigger, because the same images are repeated several times. Comparing different formats is perhaps not entirely fair, but in fact they store data alike. So, think about centralized storage of images and inter-XML links (like abbreviations?). P.S. I personally support the idea of several files (dictionary, media, css, js) compressed into a zip (or similar) archive. It would make replacing or editing images easier, with no programming needed. P.P.S. Having two files, one for the dictionary and another for media, seems convenient. But from my experience supporting MDict (two files: mdx + mdd), I can say users ask me repeatedly "why are there two files" and "which one should I download". So, "one file to rule them all" is better.
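The size effect of embedding can be estimated with simple arithmetic: base64 turns every 3 bytes into 4 characters, and an embedded image is repeated once per referencing article. The sketch below uses made-up numbers (a 300,000-byte scan referenced by 3 articles), not measurements from the dictionary described above.

```python
import base64


def base64_len(n):
    """Length of the base64 encoding (with padding) of n bytes."""
    return 4 * ((n + 2) // 3)


def embedding_ratio(image_size, references):
    """Bytes used when embedding per article, relative to storing the image once."""
    embedded = references * base64_len(image_size)
    return embedded / image_size


# Hypothetical page scan: 300,000 bytes, referenced by 3 articles.
ratio = embedding_ratio(300_000, 3)
print(ratio)
```

With these assumed numbers the embedded variant needs four times the space of a single shared copy: one third more from base64 itself, multiplied by the number of repetitions.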
The Open Container Format (used in ePub) is a mature (v3.2) W3C standard, which is very similar to the OpenDocument container format, the Java Archive (JAR) format, and many others. It describes a ZIP archive, where the very first file in the archive contains the media type (for example There must be a META-INF directory, which contains the metadata that the file format needs. The rest is specified for the requirements of ePub documents. Other familiar containers are the various Java archive containers (JAR, WAR, etc.). I would base XDXF files strictly on this format (the spec has already been written and tested, so there is no need to brew a new one, and that makes it easy for users), but configure the file and directory names where needed.
In addition, I would specify a flat-file XDXF for those dictionaries that do not need any assets and can be transported safely as plain XML. Assets would always be linked. And I would adopt XLink for any linking in XDXF.
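A rough sketch of what an OCF-style XDXF container could look like, built with Python's standard `zipfile` module. As in ePub, the `mimetype` entry is written first and stored uncompressed, and a `META-INF/container.xml` points at the main file. The media type `application/x-xdxf+zip`, the `dict.xdxf` and `media/` names, and the simplified `container.xml` content are all assumptions made for this example; no such layout is defined by XDXF.

```python
import io
import zipfile

# Hypothetical media type; nothing like it is registered for XDXF.
MEDIA_TYPE = "application/x-xdxf+zip"

# Simplified pointer to the main file, loosely modelled on ePub's container.xml.
CONTAINER_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<container>\n"
    '  <rootfiles><rootfile full-path="dict.xdxf"/></rootfiles>\n'
    "</container>\n"
)


def build_container(xdxf_xml, media):
    """Return the bytes of a zip container holding the dictionary and its media.

    xdxf_xml is the dictionary as bytes; media maps file names to bytes.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # The media type goes first and uncompressed, so tools can sniff it.
        zf.writestr("mimetype", MEDIA_TYPE, compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("dict.xdxf", xdxf_xml)
        for name, payload in media.items():
            zf.writestr("media/" + name, payload)
    return buf.getvalue()
```

Writing entry names as `str` lets `zipfile` mark non-ASCII names as UTF-8 in the archive, which addresses part of the file name encoding concern raised earlier in the thread.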
This is an important question. There is a way to keep the whole dictionary in one file: embed all media in the XDXF XML with base64 encoding. Do you think this is reasonable?