Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve subtitle filename rename #23

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft

Conversation

berkanteber
Copy link
Member

Closes #22.

@berkanteber
Copy link
Member Author

From issue:

We have 3 types of subtitles:

  • Folder subtitles: We look at current directory for a .srt file. So, name always have extension. We don't detect language for these, always unknown.
  • File subtitles: We extract these from file and save as ...-{lang}.srt. So, name always have extension and language code. {lang} here is how it is tagged in MKV, so may be improper (seems < 0.2%).
  • OpenSubtitles subtitles: These always have .srt extension. Name can include language code (depends how it's uploaded), but we have language code separately.

What are your thought on when we should append language code?

I think:

  • In cases we have language we append language. It's simple and shown name is always the actual subtitle name (minus extension). If name has right/wrong language in it, it stays in the name.
  • For folder subtitles we append und except name has explicit language in form {name}.{lang}.srt. Identifying the language wrong is IMO worse than saying Unknown. I saw this with a HI (hearing impaired) subtitle identified as Hindi. It also matches the behavior of our other apps (currently always marked Unknown without exception).
  • In all cases, make the filename {name}.{lang}.srt.

Another note: Kodi also removes common prefix with video name.

@berkanteber
Copy link
Member Author

@joelpurra What do you think of this?

@joelpurra
Copy link

joelpurra commented Oct 11, 2024

@berkanteber: have some thoughts. Won't have time to neither test this PR nor review closer until earliest next week; I'll reply further then. Brain dump:

  1. I assume that Kodi already can/will parse the subtitle filename to guess the language.
    • putio-kodi adding the language code to the filename when it is parsed from the filename is unnecessary. Kodi probably does general parsing better.
    • Externally acquired information, perhaps tagging from opensubtitles, might help though. (I already spotted opensubtitle language tagging errors though.)
  2. The subtitle filenames sometimes seem cut/shortened/ellipsised (...).
    • This could be due to the full file path being too long.
    • Shortened names could lose information, in particular at the end.
    • I forget if it is actively shortened by putio-kodi, or somewhere else?
  3. Kodi can parse and extract subtitles directly from video files.
    • Unsure if this requires a setting.
    • Is duplicating file-internal subtitles confusing Kodi?
  4. Kodi makes use of both subtitle language and their subtitle variants.
    • There are useful settings to auto-enable CC/HI/SDH, but unsure exactly how they are applied for putio-kodi.
    • Closed Caption (CC)
    • Hearing-impaired (HI) -- sometimes gets parsed as Hindi language though.
    • Subtitles for Deaf and Hard of hearing (SDH)
    • Others variants?
    • If "manipulating" the filename/language, does putio also need to perform such steps to not lose valuable variant information?
  5. For undefined/unknown language, do not add anything and leave filename parsing to Kodi.

At this point I am actually unsure if manipulating subtitle filenames helps, or disrupts, Kodi's subtitle handling. Perhaps it's easiest to just leave all subtitle filename (assuming that the filename is not shortened) parsing to Kodi? Would have to investigate several test cases first.

@berkanteber
Copy link
Member Author

berkanteber commented Oct 11, 2024

Won't have time to neither test this PR nor review closer until earliest next week; I'll reply further then.

👍

I assume that Kodi already can/will parse the subtitle filename to guess the language.

  • putio-kodi adding the language code to the filename when it is parsed from the filename is unnecessary. Kodi probably does general parsing better.

  • Externally acquired information, perhaps tagging from opensubtitles, might help though. (I already spotted opensubtitle language tagging errors though.)

It does, but it does poorly (requires a certain way). This all started by me seeing Hindi for Hearing Impaired the other day and thinking we could do better. Can Kodi do better than us? For folder subtitles, maybe. For file subtitles and OpenSubtitles subtitles, probably no.

The subtitle filenames sometimes seem cut/shortened/ellipsised (...).

  • This could be due to the full file path being too long.

  • Shortened names could lose information, in particular at the end.

  • I forget if it is actively shortened by putio-kodi, or somewhere else?

I don't know about Kodi but we also sometimes shorten them. We try to do it nice (preserve extension & added language in name).

Kodi can parse and extract subtitles directly from video files.

  • Unsure if this requires a setting.

  • Is duplicating file-internal subtitles confusing Kodi?

I don't know about this. I assume external always comes first, so duplicates wluld be at the bottom. But if Kodi can handle it, we could skip mkv source from returned list. It would make it strange though if file subtitles would be at the bottom. Haven't tested, so all are pure hypotheticals.

Kodi makes use of both subtitle language and their subtitle variants.

  • There are useful settings to auto-enable CC/HI/SDH, but unsure exactly how they are applied for putio-kodi.

  • Closed Caption (CC)

  • Hearing-impaired (HI) -- sometimes gets parsed as Hindi language though.

  • Subtitles for Deaf and Hard of hearing (SDH)

  • Others variants?

  • If "manipulating" the filename/language, does putio also need to perform such steps to not lose valuable variant information?

We don't parse or keep these properties. If we decide to let Kodi handle file subtitles (point above), maybe these can be deduced. Parsing these ourselves seem to complicated.

For undefined/unknown language, do not add anything and leave filename parsing to Kodi.

At this point I am actually unsure if manipulating subtitle filenames helps, or disrupts, Kodi's subtitle handling. Perhaps it's easiest to just leave all subtitle filename (assuming that the filename is not shortened) parsing to Kodi? Would have to investigate several test cases first.

I think saying unknown is better than sometimes identifying in wrong. I could trust Kodi extract from file, but parsing filename seems error prone.

Copy link

@joelpurra joelpurra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. Several notes in the code review overlap the below comments. (Did not yet run the code, only read it.)

I assume that Kodi already can/will parse the subtitle filename to guess the language.

  • putio-kodi adding the language code to the filename when it is parsed from the filename is unnecessary. Kodi probably does general parsing better.
  • Externally acquired information, perhaps tagging from opensubtitles, might help though. (I already spotted opensubtitle language tagging errors though.)

It does, but it does poorly (requires a certain way). This all started by me seeing Hindi for Hearing Impaired the other day and thinking we could do better. Can Kodi do better than us? For folder subtitles, maybe. For file subtitles and OpenSubtitles subtitles, probably no.

Kodi does parse the subtitle filenames provided by putio-kodi, as made evident by the parsed English - Video (External) format in kodi's subtitle picker (for parseable filename formats). (This should have been obvious to me earlier, but didn't think of the full language name prefix as having been parsed already.)

Parsing seemingly works the same way it would if .srt files were found on the local disk (where putio-kodi does place them) next to the video file. While I can agree that kodi is not the best at parsing out languages, the format is at the very least established both by kodi and (often) matches filenames provided opensubtitles (where uploaders choose the filename).

The language extraction logic in putio-kodi for folder subtitles seems to mirror that of kodi. The putio-kodi folder subtitle filename parsing merely duplicates the last part of the filename as the last-last part:

- Video.eng.srt
+ Video.eng.eng.srt

That does not add new information nor help kodi; I would remove this logic.

The subtitle filenames sometimes seem cut/shortened/ellipsised (...).

  • This could be due to the full file path being too long.
  • Shortened names could lose information, in particular at the end.
  • I forget if it is actively shortened by putio-kodi, or somewhere else?

I don't know about Kodi but we also sometimes shorten them. We try to do it nice (preserve extension & added language in name).

There is display name shortening logic in putio-kodi, but don't think it applies to subtitle filenames.

While there may be api filename length limitations, I would attempt to keep the full subtitle for kodi to parse. Tried to look at the special:// protocol path handling to see if it explicitly limits path length, but didn't see anything obvious -- which doesn't mean that there aren't (platform-specific) checks and limitations.

Kodi can parse and extract subtitles directly from video files.

  • Unsure if this requires a setting.
  • Is duplicating file-internal subtitles confusing Kodi?

I don't know about this. I assume external always comes first, so duplicates wluld be at the bottom. But if Kodi can handle it, we could skip mkv source from returned list. It would make it strange though if file subtitles would be at the bottom. Haven't tested, so all are pure hypotheticals.

Kodi (seemingly) correctly parses/extracts from video files (mkv, and presumably mp4 and other container formats). These "internal" (as opposed to .srt files which are labeled "external") subtitles indeed end up at the bottom of the list of subtitles. The language code seems to be properly extracted, as are flags such as "default" and "forced" (unsure about CC/HI/SDH). They are displayed as English [default,forced] (N/M) where N and M is the subtitle index/count.

  • There are useful settings to auto-enable CC/HI/SDH, but unsure exactly how they are applied for putio-kodi.
    ...
  • If "manipulating" the filename/language, does putio also need to perform such steps to not lose valuable variant information?

We don't parse or keep these properties. If we decide to let Kodi handle file subtitles (point above), maybe these can be deduced. Parsing these ourselves seem to complicated.

Kodi will parse the provided subtitle filename no matter what, so perhaps it doesn't matter whether putio/putio-kodi handles CC/HI/SDH separately.

For undefined/unknown language, do not add anything and leave filename parsing to Kodi.
...

I think saying unknown is better than sometimes identifying in wrong. I could trust Kodi extract from file, but parsing filename seems error prone.

Explicitly adding unknown/undefined/undetermined short-circuits kodi's language parsing. While kodi may also fail to determine the language by looking at the subtitle filename, adding und tells it not to try (because und becomes the "language"). Adding und means that the parsed language will always -- not sometimes -- be identified wrongly. (You can argue whether or not "Hindi" is "more wrong" than "Undetermined", but since kodi does not incorrectly guess "Hindi" for each and every subtitle it will be "more wrong" to add und.)

subtitle_language = subtitle['language_code'] or ''
if len(subtitle_language) == 3: # sometimes this returns wrong
subtitle_name = subtitle_name + ' - ' + subtitle_language
if subtitle_name.endswith('.srt'):
Copy link

@joelpurra joelpurra Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded .srt suffix ignores other possible subtitle file formats and extensions. Filtering for .srt extensions may be performed upstream (putio, opensubtitles), meaning that there will only "ever" be .srt files, but this way is not particularly future-proof.

An improvement would be to match against a list of known subtitle file extension, and saving the match to be able to add the file extension back later.

subtitle_name = subtitle_name[:-4]

subtitle_lang = subtitle['language_code'] or ''
if len(subtitle_lang) not in (2, 3): # sometimes this returns wrong

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sanity check is performed before the last place subtitle_lang is assigned a value. While further subtitle_lang assignments may have other checks, this sanity check would make more sense placed last.

if len(subtitle_lang) not in (2, 3): # sometimes this returns wrong
subtitle_lang = 'und'

# language is unknown for folder subtitles, use last part if name ends with `.` + 2-3 chars
Copy link

@joelpurra joelpurra Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filename format is not improved by this parsing:

- Video.eng.srt
+ Video.eng.eng.srt

No new information is added, in particular as it already mirrors kodi's own subtitle parsing (assuming that the language code is well-known).


subtitle_lang = subtitle['language_code'] or ''
if len(subtitle_lang) not in (2, 3): # sometimes this returns wrong
subtitle_lang = 'und'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having putio-kodi explicitly inserting the und (undetermined) language code when putio doesn't know the language code (in particular for folder subtitles) means that kodi's own subtitle parsing is "disabled" because und becomes the "language". Since the subtitle always has a language (even if it's not in the filename, metadata, etcetera) this means that und is always the wrong language code.

subtitle_lang = parts[-1]

# add language even if it's also in name, this way shown name will always be the original name
subtitle_fullname = subtitle_name + '.' + subtitle_lang + '.srt'
Copy link

@joelpurra joelpurra Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .srt file extension should mirror the file extension which was removed in the first step. Since both checks have .srt hardcoded, the .srt file extension is assumed to be the only possible case. Do notice that the first .srt check removing .srt only if it matches the file extension, but this line always adds the .srt file extension. This means that another file extension (such as .sub, .mks) will be incorrectly "extended" with .srt and have a file contents versus extension mismatch.

(There is also the possible case of having an invalid subtitle type file extension, such as a mistaken language code .eng suffix, but that would be covered by having a list of known subtitle file extensions in the first step and adding it back here (or not, for mismatched file extension).)

If removing the und fallback value above, make sure to not extend the filename with an empty subtitle_lang string (in addition to the . separator character); instead leave the filename as-is.

# language is unknown for folder subtitles, use last part if name ends with `.` + 2-3 chars
if subtitle['source'] == 'folder':
parts = subtitle_name.split('.')
if len(parts) > 1 and len(parts[-1]) in (2, 3):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a check for a valid language code; any random 2-3 characters will become the "language code". Pretty sure this will yield false matches, since it comes down to the somewhat random (video and) subtitle filenaming format(s) people use.

if subtitle_name.endswith('.srt'):
subtitle_name = subtitle_name[:-4]

subtitle_lang = subtitle['language_code'] or ''

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the putio api response for subtitles, it seems language_code will be provided as und for folder subtitles already on the api level. As commented elsewhere, I find that und reduces usability. I would add a check to also exclude und from "valid"/"useful" subtitles, such that language_code cannot be neither empty nor und.

subtitle_lang = 'und'

# language is unknown for folder subtitles, use last part if name ends with `.` + 2-3 chars
if subtitle['source'] == 'folder':

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that even if a subtitle with source folder has a valid language_code which is not und, it will be (possibly) overwritten with a locally parsed "language code" here. While the putio api may only return und for folder-sourced subtitles today, this is not a future-proof assumption.

An improvement would be to only perform filename parsing if the language_code is empty or und.

@berkanteber
Copy link
Member Author

So, just writing here instead of replying to code comments one by one.

  • Ellipsis: This is just how we store too long subtitles, we max out at 255 chars. I don't think this would apply to many subtitles. We can consider changing this, but it's not related to Kodi addon, so let's skip this.
  • file subtitles:
    • Showing these as external is not ideal.
    • If Kodi can figure out other properties letting it handling this is better. I don't know if Kodi can recognize all (I just tried once, and it was fine).
    • I don't like to changing it (letting Kodi handle it) for everyone. For example, being below OpenSubtitles is IMO worse than some duplicates below. Also, API returns only the user's languages, so it's a shorter list.
    • I think we can put this as a setting: Hide File Subtitles (w/ explanation as subtext). I think this could be an advanced setting, since it might still confuse some people.
    • These all include language.
      • Since we use {name}-{lang}.srt while extracting and saving, these would become {name}-{lang}.{lang}.srt but display as {language} - {name} {lang}. This is I think fine, because (1) futureproofing if we change how we save them, (2) can see what was the original name. I weakly prefer adding the language code, but doing nothing is also fine.
      • One comment was about len(lang_code) in (2, 3). I know this is not 100% correct but it would still eliminate not all but some wrong values. We could also just remove this check because the ratio of these was very very small, but I prefer keeping it.
  • folder subtitles:
    • I prefer Undetermined to wrong ones. But I can accept not doing anything and let Kodi try to recognize it. This is the same as loading a file, so we can blame Kodi or if we really want correctness rename files in put.io.
  • OpenSubtitle subtitles:
    • These all include language, and mostly correct. I think we should add this always.
    • If we let hiding file subtitles, we should let hide these as well. These above the file subtitles are not good IMO. It is still nice to be able to hide these regardless of file subtitles.
  • Non-SRT:
    • We can ask for a specific format (SRT or WebVTT) and it's documented, so unlikely to change. We could just explicitly ask for SRT instead of relying on default. We could additionally ignore any non-SRT, a bit useless but just in case.

So, I say:

  • New format for:
    • OpenSubtitles: 100%.
    • File subtitles: I'm 60-40% in favor.
    • Folder subtitles: I'm 50-50.
    • No settings for these. Don't need to complicate things.
  • New settings:
    • Hide File Subtitles, Hide OpenSubtitles Subtitles. Maybe add Hide Folder Subtitles. Opt-in to hide, so that if we add another source, they will be shown.
    • I prefer this to be advanced level. (Levels are: beginner, standard, advanced, expert.)

@berkanteber
Copy link
Member Author

Note: There is a new format for settings which allows help text. We could explain the reason for file subtitles there.

@joelpurra
Copy link

@berkanteber: quick notes; I'll come back to this when I can.

  1. I don't think settings to enable/disable specific sources of subtitles are necessary. (I hope it didn't sound like it in my comments?) Not sure if anyone would turn off any of the sources -- more subtitle options is presumably better?
  2. It's totally ok that the "internal" file subtitles parsed by kodi end up at the bottom of the subtitle picker; that's probably just the internal kodi subtitle processing order. It also doesn't matter (to me) that .srt show up as "external" -- seeing it in the subtitle picker doesn't mean anything particular to me.

Since I had a bunch of code comments, I'm considering creating a separate draft pull request (based on your code) to show rather than tell. Again, I didn't have time to run the code yet -- sorry about that.

@berkanteber
Copy link
Member Author

berkanteber commented Oct 30, 2024

  1. I don't think settings to enable/disable specific sources of subtitles are necessary. (I hope it didn't sound like it in my comments?) Not sure if anyone would turn off any of the sources -- more subtitle options is presumably better?

It didn't sound like that. But thinking about the fact that Kodi already recognizes file subtitles, I thought we could use that. But there are some negatives as well (placement & all vs user languages), so I thought options would be better.

If not options, it would be better to keep file subtitles IMO. But it wouldn't complicate too much (just filter by source), and if we make the option only for advanced/expert, it wouldn't confuse people. (I have no idea if people use advanced settings or not, I just assume not.)

  1. It's totally ok that the "internal" file subtitles parsed by kodi end up at the bottom of the subtitle picker; that's probably just the internal kodi subtitle processing order. It also doesn't matter (to me) that .srt show up as "external" -- seeing it in the subtitle picker doesn't mean anything particular to me.

This is the correct order generally. In all places I can think of, external subtitles comes first. External = folder generally. The issue is we use external for all subtitles, so order becomes: (1) folder, (2) file (user language), (3) OpenSubtitles, (4) file (by Kodi, all language).

If we decide to keep all sources, this is OK, But just hiding 2 and using 4 makes the order bad.

Since I had a bunch of code comments, I'm considering creating a separate draft pull request (based on your code) to show rather than tell. Again, I didn't have time to run the code yet -- sorry about that.

No problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Subtitle filenames have language code appended after .srt file extension
2 participants