Fix for handling non-Latin characters #7

Open
wants to merge 4 commits into
base: master


@qurbat qurbat commented Apr 18, 2022

This change introduces support for search results containing non-Latin characters as part of the URL or description.

This is done by passing `final_string` to `html.unescape()` in the last print call, instead of printing it directly.
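
A minimal sketch of the effect, assuming `final_string` holds result text in which non-Latin characters arrive as HTML character references. The sample value below is made up for illustration and is not taken from degoogle.py:

```python
import html

# Hypothetical sample: Devanagari and Greek text encoded as numeric and
# named character references, roughly as it might appear in raw result HTML.
final_string = "&#2361;&#2367;&#2344;&#2381;&#2342;&#2368; &amp; &alpha;&beta;"

# Printing the string directly shows the raw references.
print(final_string)

# html.unescape() turns both numeric and named references into characters.
decoded = html.unescape(final_string)
print(decoded)
```

`html.unescape()` is part of the Python standard library, so no extra dependency is needed.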

qurbat added 2 commits April 19, 2022 00:28
This change introduces support for text containing non-Latin characters (Hindi, Urdu, Greek, for example).

This is done by printing `html.unescape(final_string)` instead of `final_string`.
@qurbat
Author

qurbat commented Apr 19, 2022

@deepseagirl could you merge this after review?

@qurbat qurbat changed the title from "Convert HTML entities" to "Fix for handling non-Latin characters" May 26, 2022
@qurbat
Author

qurbat commented May 26, 2022

@deepseagirl hi, just sending a ping on this. thanks!

@qurbat
Author

qurbat commented Jun 27, 2022

@deepseagirl Can we close this?

@deepseagirl
Owner

hi, thanks. this is a good improvement :)
i moved the unescape to only occur on the result descriptions directly
with a flag to toggle the behavior on/off

new default will be to decode character references:

$ python3 degoogle.py "intitle:⟿ inurl:⟿"
-- 9 results --

TranslingualEdit - Wiktionary
https://en.wiktionary.org/wiki/%E2%9F%BF

Talk:⟿ - Wiktionary
https://en.wiktionary.org/wiki/Talk:%E2%9F%BF

flag to turn decoding off:

$ python3 degoogle.py -d "intitle:⟿ inurl:⟿"
-- 9 results --

TranslingualEdit - Wiktionary
https://en.wiktionary.org/wiki/%E2%9F%BF

Talk:⟿ - Wiktionary
https://en.wiktionary.org/wiki/Talk:%E2%9F%BF

the html.unescape python doc links to this list of named character references which seemed handy.
i didn't realize char references were such an in depth thing until now. if you're interested here is that link
https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references
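
A rough sketch of how such a toggle could be wired up, for illustration only. The `-d` flag comes from the example above; the long option name, the `decode` attribute, and the `format_description` helper are assumptions and may not match the actual degoogle.py code:

```python
import argparse
import html


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="sketch of a decode toggle")
    parser.add_argument("query", help="search query")
    # "-d" is taken from the example; "--no-decode" and dest="decode" are
    # hypothetical names. store_false makes decoding the default.
    parser.add_argument("-d", "--no-decode", dest="decode",
                        action="store_false",
                        help="leave HTML character references undecoded")
    return parser


def format_description(description: str, decode: bool = True) -> str:
    # Decode character references only when the toggle is on.
    return html.unescape(description) if decode else description


# Hypothetical result description containing a numeric character reference.
sample = "Talk:&#10239; - Wiktionary"

default_args = build_parser().parse_args(["intitle:x"])
print(format_description(sample, default_args.decode))  # decoded

raw_args = build_parser().parse_args(["-d", "intitle:x"])
print(format_description(sample, raw_args.decode))      # left as-is
```

Keeping the decode step in one small helper means the flag only has to be checked in a single place.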

@deepseagirl
Owner

i'll finalize this when i have a few more mins. should be soon now that it's this far along. thanks again

@qurbat
Author

qurbat commented Jul 12, 2022

@deepseagirl no worries, and I realize you were not able to access a computer earlier, so it is no problem. the new changes look great! thank you & tc =)

@qurbat
Author

qurbat commented Oct 8, 2022

@deepseagirl can we close?
