Fix for handling non-Latin characters #7

qurbat · 2022-04-18T19:02:00Z

This change adds support for search results whose URL or description contains non-Latin characters (Hindi, Urdu, Greek, for example).

This is done by printing `html.unescape(final_string)` instead of `final_string` at the last print call.
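For readers unfamiliar with `html.unescape()`, here is a minimal sketch of what it does to a string containing character references. The helper name `decode_references` and the sample value of `final_string` are made up for illustration and are not taken from degoogle.py.

```python
import html


def decode_references(text: str) -> str:
    """Replace HTML character references in text with the characters they denote."""
    # html.unescape handles decimal (&#10239;), hexadecimal (&#x3B1;)
    # and named (&beta;) references; other text is returned unchanged.
    return html.unescape(text)


# Hypothetical sample value; in the PR this would be the assembled output string.
final_string = "Talk:&#10239; - Wiktionary"
print(decode_references(final_string))
```

Strings without any character references pass through unchanged, so calling it on plain ASCII output should be harmless.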

qurbat · 2022-04-19T21:57:55Z

@deepseagirl could you merge this after review?

qurbat · 2022-05-26T23:14:58Z

@deepseagirl hi, just sending a ping on this. thanks!

qurbat · 2022-06-27T18:11:31Z

@deepseagirl Can we close this?

deepseagirl · 2022-07-12T19:24:41Z

hi, thanks. this is a good improvement :)
i moved the unescape to only occur on the result descriptions directly
with a flag to toggle the behavior on/off

new default will be to decode character references:

```
$ python3 degoogle.py "intitle:⟿ inurl:⟿"
-- 9 results --

TranslingualEdit - Wiktionary
https://en.wiktionary.org/wiki/%E2%9F%BF

Talk:⟿ - Wiktionary
https://en.wiktionary.org/wiki/Talk:%E2%9F%BF
```

flag to turn decoding off:

```
$ python3 degoogle.py -d "intitle:⟿ inurl:⟿"
-- 9 results --

TranslingualEdit - Wiktionary
https://en.wiktionary.org/wiki/%E2%9F%BF

Talk:&#10239; - Wiktionary
https://en.wiktionary.org/wiki/Talk:%E2%9F%BF
```
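A rough sketch of how a toggle like this could be wired up, assuming `argparse` is used for option parsing. Only the `-d` flag itself comes from the output above; the `no_decode` destination name, the help text, and the `format_description` helper are assumptions for illustration and may not match the actual code in degoogle.py.

```python
import argparse
import html


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="sketch of a decoding toggle")
    parser.add_argument("query", help="search query")
    # -d is the flag shown above; dest and help text are hypothetical.
    parser.add_argument(
        "-d",
        dest="no_decode",
        action="store_true",
        help="leave character references in result descriptions undecoded",
    )
    return parser


def format_description(desc: str, no_decode: bool) -> str:
    """Decode character references by default; return desc untouched when disabled."""
    return desc if no_decode else html.unescape(desc)


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["-d", "intitle:x"])
print(format_description("Talk:&#10239; - Wiktionary", args.no_decode))
```

Only the description goes through `format_description` here, which mirrors the decision to decode result descriptions and leave the percent-encoded URLs as they are.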

the html.unescape python doc links to this list of named character references, which seemed handy.
i didn't realize char references were such an in-depth thing until now. if you're interested, here is that link:
https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references

deepseagirl · 2022-07-12T19:33:34Z

i'll finalize this when i have a few more mins. should be soon now that it's this far along. thanks again

qurbat · 2022-07-12T19:43:31Z

@deepseagirl no worries, and I realize you were not able to access a computer earlier, so it is no problem. the new changes look great! thank you & tc =)

qurbat · 2022-10-08T13:35:22Z

@deepseagirl can we close?

qurbat added 2 commits April 19, 2022 00:28

Convert HTML entities

faa0615

This change introduces support for text containing non-Latin characters (Hindi, Urdu, Greek, for example). This is done by printing `html.unescape(final_string)` instead of `final_string`.

Update degoogle.py

aec8f17

qurbat mentioned this pull request Apr 18, 2022

Introduce support for non-Latin characters #8

Open

qurbat changed the title ~~Convert HTML entities~~ Fix for handling non-Latin characters May 26, 2022

deepseagirl added 2 commits July 12, 2022 15:26

add decoding for special chars in result desc

1264719

document new decoding toggle flag

610397e
