Some ideas for very useful and helpful functions. #234
There exist tools for it already:
Most of the online tools are simply unusable because they do not support extracting entire domains with subdomains and long TLDs such as .stream.
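A minimal sketch of the kind of extraction being asked for here, using only Python's standard library: it pulls multi-label domain candidates (subdomains included, and any alphabetic TLD of two or more letters, so `.stream` works) out of arbitrary text. The regex and function name are illustrative assumptions, not PyFunceble code; a real tool would validate candidates against the Public Suffix List.

```python
import re

# Illustrative pattern only: one or more dot-separated labels followed
# by an alphabetic TLD of length >= 2, so long TLDs like .stream match.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}\b", re.IGNORECASE
)

def extract_domains(text):
    """Return the unique, sorted domain candidates found in `text`."""
    return sorted(set(m.group(0).lower() for m in DOMAIN_RE.finditer(text)))
```

Note that this will also match domain-shaped non-domains (file names, version strings), which is exactly the false-positive problem discussed further down in this thread.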
Sorry, no, the tool does not work.
It does work very well, but it extracts only domains, not URLs. Then there are several browser add-ons I use myself to extract domains/URLs from a webpage:
All this is not what I would need.
Wow. Are you sure you don't want to split the file into smaller chunks? Also, this is a good tool I use personally sometimes:
😆 🤣 Try Linux 😜 That said, there is a Python module that can do this, since you don't have (e)grep available out of the box.
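In that spirit, here is a rough sketch of how one might emulate `grep -Eo` for URLs with nothing but the standard library, streaming line by line so even a 2 GB file never has to fit in memory. The function name and URL pattern are my own assumptions for illustration:

```python
import re

# Rough equivalent of `grep -Eo 'https?://...'` for platforms without grep.
URL_RE = re.compile(r"""https?://[^\s"'<>]+""")

def grep_urls(path, out_path):
    """Write every URL found in the file at `path` to `out_path`,
    one per line, reading the input as a stream."""
    with open(path, encoding="utf-8", errors="ignore") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            for match in URL_RE.findall(line):
                dst.write(match + "\n")
```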
Yeah, I'm curious as well: how did he end up with a 2 GB text file...
I just can't imagine how a single 2 GB text file could ever be created under normal conditions.
Perhaps you were just joking; anyway, it seems he didn't like to use Linux:
I use Linux most of the time. The data are completely mixed files that were combined into one file. I think that with just a few changes to PyFunceble these functions should be easily possible. With PyFunceble it is already possible to read RAW files directly from the internet; if that function could be modified so that any file or, e.g., a website could be given as the source and only the URLs/domains extracted, it would be very useful and helpful. At the moment the function is simply limited to RAW files.
I have used Linux since 1999. By the way, Linux was the first system I used. I might use Linux until the day I can look at the radishes from below :D
As for domain extraction: as I said in #13 (comment), extracting domains with the Adblock Decoder in "Decode everything" mode will give too many useless false positives, which will clutter the output list and turn it into a garbage dump. There is a risk the same might happen with a 2 GB mixed file, even if it doesn't contain Adblock filter lists; if you are lucky it might not, but it depends on the content.

As for URL extraction: it can have false hits as well. The only solution seems to be to extract everything and leave all the false hits/garbage as a user's issue to deal with.
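To make the false-positive concern concrete: file names and version strings in a mixed file look exactly like domains to a generic pattern. A sketch of one possible mitigation, assuming a post-filter against a TLD allowlist (the tiny allowlist below is purely illustrative; a real filter would use the full IANA/Public Suffix list):

```python
import re

DOMAIN_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)
KNOWN_TLDS = {"com", "net", "org", "stream"}  # illustrative subset only

def filter_candidates(text):
    """Extract domain-shaped strings, then drop those whose last
    label is not a known TLD (e.g. 'jquery.min.js')."""
    hits = DOMAIN_RE.findall(text)
    return [h for h in hits if h.rsplit(".", 1)[-1].lower() in KNOWN_TLDS]
```

Even with such a filter, strings like `setup.com` in prose would survive, which is why the comment above concludes the leftover garbage ultimately lands on the user.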
Try this offline tool: https://www.softpedia.com/get/Office-tools/Other-Office-Tools/Web-Link-Extractor-Linas.shtml
I cannot use this tool; I use Linux as my system. I don't have a VM at the moment because I don't have any free space, and I can't buy a new SSD right now. Somewhere in my archive I also have self-programmed software in C# to extract all possible URLs. @funilrys What do you think about this idea? Would it be possible?
And what about Wine?
It can be done, but the result will contain many false hits/rubbish, and you will have to waste time by:
I know; I must have been tired, as I confused you with someone else 😃 😪 You are on Arch, I know...
Sounds like an integration of BeautifulSoup could come in handy!!! That said, I do understand why you (@ZeroDot1) would like to integrate it into @PyFunceble directly. This is not an objection, but thinking of the big picture: would it be handier to write this as an individual tool that can extract all URLs/domains from any source? What do the others think? @ZeroDot1 @keczuppp @mitchellkrogza @funilrys
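For the HTML case specifically, here is a minimal standard-library sketch of what a BeautifulSoup-style integration might do (collect every `href` from a page); BeautifulSoup itself would handle malformed markup more robustly, but the class and function names below are my own illustrative choices:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def links_from_html(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```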
Not true if you are using a proper code base
I saw it before, but I didn't mention it because it supports only HTML or XML, which is not what the OP requested; he requested extraction from any text.
1. I already mentioned this before (I wrote "might / can" instead of "will"):
2. You quoted my statement out of context. In my comment #234 (comment), my statement sits below a quote to which it refers, which means I was referring to the quote and not talking generally. The quote says: "ZeroDot1: completely mixed text", where "mixed" most likely means "random", which is the opposite of "spirillen: use a proper (prepared/custom) code base". Hence I can't agree with you saying "not true" in this case. But yeah, when not referring to the quote and talking generally, it is possible to avoid false hits if you cherry-pick the input content (which I already mentioned before).
Maybe it can be like the Adblock Decoder; it can exist in both forms: integrated into PyFunceble and standalone.
Hey @ZeroDot1, as I'm re-reading your suggestion, I'm sitting here thinking:
Related to:
The search "function" won't be my priority. Other tools should be able to handle it better. But some dedicated tools which proxies our internal decoders (like the adblock-decoder) may be provided in the future. Let's keep this open. |
By the way, the PyFunceble Web Worker project provides some endpoints for the decoding/conversion of inputs. It basically exposes the (internal) converters of PyFunceble behind a web server/API. I can't and don't want to host such a service (yet), but it can be a good alternative for some people ... I'm still ready to fix the issues reported there, though.
Add a URL/Domain extractor.
With this function it should be possible to extract all URLs/domains from any text and save them to a file so that they can be easily checked at a later time without significant time and effort.
Simply a useful function for blacklist developers.
Add a search function (yes, I know you can do that with Linux; it would just be very handy to be able to do everything with one program).
With the search function it should be possible to search a file for all URLs/domains containing a given string and save the matches to another file; it should also be possible to search for multiple keywords in one pass, separated by commas.
This function is also very helpful for blacklist developers.
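The requested search behaviour can be sketched in a few lines; this is a hypothetical illustration of the feature (names and signature are my own assumptions, not PyFunceble API): keep every line of the input file containing any of the comma-separated keywords and write the matches to a second file.

```python
def search_keywords(in_path, out_path, keywords):
    """Copy lines from `in_path` to `out_path` if they contain any of
    the comma-separated `keywords` (e.g. "ads, tracker")."""
    terms = [k.strip() for k in keywords.split(",") if k.strip()]
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if any(term in line for term in terms):
                dst.write(line)
```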