This will allow us to successfully pass the following spec:
    it 'knows what is not a domain 1' do
      skip "NOT IMPLEMENTED"
      text = "this is a sentence.and no domain."
      pt = PragmaticTokenizer::Tokenizer.new(text, remove_domains: true)
      expect(pt.tokenize).to eq(["this", "is", "a", "sentence", ".", "and", "no", "domain", "."])
    end
diasks2 changed the title from "Should all TLD domains be whitelisted?" to "Should all TLDs be whitelisted?" on Jan 20, 2016
The longer I think about it, the more downsides I see in using the complete list of TLDs. TLDs like .glass, .global, .google, .green etc. probably appear more often as the first word of a new sentence that is missing a space after the period (as in the spec above) than as part of an actual domain.
What if this list is saved as a constant (similar to abbreviations, stop words etc.), with the option of passing an array of TLDs, e.g. remove_domains: ['com', 'net', 'org'], that restricts matching to only those? Users could then define the 3-5 TLDs that most of their domains use, and keep pragmatic_tokenizer from identifying too many non-domains as domains.
Here is the current list: http://data.iana.org/TLD/tlds-alpha-by-domain.txt
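
To make the proposal concrete, here is a rough sketch of how a restricted TLD list could behave. This is not pragmatic_tokenizer's actual implementation: `DEFAULT_TLDS`, `domain_regex` and `strip_domains` are hypothetical names made up for illustration, and the constant is only a tiny stand-in for the full IANA list.

```ruby
# Hypothetical stand-in for a constant built from the full IANA TLD list.
DEFAULT_TLDS = %w[com net org glass global google green].freeze

# Builds a case-insensitive regex matching "label.tld" for the given TLDs only.
def domain_regex(tlds)
  alternatives = tlds.map { |tld| Regexp.escape(tld) }.join('|')
  /\b(?:[a-z0-9-]+\.)+(?:#{alternatives})\b/i
end

# remove_domains: true  -> match against DEFAULT_TLDS
# remove_domains: Array -> match only against the TLDs passed in
# remove_domains: false -> leave the text untouched
def strip_domains(text, remove_domains: true)
  return text unless remove_domains

  tlds = remove_domains.is_a?(Array) ? remove_domains : DEFAULT_TLDS
  text.gsub(domain_regex(tlds), '').squeeze(' ').strip
end
```

With the broad list, a sentence such as "She broke the glass.green paint covered it." loses "glass.green" because "green" counts as a TLD; with `remove_domains: %w[com net org]` the sentence is left alone, while "example.com" is still removed.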