not working on google.gr #113
OK, it's not working on any Google domain, what am I missing?

```scala
import com.snowplowanalytics.refererparser.Parser

val parser = new Parser()
val referer = parser.parse("https://www.google.gr", "http://www.example.com")
println(referer)
```

The above prints `{medium: search, source: Google, term: null}`, which is the expected result for the Java library. What is going wrong for you @Valve? Which language's version of the library are you using?
@fblundun oh sorry, I didn't know this had so many language versions. I'm using the Ruby version. Here is my output:

[output screenshot]
Assigning to @kreynolds, the maintainer for the Ruby library...
Just tested this on the PHP version and none of the Google domains I tried return the search terms. For example, https://www.google.com/?gws_rd=ssl#safe=off&q=testing only parses to medium: search, with no term.
@morrow95 Google doesn't provide keyword terms for searches done on HTTPS, which is now the vast majority of them: https://searchenginewatch.com/sew/news/2296351/goodbye-keyword-data-google-moves-entirely-to-secure-search
@yalisassoon - I've been aware of that change for some time, but in my use case I have the actual URLs, such as the one I listed in the earlier post. As you can see in my example the q= is given, so I would expect the parser to return 'testing' as the term(s). For something like ...
For what it is worth... if I remove the '#safe=off' from the URLs I mentioned above, the parser correctly returns the search terms. It would appear the '#' is causing the parser to mishandle the parameters, preventing it from returning the terms (q).
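(To illustrate the behaviour described above, here is a minimal Python sketch using only the standard library; it is an added illustration, not part of the original thread.)

```python
from urllib.parse import urlparse

url = "https://www.google.com/?gws_rd=ssl#safe=off&q=testing"
parts = urlparse(url)

# Everything after '#' lands in the fragment, not the query string,
# so a query-string lookup for 'q' finds nothing.
print(parts.query)     # gws_rd=ssl
print(parts.fragment)  # safe=off&q=testing
```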
Has anyone looked into this? You can try the URLs mentioned above yourself and see the same results.
This is not a Ruby-specific problem. I would be surprised to find any of the language bindings that parsed this URL as you're expecting it to. The root of the problem is that everything after the '#' is treated as the URL fragment, not as part of the query string.

This is a general problem with JavaScript-heavy webpages which want to represent the URL to the user without causing an actual page nav to take place in the browser. They're (ab)using the fragment as a place to store information on what would traditionally have been a page-load inducing query string.

The code to fix this would likely look like this (in Python):

```python
from urllib.parse import urlparse

the_url = "https://www.google.com/?gws_rd=ssl#safe=off&q=testing"
parsed_url = urlparse(the_url)
if 'google.com' in parsed_url.netloc and 'q=' in parsed_url.fragment:  # <-- Or some other fragile heuristic
    new_url = the_url.replace('#', '&', 1)
    parsed_url = urlparse(new_url)
# continue parsing as before
```

I'm sure my bias is showing through in the comment above, but I'm a +0 on adding this capability into referer-parser. The biggest reason is that I don't believe there's a generic, widely-applicable heuristic that we could build into the library. But I'm also sensitive to the concern of referer-parser punting on this, because it would effectively put the burden on users of our respective libraries to add a snippet like the above any time they were dealing with a domain that pulled these kinds of shenanigans. So it would help if we could determine how many domains do this, or find a way to encode the special cases in referers.yml.

/cc @alexanderdean
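(A quick sanity check of the heuristic above, added here for clarity and not part of the original comment: once the '#' is replaced with '&', a standard-library parse sees the search term in the query string.)

```python
from urllib.parse import urlparse, parse_qs

the_url = "https://www.google.com/?gws_rd=ssl#safe=off&q=testing"
new_url = the_url.replace('#', '&', 1)

# The former fragment is now part of the query string, so 'q' is visible.
print(parse_qs(urlparse(new_url).query))
# {'gws_rd': ['ssl'], 'safe': ['off'], 'q': ['testing']}
```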
Thanks for finally bottoming this one out @donspaulding! Ouch, that's a pretty nasty behavior by Google and friends. Another challenge with fixing this is that I'm pretty sure it will require the referer URI to be passed in to referer-parser as a string, because if you pass it in as a proglang-idiomatic URI object, the fragment has typically already been split out from the query string.
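(A small Python illustration of that point, added here and not from the original comment: once the referer has been parsed into a URI object, the query and fragment are already separate fields, so the '#'-to-'&' rewrite has to be applied to a raw string representation.)

```python
from urllib.parse import urlparse

parsed = urlparse("https://www.google.com/?gws_rd=ssl#safe=off&q=testing")

# The query/fragment split has already happened inside the parsed object,
# so the rewrite must be done on a string, e.g. one rebuilt via geturl().
rewritten = parsed.geturl().replace('#', '&', 1)
print(urlparse(rewritten).query)  # gws_rd=ssl&safe=off&q=testing
```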
Yes, this is certainly not language specific. As has been mentioned, this is a way to prevent/hinder easy 'decoding' of the URL to collect the search terms. I know Google does this now, but I'm not sure if anyone else has followed the practice since.

I honestly never looked at the snowplow code before, but after reading the responses above, it seems snowplow uses the default URL parsing that each language provides (I thought snowplow had its own implementation of parsing the URL this whole time). Looking at the referers.yml entry for Google:
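(The quoted excerpt did not survive in this copy of the thread; the following is an abridged reconstruction of the Google entry in referers.yml, with the long domain list shortened.)

```yaml
search:
  Google:
    parameters:
      - q
      - query
    domains:
      - google.com
      - google.gr
      - google.tn
      # ... many more country-specific Google domains
```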
you are already passing the needed qualifier to determine the search terms, but of course these new-style URLs are not parsed as expected (the # makes everything after it a fragment rather than part of the query). The only way to get around this would be adding special parse cases, as donspaulding pointed out.

I haven't looked into how uniform Google is here (for example, whether the #safe=off is always present), but it seems like any solution might open up false positives unless there is strict uniformity on Google's side in how they present the URL. While you could strip the #safe=off, that assumes Google doesn't use q= and query= in the query part of the URL for any of their NON-search related URLs, else you would have false positives.

For what it is worth, my use case of snowplow involves having the full string URLs. I am essentially passing URL strings into it and collecting data from the results, such as what search engine was used, what search terms were used, and so on. I build my own data from those results, like viewing how often terms were searched, even across different engines. Being that Google is the most widely used search engine, this leaves quite a gap if you plan on using the results for any sort of data/reports as far as search terms go.
Hey @morrow95 - thanks for this, that's a lot of helpful context. This was surprising to me. Do you have a source for search engines doing this deliberately to obfuscate the URI?
I honestly do not - I haven't really been paying as much attention to SEO-related information in recent years as I used to. However, as someone pointed out earlier, Google specifically made this change not all that long ago. Frankly, and IMO, I believe this was their way of toning down SEO-related activities as well as pushing their own analytics platform onto site owners. Being that Google has pretty much always held most of the search market share, it has always been the goal to 'figure out' what works and what doesn't with their algorithm to enhance rankings, and I, and many others, see most of their recent changes as aimed at eradicating these possibilities. Without going on and on, I think 'hiding' the search terms was just another reason they did this, but it might also have to do with privacy and a number of other things - who knows. Anyway, I think if snowplow wants to extract this information from Google, and from any others who adopted this practice, some conditionals and expressions are going to be needed in the code to handle it.
For what it is worth, this is the first result in a quick search - http://adage.com/article/dataworks/google-hides-search-terms-publishers-marketers/244949/. If you want to read more, there is an abundance of information and opinions out there about it.
Not working on these URLs:
https://www.google.gr
http://google.gr
http://google.tn