not working on google.gr #113
OK, it's not working on any Google domain, what am I missing?

```scala
import com.snowplowanalytics.refererparser.Parser

val parser = new Parser()
val referer = parser.parse("https://www.google.gr", "http://www.example.com")
println(referer)
```

The above prints `{medium: search, source: Google, term: null}`, which is the expected result for the Java library. What is going wrong for you @Valve? Which language's version of the library are you using?
@fblundun oh sorry, I didn't know this had so many language versions. I'm using the Ruby version. Here is my output:

[output screenshot]
Assigning to @kreynolds, the maintainer for the Ruby library...
Just tested this on the PHP version and none of the Google domains I tried return the search terms. For example, https://www.google.com/?gws_rd=ssl#safe=off&q=testing only parses to medium: search, with no term.
@morrow95 Google doesn't provide keyword terms for searches done on HTTPS, which is now the vast majority of them: https://searchenginewatch.com/sew/news/2296351/goodbye-keyword-data-google-moves-entirely-to-secure-search
@yalisassoon - I've been aware of that change for some time, but in my use case I have the actual URLs, such as the one I listed in the earlier post. As you can see in my example the q= is given, so I would expect the parser to return 'testing' as the term(s). For something like ...
For what it is worth... if I remove the '#safe=off' from the URLs I mentioned above, the parser correctly returns the search terms. It would appear the '#' is causing the parser to mishandle the parameters, preventing it from returning the terms (q).
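(To illustrate the behaviour described above, here is a minimal Python sketch using only the standard library; it is an added illustration, not part of the original thread.)

```python
from urllib.parse import urlparse

url = "https://www.google.com/?gws_rd=ssl#safe=off&q=testing"
parts = urlparse(url)

# Everything after '#' lands in the fragment, not the query string,
# so a query-string lookup for 'q' finds nothing.
print(parts.query)     # gws_rd=ssl
print(parts.fragment)  # safe=off&q=testing
```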
Has anyone looked into this? You can try the URLs mentioned above yourself and see the same results.
This is not a Ruby-specific problem. I would be surprised to find any of the language bindings that parsed this URL as you're expecting it to. The root of the problem is that everything after the '#' is treated as the URL fragment, not as part of the query string.

This is a general problem with JavaScript-heavy webpages which want to represent the URL to the user without causing an actual page nav to take place in the browser. They're (ab)using the fragment as a place to store information on what would traditionally have been a page-load inducing query string.

The code to fix this would likely look like this (in Python):

```python
from urllib.parse import urlparse

the_url = "https://www.google.com/?gws_rd=ssl#safe=off&q=testing"
parsed_url = urlparse(the_url)
if 'google.com' in parsed_url.netloc and 'q=' in parsed_url.fragment:  # <-- Or some other fragile heuristic
    new_url = the_url.replace('#', '&', 1)
    parsed_url = urlparse(new_url)
# continue parsing as before
```

I'm sure my bias is showing through in the comment above, but I'm a +0 on adding this capability into referer-parser. The biggest reason is that I don't believe there's a generic, widely-applicable heuristic that we could build into the library. But I'm also sensitive to the concern of referer-parser punting on this, because it would effectively put the burden on users of our respective libraries to add a snippet like the above any time they were dealing with a domain that pulled these kinds of shenanigans. So it would help if we could determine how many domains do this, or find a way to encode the special cases in referers.yml.

/cc @alexanderdean
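(A quick sanity check of the heuristic above, added here for clarity and not part of the original comment: once the '#' is replaced with '&', a standard-library parse sees the search term in the query string.)

```python
from urllib.parse import urlparse, parse_qs

the_url = "https://www.google.com/?gws_rd=ssl#safe=off&q=testing"
new_url = the_url.replace('#', '&', 1)

# The former fragment is now part of the query string, so 'q' is visible.
print(parse_qs(urlparse(new_url).query))
# {'gws_rd': ['ssl'], 'safe': ['off'], 'q': ['testing']}
```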
Thanks for finally bottoming this one out @donspaulding! Ouch, that's a pretty nasty behavior by Google and friends. Another challenge with fixing this is that I'm pretty sure it will require the referer URI to be passed in to referer-parser as a string, because if you pass it in as a proglang-idiomatic URI object, the fragment has typically already been split out from the query string.
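(A small Python illustration of that point, added here and not from the original comment: once the referer has been parsed into a URI object, the query and fragment are already separate fields, so the '#'-to-'&' rewrite has to be applied to a raw string representation.)

```python
from urllib.parse import urlparse

parsed = urlparse("https://www.google.com/?gws_rd=ssl#safe=off&q=testing")

# The query/fragment split has already happened inside the parsed object,
# so the rewrite must be done on a string, e.g. one rebuilt via geturl().
rewritten = parsed.geturl().replace('#', '&', 1)
print(urlparse(rewritten).query)  # gws_rd=ssl&safe=off&q=testing
```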
Yes, this is certainly not language specific. As has been mentioned, this is a way to prevent/hinder easy 'decoding' of the URL to collect the search terms. I know Google does this now, but I'm not sure if anyone else has followed the practice since.

I honestly never looked at the snowplow code before, but after reading the responses above, it seems snowplow uses the default URL parsing that each language provides (I thought snowplow had its own implementation of parsing the URL this whole time). Looking at the referers.yml entry for Google:
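(The quoted excerpt did not survive in this copy of the thread; the following is an abridged reconstruction of the Google entry in referers.yml, with the long domain list shortened.)

```yaml
search:
  Google:
    parameters:
      - q
      - query
    domains:
      - google.com
      - google.gr
      - google.tn
      # ... many more country-specific Google domains
```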
you are already passing the needed qualifier to determine the search terms, but of course these new-style URLs are not parsed as expected (the # makes everything after it a fragment rather than part of the query). The only way to get around this would be adding special parse cases, as donspaulding pointed out.

I haven't looked into how uniform Google is here (for example, whether the #safe=off is always present), but it seems like any solution might open up false positives unless there is strict uniformity on Google's side in how they present the URL. While you could strip the #safe=off, that assumes Google doesn't use q= and query= in the query part of the URL for any of their NON-search related URLs, else you would have false positives.

For what it is worth, my use case of snowplow involves having the full string URLs. I am essentially passing URL strings into it and collecting data from the results, such as what search engine was used, what search terms were used, and so on. I build my own data from those results, like viewing how often terms were searched, even across different engines. Being that Google is the most widely used search engine, this leaves quite a gap if you plan on using the results for any sort of data/reports as far as search terms go.
Hey @morrow95 - thanks for this, that's a lot of helpful context. This was surprising to me. Do you have a source for search engines doing this deliberately to obfuscate the URI?
I honestly do not - I haven't really been paying as much attention to SEO-related information in recent years as I used to. However, as someone pointed out earlier, Google specifically made this change not all that long ago. Frankly, and IMO, I believe this was their way of toning down SEO-related activities as well as pushing their own analytics platform onto site owners. Being that Google has pretty much always held most of the search market share, it has always been the goal to 'figure out' what works and what doesn't with their algorithm to enhance rankings, and I, and many others, see most of their recent changes as aimed at eradicating these possibilities. Without going on and on, I think 'hiding' the search terms was just another reason they did this, but it might also have to do with privacy and a number of other things - who knows. Anyway, I think if snowplow wants to extract this information from Google, and from any others who adopted this practice, some conditionals and expressions are going to be needed in the code to handle it.
For what it is worth, this is the first result in a quick search - http://adage.com/article/dataworks/google-hides-search-terms-publishers-marketers/244949/. If you want to read more, there is an abundance of information and opinions out there about it.
Not working on these URLs:
https://www.google.gr
http://google.gr
http://google.tn