Adding a paid advertising medium #130

Open
kingo55 opened this issue Jul 15, 2016 · 21 comments

@kingo55
Contributor

kingo55 commented Jul 15, 2016

A lot of Google Display Network traffic just shows up under "unknown", and a lot of other display networks show up the same way. To use this data in Snowplow, we need to look for mkt_network = 'Google AdWords' and refr_urlhost = 'googleads.g.doubleclick.net'.
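
As a rough illustration of that workaround, the filter could look like the sketch below. It assumes events are available as plain dicts keyed by the two Snowplow field names mentioned above; the helper name and the sample rows are hypothetical.

```python
# Hypothetical helper: flags events that match the manual workaround.
GDN_NETWORK = "Google AdWords"
GDN_HOST = "googleads.g.doubleclick.net"


def is_google_display(event):
    """Return True when both the network and the referer host match."""
    return (
        event.get("mkt_network") == GDN_NETWORK
        and event.get("refr_urlhost") == GDN_HOST
    )


# Made-up sample rows, for illustration only.
events = [
    {"mkt_network": "Google AdWords", "refr_urlhost": "googleads.g.doubleclick.net"},
    {"mkt_network": None, "refr_urlhost": "www.example.com"},
]
display_events = [e for e in events if is_google_display(e)]
```

Every new display network would need another hand-written condition like this, which is the maintenance cost a shared referer list is meant to avoid.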

Do you think it's worth classifying them with referer-parser? Happy to submit a pull request with the changes...

[Screenshot attached, 2016-07-15]

I suspect we'd need to put some thought into the category naming. For example, would we go specific ("display", "cpc", "ppc") or general ("advertising")?

@alexanderdean
Contributor

Interesting idea @kingo55 ! What do the other maintainers think?

@lstrojny
Contributor

Makes sense from my POV. Would go specific with display, cpc, ppc etc.

@kingo55
Contributor Author

kingo55 commented Jul 17, 2016

@alexanderdean and @lstrojny I've been grouping some of the sites together but I don't think we can reliably specify the type.

For example, Taboola does content marketing, but they may also do display through their network:

display:
  Taboola:
    domains:
      - trc.taboola.com
      - api.taboola.com

Do we classify them as "display", "content marketing" or something else? In some ways they sound more like a network than a traffic source.

Thoughts?

@alexanderdean
Contributor

alexanderdean commented Jul 17, 2016

Hey @lstrojny, @kingo55 - I think the clue is in the ticket name 😄

Anything more specific than "paid" is going to lead to bikeshedding - because we can't precisely capture an advertising company's business model based simply on their referer...

@kingo55
Contributor Author

kingo55 commented Jul 17, 2016

Awesome... that makes things easy then.

Here's a draft I've been working on:

paid:

  Google:
    domains:
      - www.googleadservices.com
      - partner.googleadservices.com
      - googleads.g.doubleclick.net
      - tpc.googlesyndication.com
      - googleadservices.com

  Taboola:
    domains:
      - trc.taboola.com
      - api.taboola.com
      - taboola.com

  Criteo:
    domains:
      - cas.jp.as.criteo.com
      - cas.criteo.com

  Doubleclick:
    domains:
      - ad.doubleclick.net
      - ad-apac.doubleclick.net
      - s0.2mdn.net
      - s1.2mdn.net
      - dp.g.doubleclick.net
      - pubads.g.doubleclick.net

  AppNexus:
    domains:
      - ib.adnxs.com
      - adnxs.com
      - 247realmedia.com

  Sizmek:
    domains:
      - bs.serving-sys.com

  PubMatic:
    domains:
      - sshowads.pubmatic.com

  Acuity Ads:
    domains:
      - acuityplatform.com

  OpenX:
    domains:
      - us-ads.openx.net
      - openx.net
      - servedbyopenx.com
      - openxenterprise.com

  Tribal Fusion:
    domains:
      - cdnx.tribalfusion.com

  Eyeota:
    domains:
      - eyeota.net

  Sociomantic Labs:
    domains:
      - sociomantic.com

  ONE by AOL:
    domains:
      - nexage.com

  Neustar AdAdvisor:
    domains:
      - adadvisor.net

  Casale Media:
    domains:
      - casalemedia.com

  BidSwitch:
    domains:
      - bidswitch.net

  StickyADS.tv:
    domains:
      - stickyadstv.com
      - sfx.stickyadstv.com

  Mixpo:
    domains:
      - mixpo.com

  Yieldmo:
    domains:
      - yieldmo.com

  Jivox:
    domains:
      - jivox.com

  Adform:
    domains:
      - adform.net

  Fluct:
    domains:
      - adingo.jp

  AudienceScience:
    domains:
      - wunderloop.net

  MicroAd:
    domains:
      - microad.jp

  LifeStreet:
    domains:
      - lfstmedia.com

  Rubicon Project:
    domains:
      - optimized-by.rubiconproject.com

  SteelHouse:
    domains:
      - steelhousemedia.com

  Sovrn:
    domains:
      - lijit.com

  Sonobi:
    domains:
      - sonobi.com

  ZEDO:
    domains:
      - zedo.com
      - z1.zedo.com

  AdRoll:
    domains:
      - adroll.com

  Flashtalking:
    domains:
      - flashtalking.com
      - servedby.flashtalking.com

  Outbrain:
    domains:
      - paid.outbrain.com

  Plista:
    domains:
      - farm.plista.com

  White Pages:
    domains:
      - www.whitepages.com.au
      - mobile.whitepages.com.au

  MyShopping.com.au:
    domains:
      - www.myshopping.com.au

  GetPrice.com.au:
    domains:
      - www.getprice.com.au

  Finder.com.au:
    domains:
      - www.finder.com.au
      - fcc.finder.com.au

  Mozo:
    domains:
      - mozo.com.au
      - a.mozo.com.au

  InfoChoice:
    domains:
      - www.infochoice.com.au
      - keyfactssheet.infochoice.com.au

  RateCity.com.au:
    domains:
      - ratecity.com.au
      - direct.ratecity.com.au
      - www.ratecity.com.au
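
To illustrate how a draft like this might be consumed, here is a minimal sketch of a host lookup. It is not referer-parser's actual implementation: the `REFERERS` dict is a hand-copied excerpt of the draft above (a real consumer would load the YAML file), and `build_index` and `classify` are hypothetical names. Stripping subdomains one label at a time is an assumption about the desired matching behaviour.

```python
from urllib.parse import urlparse

# Hand-copied excerpt of the draft; the full list would come from the YAML file.
REFERERS = {
    "paid": {
        "Google": {"domains": ["googleads.g.doubleclick.net", "googleadservices.com"]},
        "Taboola": {"domains": ["trc.taboola.com", "taboola.com"]},
        "Outbrain": {"domains": ["paid.outbrain.com"]},
    },
}


def build_index(referers):
    """Flatten medium -> source -> domains into domain -> (medium, source)."""
    index = {}
    for medium, sources in referers.items():
        for source, config in sources.items():
            for domain in config["domains"]:
                index[domain] = (medium, source)
    return index


def classify(referer_url, index):
    """Return (medium, source) for a referer URL, or ("unknown", None)."""
    host = urlparse(referer_url).hostname or ""
    labels = host.split(".")
    # Try the full host first, then progressively shorter parent domains.
    for start in range(len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in index:
            return index[candidate]
    return ("unknown", None)


index = build_index(REFERERS)
print(classify("https://googleads.g.doubleclick.net/pagead/ads?x=1", index))
print(classify("https://www.outbrain.com/", index))
```

With this excerpt, the first call resolves to the "paid" medium, while `www.outbrain.com` stays "unknown" because only `paid.outbrain.com` is listed.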

@alexanderdean
Contributor

Whoa - great list @kingo55 !

@kingo55
Contributor Author

kingo55 commented Jul 18, 2016

Thanks @alexanderdean - here's a first cut that seems to do the job in the Python lib: kingo55@ea2d99c

If you want me to keep the paid changes separate from the other source changes, I can split them out and submit them in separate pull requests.

@alexanderdean
Contributor

Yes please, separate PR would be great!

@ghost

ghost commented Jul 20, 2016

@kingo55 @alexanderdean FYI, most online marketing campaigns I know use UTM parameters to identify paid traffic: https://en.wikipedia.org/wiki/UTM_parameters
I think the problem with the referrer list is that there are so many online marketing companies out there that you can't really manage a list of all of them. Also think about affiliate campaigns.
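
For readers unfamiliar with the UTM approach, the sketch below shows one way to read those parameters from a landing-page URL using only the Python standard library. The function name and the example URL are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def utm_params(page_url):
    """Return the utm_* query parameters of a landing-page URL as a dict."""
    query = parse_qs(urlparse(page_url).query)
    # parse_qs returns lists; keep the first value of each utm_* key.
    return {k: v[0] for k, v in query.items() if k.startswith("utm_")}


# Made-up landing-page URL, for illustration only.
url = "https://shop.example.com/?utm_source=newsletter&utm_medium=cpc&utm_campaign=spring"
params = utm_params(url)
```

This only works when the campaign link was tagged in the first place; an untagged URL yields an empty dict.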

@alexanderdean
Contributor

Hey @DCMNMarc - sure, we make heavy use of UTM parameters in Snowplow.

> I think the problem with the referrer list is that there are so many online marketing companies out there that you can't really manage a list of all of them

Appreciate the point but if people had said the same thing about IP:geolocation then we would never have had things like MaxMind...

@ghost

ghost commented Jul 21, 2016

@alexanderdean

> Appreciate the point but if people had said the same thing about IP:geolocation then we would never have had things like MaxMind...

This is true, but do you think the amount of work for generating and managing such a big database fits into your workload, even if there is already a solution for it using UTM parameters?

Also, what happens if you detect paid traffic that is at the same time a known referrer type (like a search engine)? As far as I know, you currently support only one.

@alexanderdean
Contributor

> what happens if you detect paid traffic that is at the same time a known referrer type (like a search engine)

Good point. Any given referrer URI should only be found in the database once. If the same URI is used for two different mediums, like search and paid, then we should give the traffic the benefit of the doubt and make it search (i.e. don't assume paid).
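
A minimal sketch of that precedence rule, assuming the list has been flattened into (domain, medium) pairs: "paid" never overrides another medium, regardless of the order in which entries are read. The function name and sample entries are hypothetical, and conflicts between two non-paid mediums are simply resolved in favour of whichever is seen first.

```python
def build_medium_index(entries):
    """Map each domain to one medium; 'paid' never overrides another medium."""
    index = {}
    for domain, medium in entries:
        current = index.get(domain)
        # Accept the entry if the domain is new or was only known as paid.
        if current is None or current == "paid":
            index[domain] = medium
    return index


# Made-up entries: the same domain listed under both paid and search.
entries = [
    ("ads.example.com", "paid"),
    ("search.example.com", "paid"),
    ("search.example.com", "search"),
]
index = build_medium_index(entries)
```

Here `search.example.com` ends up as "search", while `ads.example.com` remains "paid".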

> but do you think the amount of work for generating and managing such a big database fits into your workload, even if there is already a solution for it using UTM parameters?

UTM parameters are great but they are only suggestive and they can be omitted, incorrect or spoofed. The adtech landscape is huge [1] but the top 20 vendors very likely account for more than 80% of all revenues, and @kingo55's list is a great start...

So I am broadly in favor of adding this...

[1] http://www.lumapartners.com/resource-center/lumascapes-2/

@ghost

ghost commented Jul 21, 2016

Great link, @alexanderdean.

As long as this list doesn't affect other mediums (I would highly recommend a test for it), I'm fine with it, as I don't need to use this feature ;)

@alexanderdean
Contributor

Good idea @DCMNMarc - added a linter ticket to enforce this kind of thing #132

Do you use referer-parser directly or as part of Snowplow?
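
The exact scope of #132 is not described in this thread, but a check of the kind suggested above might look like the following sketch. It assumes the referer list has already been loaded into a nested dict shaped like the YAML (medium, then source, then `domains`); the function name and sample data are hypothetical.

```python
from collections import defaultdict


def find_duplicate_domains(referers):
    """Return {domain: sorted mediums} for domains listed under several mediums."""
    seen = defaultdict(set)
    for medium, sources in referers.items():
        for config in sources.values():
            for domain in config["domains"]:
                seen[domain].add(medium)
    return {d: sorted(m) for d, m in seen.items() if len(m) > 1}


# Made-up data with one domain listed under two mediums.
referers = {
    "search": {"Example Search": {"domains": ["search.example.com"]}},
    "paid": {"Example Ads": {"domains": ["search.example.com", "ads.example.com"]}},
}
duplicates = find_duplicate_domains(referers)
```

A test suite could assert that `duplicates` is empty for the real list, so that adding a paid entry cannot silently change how an existing medium is classified.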

@ghost

ghost commented Jul 21, 2016

I'm using the Python version directly in a Spark application.

@kingo55
Contributor Author

kingo55 commented Jul 21, 2016

@DCMNMarc - more useful than grouping it all under the vast expanse of "unknown" IMO. Paid traffic behaves very differently and is often laden with bots and unique data in URLs.

We run Snowplow across a range of sites with inconsistently manually tagged URLs. Makes sense to group them from that perspective.

@ryanrozich

Having this in referer parser would be very useful to us. Did this ever get merged, or can we just manually update the config from above?

@alexanderdean, does making our own updates to these data files cause any problems when upgrading to future versions of Snowplow?

@kingo55, do you have any updates to the version of the list you posted here last July?

@alexanderdean
Contributor

This hasn't been merged yet @ryanrozich; making edits to this file inside Snowplow shouldn't break anything.

@kingo55
Contributor Author

kingo55 commented Feb 9, 2017

@ryanrozich - This was my latest commit: https://github.com/snowplow/referer-parser/pull/139/commits

Not sure why it was failing the tests though.

@rbolkey

rbolkey commented Feb 13, 2017

Wanted to confirm my understanding if we were to try to make use of these edits.

  1. Move to self-hosting assets: https://github.com/snowplow/snowplow/wiki/4-Self-hosting-Hadoop-Enrich
  2. Unpack 3-enrich/scala-hadoop-enrich/snowplow-hadoop-enrich-[version].jar
  3. Replace referers.yml.
  4. Re-package and deploy snowplow-hadoop-enrich-[version]-[fork].jar to our hosted assets.
  5. Update our config for the forked version hadoop_enrich: [version]-[fork].
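
Since a jar is a zip archive, steps 2 to 4 could be scripted along the lines of the sketch below. This is an illustration under assumptions, not a tested Snowplow procedure: the entry name of referers.yml inside the jar is not confirmed in this thread, the function name is made up, and signed jars are not handled.

```python
import zipfile


def replace_in_jar(src_jar, dst_jar, member, new_file):
    """Copy src_jar to dst_jar, swapping the archive entry `member` for new_file."""
    replaced = False
    with zipfile.ZipFile(src_jar) as src:
        with zipfile.ZipFile(dst_jar, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                if item.filename == member:
                    # Write the replacement file under the original entry name.
                    dst.write(new_file, arcname=member)
                    replaced = True
                else:
                    # Copy every other entry through unchanged.
                    dst.writestr(item, src.read(item.filename))
    if not replaced:
        raise KeyError(f"{member} not found in {src_jar}")
    return dst_jar


# Example call (paths and the member name are placeholders, not verified):
# replace_in_jar("snowplow-hadoop-enrich-[version].jar",
#                "snowplow-hadoop-enrich-[version]-[fork].jar",
#                "referers.yml", "referers.yml")
```

Raising an error when the entry is missing guards against silently shipping a jar that still contains the old list.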

Also, to confirm: if we use the commercial version of MaxMind, would we have to self-host anyway?

We do have a concern about how much effort it takes to self-host these assets. What are the best practices for keeping up to date, and how much additional time does it typically take per release? Sounds like using Transmit to sync the hosted assets, and then rinse and repeat steps 2-5 above?

Thanks!

@alexanderdean
Contributor

> Also, to confirm: if we use the commercial version of MaxMind, would we have to self-host anyway?

No, you don't have to self-host the jars that Snowplow runs just because you are self-hosting the commercial MaxMind file(s).

> Wanted to confirm my understanding if we were to try to make use of these edits.

Yes, the 5 steps you list are the correct ones, @rbolkey.

> We do have a concern about how much effort it takes to self-host these assets. What are the best practices for keeping up to date, and how much additional time does it typically take per release?

It would likely add 30 mins or so per upgrade, assuming a fast network connection.
