Replies: 4 comments
-
A few thoughts:
-
I agree with your first objection: there is no tight correlation. I just thought it'd be better to… But your second objection is pretty strong. So using this "time gate" could not prevent such efforts, but it would at least filter out the lazy trolls. Regarding the last two objections, I think it is very unlikely that someone… I'd like to stress that I do not want to establish censorship. I just think there are statistical methods to efficiently remove at least some purposely biasing troll data without affecting merely volatile labeling behavior.
-
I think releasing the data in full, with only legally required things like PII removed, and with users and votes identified by hash or something, would help. Then we'd have something that the statistically minded can investigate and explore. It may be possible to group people into clusters and find "bad neighbourhoods", similar to how Google's PageRank work identified link spam. This would also be a great area of investigation for data science students looking for some real-world experience; I'm guessing there are a lot of people from the academic world in the community.
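To make the clustering idea concrete, here is a minimal Python sketch on entirely synthetic data. The noise rates, the agreement metric, and the cluster count are illustrative assumptions, not a proposal for the real pipeline:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Synthetic binary labels: 40 honest users follow the "true" label with
# 10% noise, while 10 colluding users systematically invert it.
truth = rng.integers(0, 2, size=200)
honest = (truth + (rng.random((40, 200)) < 0.10)) % 2
trolls = (1 - truth + (rng.random((10, 200)) < 0.10)) % 2
labels = np.vstack([honest, trolls])          # rows = users, cols = items

# Pairwise disagreement rate between users (a simple agreement measure).
n = labels.shape[0]
dist = np.array([[np.mean(labels[i] != labels[j]) for j in range(n)]
                 for i in range(n)])

# Average-linkage hierarchical clustering into two groups: the colluders
# land in their own "bad neighbourhood" because they agree with each
# other but disagree with everyone else.
tree = linkage(squareform(dist, checks=False), method="average")
clusters = fcluster(tree, t=2, criterion="maxclust")

for c in np.unique(clusters):
    members = np.where(clusters == c)[0]
    print(f"cluster {c}: users {members}")
```

On the synthetic data above, users 40 through 49 (the colluders) separate cleanly into their own cluster; real vote data would of course be noisier and need a more careful agreement measure.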
-
Making the data accessible to everyone for statistical investigation sounds interesting. But I can't remember a consent form covering this, so I am afraid that might not be possible.
-
In order to reach more representative labeling results, there is a need to filter out purposely inappropriate labeling.
Potential solutions and implementations should be discussed here.
All labels belonging to users identified as most probably being saboteurs or trolls should be taken out and replaced by the results of additional labeling rounds.
As this might have an impact on the already existing trees, there is some time pressure on this discussion.
But even if the filters cannot be applied to the already collected data, they should at least be applied to the upcoming data.
Suggestion 1)
To avoid erasing all of the appreciated heterogeneity in the data, it's important to calculate a threshold that gives a statistically significant signal indicating that a user is not just unconventional but, with 99% probability, systematically and purposely harming the data.
If a statistician is around, it would be great if you posted the proper calculation for this.
It's important not to recalculate the distributions after filtering has been applied.
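As a rough illustration, here is a minimal Python sketch of such a test, assuming each label can be reduced to agree/disagree with the per-item majority. The 15% baseline disagreement rate and the data shape are assumptions, not the project's actual schema:

```python
# One-sided binomial test: flag a user only if we can reject "ordinary
# disagreement" at the 99% level. The baseline should be estimated once
# from the unfiltered data and then held fixed, so the filter is never
# applied against a moving target.
from scipy.stats import binomtest

def flag_user(n_disagreements: int, n_labels: int,
              baseline_rate: float = 0.15, alpha: float = 0.01) -> bool:
    """True if the user's disagreement rate is significantly above baseline."""
    result = binomtest(n_disagreements, n_labels, baseline_rate,
                       alternative="greater")
    return result.pvalue < alpha

# An unconventional but honest user (30% disagreement over 20 labels)
# survives; a systematic saboteur (90% over 50 labels) does not.
print(flag_user(6, 20))    # False: not enough evidence
print(flag_user(45, 50))   # True: systematic deviation
```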
Suggestion 2)
Labels that were given in too short a time were most likely not made after thoughtful consideration. This can either be a standalone indicator or be used in conjunction with the signal from Suggestion 1. As saboteurs or trolls might adapt to this monitoring, it should in my opinion be treated as sufficient but not necessary evidence of intent. The minimum time a user should need for a labeling decision should correlate proportionally with the amount of text the label refers to. This screening method would also need a certain percentage threshold in order to signal purposeful misuse.
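A minimal sketch of this time gate, where the reading speed, the minimum floor, and the 20% threshold are illustrative assumptions to be calibrated on real data:

```python
# Flag users only when a large share of their labels arrives faster than
# a plausible reading time, so a few quick judgments on easy texts do
# not trip the filter on their own.
CHARS_PER_SECOND = 25.0    # assumed average reading speed
MIN_SECONDS = 2.0          # floor so very short texts still get a gate

def min_decision_time(text_length_chars: int) -> float:
    """Minimum plausible decision time, proportional to text length."""
    return max(MIN_SECONDS, text_length_chars / CHARS_PER_SECOND)

def too_fast_fraction(events: list[tuple[int, float]]) -> float:
    """Fraction of labels submitted faster than the gate.

    events: one (text_length_chars, seconds_taken) pair per label.
    """
    fast = sum(1 for length, secs in events
               if secs < min_decision_time(length))
    return fast / len(events)

events = [(600, 3.0), (1200, 4.0), (300, 15.0), (900, 2.5)]
print(too_fast_fraction(events))          # 0.75
print(too_fast_fraction(events) > 0.2)    # True -> candidate for review
```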
I encourage everyone to discuss this issue and implement solutions by consensus.