- Background
- Features
- New ideas from readings
- Dataset
Pillar is creating machine learning models to detect great content during Twitch streams. We have two models in mind: (1) predict which clips in a stream have the potential to go viral, and (2) predict which clips, and in what order, could be used in a YouTube video.
All code and documentation so far focus on model 1: predicting which clips in a stream have the potential to go viral.
- Algorithm 1: Find the best moments in clips based on where the most users participated. "Most" is defined as the ratio of unique users during a 2 min section to unique users for the entire session.
- Algorithm 2: Find the best moments in clips based on when the rate of messages per user peaked. This involves answering the question "at which 2 min segment do the most users send the most messages?" For example, if users X, Y, and Z all send 60% of their messages within a timestamp range delta, then that range might qualify as a "best moment".
- NOTE: The current implementation answers the question "at which 2 min segment do users send the most messages fastest?"
- Algorithm 3 (WIP): Weigh each user by their chat rate, account age, etc. Heavier users are predicted to chat more often at "best moment" timestamps.
- STATUS: current weight determined by (num_words_of_user / num_words_of_top_user)
- Algorithm 3.5: Weighs clips based on the most words/emojis/both used in chat
- Algo 3.6: Ranks clips based on mean emoji use by users, calculated as "number of emojis used at timestamp divided by number of unique users at that timestamp"
- STATUS: current weight determined by (
- Algo 4: NLP sentiment per clip segment
- Weighs each timestamp by what percent of sentiment was positive, negative, or neutral
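
As a rough illustration of Algorithms 1 to 3, the sketch below scores fixed 2 min windows of a chat log. It is not the repository's implementation: the message shape `(seconds_into_stream, user, text)`, the function names, and the sample `chat` data are all assumptions made for this example, and Algorithm 2 is implemented in the simpler form described in the NOTE (messages per unique user in each window).

```python
from collections import defaultdict

WINDOW_SECONDS = 120  # the 2 min segment length mentioned in the text


def bucket_messages(messages, window=WINDOW_SECONDS):
    """Group (seconds_into_stream, user, text) tuples by window index."""
    buckets = defaultdict(list)
    for ts, user, text in messages:
        buckets[int(ts // window)].append((user, text))
    return buckets


def participation_ratio(messages, window=WINDOW_SECONDS):
    """Algorithm 1: unique users in a window / unique users in the session."""
    session_users = {user for _, user, _ in messages}
    if not session_users:
        return {}
    return {
        idx: len({user for user, _ in msgs}) / len(session_users)
        for idx, msgs in bucket_messages(messages, window).items()
    }


def messages_per_user(messages, window=WINDOW_SECONDS):
    """Algorithm 2 (per the NOTE): messages in a window / unique users in it."""
    return {
        idx: len(msgs) / len({user for user, _ in msgs})
        for idx, msgs in bucket_messages(messages, window).items()
    }


def user_weights(messages):
    """Algorithm 3 status: num_words_of_user / num_words_of_top_user."""
    words = defaultdict(int)
    for _, user, text in messages:
        words[user] += len(text.split())
    top = max(words.values(), default=0)
    return {user: (n / top if top else 0.0) for user, n in words.items()}


def best_window(scores):
    """Return the window index with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical chat log, invented for this example.
chat = [
    (5, "ann", "hello everyone"),
    (130, "ann", "LUL"),
    (135, "bob", "that was insane"),
    (140, "cat", "clip it"),
    (150, "bob", "clip it now"),
    (260, "cat", "gg"),
]

print(best_window(participation_ratio(chat)))  # 1 (the 120 s to 240 s window)
print(user_weights(chat))
```

Non-overlapping windows keep the sketch short; a sliding window would catch peaks that straddle a boundary, at the cost of more bookkeeping.
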
- Timestamp with the highest percentage of subscribed users participating
- Cluster chat via Brown clustering or a similar method
- This is basically algo 3_5: Repeated Lines. This simple comparison is also a running percentage of the number of repeated lines in a segment. Emotes are excluded; again, this is a very simple text comparison. In future iterations I would hope to take advantage of a better (and cheaper :)) text analysis tool to compare text lines for similarity.
- This is basically algo 3_6: Emote Spam. This simple calculation shows the percentage of chat lines that were emote-only and contained more than one emote. No hard calculations here: I am able to grab the emote tags to determine whether emotes are present and how many, and this becomes my counter.
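
The Repeated Lines and Emote Spam ideas above could be prototyped along these lines. This is a minimal sketch under stated assumptions: each chat line is taken to be a `(text, emote_ranges)` pair, where `emote_ranges` is a list of inclusive `(start, end)` character positions already parsed from the emote tags. That shape, the function names, and the sample `segment` are hypothetical, and the readings do not specify the exact repeat-matching rule, so a case-insensitive exact match is used here.

```python
def strip_emotes(text, emote_ranges):
    """Blank out emote character spans (inclusive start/end) in a chat line."""
    chars = list(text)
    for start, end in emote_ranges:
        for i in range(start, min(end, len(chars) - 1) + 1):
            chars[i] = " "
    # Collapse the leftover whitespace so emote-only lines become "".
    return " ".join("".join(chars).split())


def repeated_line_pct(segment):
    """Percent of non-empty, emote-stripped lines that repeat an earlier line."""
    seen, repeated, total = set(), 0, 0
    for text, emote_ranges in segment:
        line = strip_emotes(text, emote_ranges).lower()
        if not line:
            continue  # emote-only lines are excluded from this comparison
        total += 1
        if line in seen:
            repeated += 1
        seen.add(line)
    return 100.0 * repeated / total if total else 0.0


def emote_spam_pct(segment):
    """Percent of lines that are emote-only and contain more than one emote."""
    if not segment:
        return 0.0
    spam = sum(
        1
        for text, emote_ranges in segment
        if len(emote_ranges) > 1 and not strip_emotes(text, emote_ranges)
    )
    return 100.0 * spam / len(segment)


# Hypothetical segment, invented for this example.
segment = [
    ("Kappa Kappa", [(0, 4), (6, 10)]),  # emote-only, two emotes
    ("clip that", []),
    ("CLIP THAT", []),                   # repeats the previous line
    ("LUL nice", [(0, 2)]),              # one emote plus text
]

print(emote_spam_pct(segment))     # 25.0
print(repeated_line_pct(segment))
```
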
ML_Infra created a dataset from the top 100 CCCs for the top 50 games, and pulled the chat logs for each videoId that the CCCs came from.
It is currently in https://github.com/pillargg/timestamps/tree/add-datasetcreator-lambda/lambdas/datasetcreator, but it will be moved to the ML_Infra repo soon.