Online Predatory Conversation Detection (Epic) #19

Open
hosseinfani opened this issue Feb 4, 2023 · 3 comments

@hosseinfani
Member

hosseinfani commented Feb 4, 2023

@rezaBarzgar
@hamedwaezi01
@impedaka
@EhsanSl

I created this issue, labeled as an epic, for the general planning of this project. All other tasks, subtasks, etc. will be linked to it as separate issues. Let's ask Reza (@rezaBarzgar) to lead this project. @rezaBarzgar, please dispatch the tasks, monitor the progress, merge the code, etc. Thanks.

P.S. Thanks to Alice (@impedaka), who created a nice web demo here and has also run experiments using linear models. Thanks also to @EhsanSl, who is working on the pipeline and bringing new insights to it.

Here are the main to-do tasks. Please feel free to comment or revise.

(1) Problem Definition

(2) Proposed Method

  • Model Architecture:
    • Non-neural Models ==> Refactor the current codebase and make it ready to add new models!
    • Neural Models: (these models can be configured by changing the settings file)
      • Feedforward Network
      • A naïve CNN
      • RNN
      • LSTM
      • GRU
  • Training Strategy
    • Text preprocessing (informal to formal)
    • Sampling to handle the imbalance
    • Curriculum Learning
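
As a starting point for the "sampling to handle the imbalance" item, here is a minimal sketch of random oversampling of the minority class. The function name `oversample_minority` and the toy data are placeholders, not part of the current codebase, and random oversampling is only one of several options (undersampling or class weights would be alternatives).

```python
import random
from collections import Counter


def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate samples of the rarer classes until every
    class is as frequent as the most frequent one."""
    rng = random.Random(seed)
    target = max(Counter(labels).values())

    # Group the samples by their label.
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)

    balanced = []
    for label, items in by_label.items():
        # Draw (with replacement) as many extra copies as the class is missing.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)

    rng.shuffle(balanced)
    return [s for s, _ in balanced], [y for _, y in balanced]


# Toy data: four normal conversations (0) and one predatory conversation (1).
convs = ["c1", "c2", "c3", "c4", "c5"]
labels = [0, 0, 0, 0, 1]
balanced_convs, balanced_labels = oversample_minority(convs, labels)
```

Resampling like this would normally be applied to the training split only, so that the test split keeps the original class distribution.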

(3) Experimentation

(4) Paper Write Up

  • Target Conference/Journal: ECIR 2024
    • Full paper abstract submission: September 20, 2023, 11:59 pm (AoE)
    • Full paper submission: September 27, 2023, 11:59 pm (AoE)
    • Full paper notification: December 14, 2023
    • Main conference: March 25-27, 2024
  • The Google Docs draft link

(5) Demo website (#16) ==> By @impedaka

@hamedwaezi01
Member

Hi
I created this diagram to show the general classification flow, from raw data to the output label, in this project. It certainly needs a lot of adjustments, so please let me know what you think about it.
NOTE: This is not the class diagram, but the two overlap to some extent.

[Image: flow diagram, "Text Data"]

@hosseinfani
Member Author

Hi @hamedwaezi01, this is awesome. Thank you.
Everything seems clear to me. Just the following notes:

  • text data -> raw data?
  • is the CSV format necessary when we can read directly from the XML?
  • dataset -> input
  • why can't the non-recurrent models have embeddings in the input?
  • activation -> output

Also, it would be great if you added hints with the code file paths, so the blocks can be found easily in the codebase too.

Btw, we need an experiment on early detection, i.e., how much of a conversation is needed to detect a predatory one. Remind me to discuss it further if this is not clear.
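
To make the early detection idea concrete, here is a rough sketch that feeds growing prefixes of a conversation to a classifier and reports the fraction of messages needed before it first fires. Both `earliest_detection_fraction` and `toy_classifier` are hypothetical names for illustration; the real experiment would plug in one of our trained models, and the metric itself is only one possible choice.

```python
from typing import Callable, List, Optional


def earliest_detection_fraction(
    messages: List[str],
    is_predatory: Callable[[List[str]], bool],
) -> Optional[float]:
    """Return the smallest fraction of the conversation (counted in messages)
    at which the classifier flags it, or None if it never does."""
    for k in range(1, len(messages) + 1):
        # Only the first k messages are visible to the classifier.
        if is_predatory(messages[:k]):
            return k / len(messages)
    return None


def toy_classifier(prefix: List[str]) -> bool:
    """Stand-in for a trained model: fires once a marker message is seen."""
    return "FLAG" in prefix


conversation = ["m1", "m2", "FLAG", "m4"]
fraction = earliest_detection_fraction(conversation, toy_classifier)
print(fraction)  # 0.75, i.e. three of the four messages were needed
```

Averaging this fraction over the conversations that are detected, and separately counting the ones that are never detected, would give a first picture of how early a model can decide.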

@hamedwaezi01
Member

Hi @hosseinfani. Thanks, and sorry again for the late reply.
You're right. "Raw data" is more accurate.

  • About using the XML file
    Since we use pandas DataFrames in the preprocessing steps, it is better to mention that we are going to convert the XML to a DataFrame without loss of data and save it as CSV.
    Also, in our MVP baseline, we converted the XML to CSV too.

  • dataset -> input
    What about "Input Features"?

  • why can't the non-recurrent models have embeddings in the input?
    Actually, I have to add that too; I think I missed it. Additionally, there should be a separate box for fine-tuned BERT models and the respective datasets.

  • activation -> output
    Good idea. Previously I had doubts, since "output" might be confused with the number of outputs or its configuration.

  • early detection
    Yes, there were a couple of papers about it. We need to list a couple of metrics that measure it and then proceed.
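
For the XML-to-CSV step mentioned above, here is a minimal sketch using only the standard library. The XML layout in `SAMPLE_XML` (a `conversation` element with an `id`, containing `message` elements with `author`, `time`, and `text`) and the column names in `FIELDS` are assumptions for illustration and would need to be adapted to the actual dataset files. The same `rows` list could also be passed to `pandas.DataFrame(rows)` to get the DataFrame used in preprocessing.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed layout, for illustration only; adapt to the real dataset.
SAMPLE_XML = """<conversations>
  <conversation id="c1">
    <message line="1"><author>a1</author><time>10:00</time><text>hello</text></message>
    <message line="2"><author>a2</author><time>10:01</time><text>hi, there</text></message>
  </conversation>
</conversations>"""

FIELDS = ["conversation_id", "line", "author", "time", "text"]


def xml_to_rows(xml_string):
    """Flatten the XML into one dict per message."""
    root = ET.fromstring(xml_string)
    rows = []
    for conv in root.iter("conversation"):
        for msg in conv.iter("message"):
            rows.append({
                "conversation_id": conv.get("id"),
                "line": msg.get("line"),
                "author": msg.findtext("author", default=""),
                "time": msg.findtext("time", default=""),
                "text": msg.findtext("text", default=""),
            })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text; the csv module handles quoting."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = xml_to_rows(SAMPLE_XML)
csv_text = rows_to_csv(rows)

# Round-trip check: reading the CSV back should give the same rows.
round_trip = list(csv.DictReader(io.StringIO(csv_text)))
```

Comparing `round_trip` with `rows` is a cheap way to check the "without loss of data" claim on a sample, including messages that contain commas or quotes.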
