adding predatory conversations to PAN2012 #20

Open
rezaBarzgar opened this issue Feb 8, 2023 · 6 comments

@rezaBarzgar
Member

rezaBarzgar commented Feb 8, 2023

The PAN dataset dates from 2012. We know that the predatory conversations in PAN come from perverted-justice. However, here are some conversations that were added after 2012.
We should check these conversations. If they are not in the dataset, I think it's a good idea to add them.
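
A minimal sketch of how that check could be done, under stated assumptions: each conversation is fingerprinted by hashing its normalized message texts, and only crawled conversations whose fingerprint is absent from PAN2012 are kept. The function names and the list-of-strings representation are hypothetical, and the normalization rule (lowercase, collapsed whitespace) is an assumption that may need loosening if the PAN2012 text differs from the site's text in other ways.

```python
import hashlib
import re


def fingerprint(messages):
    """Hash a conversation's message texts after light normalization."""
    # Assumption: lowercasing and collapsing whitespace is enough to match
    # the same chat log across the two sources.
    normalized = [re.sub(r"\s+", " ", m).strip().lower() for m in messages]
    joined = "\n".join(normalized)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def find_new_conversations(crawled, existing):
    """Return crawled conversations whose fingerprint is not in `existing`.

    Both arguments are iterables of conversations; each conversation is a
    list of message strings (a hypothetical in-memory representation).
    """
    seen = {fingerprint(conv) for conv in existing}
    return [conv for conv in crawled if fingerprint(conv) not in seen]
```

Exact hashing only catches identical logs after normalization; partially overlapping logs would need a fuzzier comparison.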

@rezaBarzgar rezaBarzgar added the question Further information is requested label Feb 8, 2023
@hosseinfani
Member

@rezaBarzgar
Nice catch. Do you think this would be a task for @EhsanSl, that is, writing a crawler and creating a new dataset with the same XML format as the old one?

@rezaBarzgar
Member Author

@hosseinfani
I think that's a good idea. These conversations can be useful.

@rezaBarzgar rezaBarzgar removed the question Further information is requested label Feb 8, 2023
@rezaBarzgar
Member Author

@EhsanSl
Hi Ehsan, I assigned you to this issue.

The objective is to expand the dataset with conversations that are not in PAN2012. As Dr. Fani mentioned, you need to write a crawler that extracts the new data from perverted-justice and stores it in the same XML format as PAN2012.
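
As a rough illustration of the output step, here is a sketch that serializes crawled messages to XML. It assumes the PAN2012 layout is a `conversations` root containing `conversation` elements (with an `id` attribute), each holding `message` elements (with a `line` attribute) that have `author`, `time`, and `text` children; that layout should be verified against the actual corpus files. The function name and the input shape are hypothetical.

```python
import xml.etree.ElementTree as ET


def conversations_to_xml(conversations):
    """Serialize conversations to a PAN2012-style XML string.

    `conversations` maps a conversation id to a list of
    (author, time, text) tuples; this input shape is hypothetical.
    """
    root = ET.Element("conversations")
    for conv_id, messages in conversations.items():
        conv = ET.SubElement(root, "conversation", id=conv_id)
        # Assumed convention: message lines are numbered from 1.
        for line, (author, time, text) in enumerate(messages, start=1):
            msg = ET.SubElement(conv, "message", line=str(line))
            ET.SubElement(msg, "author").text = author
            ET.SubElement(msg, "time").text = time
            ET.SubElement(msg, "text").text = text
    return ET.tostring(root, encoding="unicode")
```

Building the tree with `ElementTree` rather than string concatenation means special characters in chat text (such as `<` or `&`) are escaped automatically.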

We can have a meeting if you need any information.

Please submit your progress on this issue

@EhsanSl
Member

EhsanSl commented Feb 13, 2023

nice try guys! 🥲
image

@EhsanSl
Member

EhsanSl commented Feb 14, 2023

Hi dear Reza, I hope everything is going well, brother.
I haven't forgotten about the tasks, but frankly, I'm extremely swamped this week and have a few too many deadlines. I'll make up for it starting from the reading week. x)
 

@EhsanSl
Member

EhsanSl commented Mar 11, 2023

Hi dear Dr. Hossein and dear Reza,
I attached the crawler here: neuralcrawing_.zip. There are a few points worth mentioning:
- Since the conversations in the chat logs are not separated by distinct boxes, there was no definite way to extract each one individually (to my understanding).
- The date is sometimes placed inside the conversation div and sometimes between them, which makes extraction a bit challenging.
- The time formatting is not consistent across all the convicts.
- The crawler file is at: neuralcrawling/neuralcrawling/spiders/justice_spider.py
- To run the crawler, make sure the terminal path is 'neuralcrawling/neuralcrawling', then run the following command in the terminal: scrapy crawl jspider -o output.json
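
One possible way to handle the inconsistent time formatting, sketched under assumptions: try a list of candidate formats in order and normalize every match to a single 24-hour HH:MM form. The candidate formats below are illustrative guesses and are not taken from the site, the HH:MM target is assumed and should match whatever the PAN2012 files use, and `normalize_time` is a hypothetical helper, not part of the attached crawler.

```python
from datetime import datetime

# Assumption: these formats are illustrative guesses; extend the tuple
# after inspecting the actual chat logs.
CANDIDATE_FORMATS = ("%I:%M:%S %p", "%I:%M %p", "%H:%M:%S", "%H:%M")


def normalize_time(raw):
    """Return `raw` as 24-hour HH:MM, or None if no format matches."""
    # Drop surrounding whitespace and bracket characters, e.g. "(9:15 PM)".
    cleaned = raw.strip().strip("()[]").strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%H:%M")
        except ValueError:
            continue
    return None
```

Returning `None` for unmatched strings makes it easy to log the formats that still need to be added instead of silently writing bad timestamps.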
