Under review in ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)

MIntelligence-Group/CMFeed

CMFeed: A Benchmark Dataset for Controllable Multimodal Feedback Synthesis

Implementation for the paper submitted to ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM).
Synthesizing Sentiment-Controlled Feedback For Multimodal Text and Image Data
Puneet Kumar, Sarthak Malik, Balasubramanian Raman, and Xiaobai Li.

Code Files

The code files are currently private and will be made public after the acceptance/publication of the corresponding paper.

Dataset Details & Access

The Controllable Multimodal Feedback Synthesis (CMFeed) Dataset has been compiled by Puneet Kumar and Sarthak Malik under the supervision of Prof. Balasubramanian Raman and Prof. Xiaobai Li. It contains 61,734 samples from 3,646 posts compiled by crawling news articles from Sky News, NYDaily, FoxNews, BBC News, and BBC NW through Facebook posts. The dataset includes multiple images per sample, corresponding news text, post likes and shares, and human comments. The comments for each post have been sorted based on Facebook's 'most-relevant' criterion.

Compliance with Facebook's Terms & Conditions

This data has been collected manually from news websites and corresponding Facebook posts in the public domain, adhering to Facebook's terms and conditions. Facebook prohibits the automatic scraping of its users' personal data. In compliance with this policy, we implement the following steps while constructing the CMFeed dataset:

  1. Manual Data Crawling: All data is collected manually from Facebook, strictly avoiding any automated scraping process. This is in line with the guidelines for Facebook data scraping.
  2. Public Data Collection: We collect only publicly available data, specifically data corresponding to freely accessible news articles, following the protocols for ethically scraping Facebook data.

Steps Involved in Data Collection

The following steps outline the comprehensive process used for collecting and preparing the CMFeed dataset, ensuring the dataset is of high quality and suitable for controlled multimodal feedback synthesis.

Step 1: Manual Crawling of Facebook News Pages

  • Objective: Manually extract essential data elements from publicly available Facebook news pages.
  • Data Collected:
    • News_text: The textual content of each news post.
    • News_link: Direct URL to the full news article.
    • Post_shares: Total number of shares the post received.
    • Post_reaction: Count of reactions (like, love, etc.) on the post.
    • Comment: Text content of top-ranked comments.
    • Comment_like: Like count for each comment.
    • Comment_reaction_rank: Ranking of comments based on reaction counts.
    • Comment_link: Direct URL to specific comments, if available.
    • Comment_rank: Ranking of comments based on overall engagement and relevance.
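
The fields listed above can be pictured as one comment-level record. The following is a minimal sketch only: the field names mirror the list, but the Python types, the class name `CMFeedRecord`, and all example values are assumptions made for illustration and are not taken from the released dataset files.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class CMFeedRecord:
    """One comment-level row. Field names follow the list above; types are assumed."""
    news_text: str
    news_link: str
    post_shares: int
    post_reaction: int
    comment: str
    comment_like: int
    comment_reaction_rank: int
    comment_link: Optional[str]  # listed as "if available", so modeled as optional
    comment_rank: int


# Placeholder values only; this is not a real sample from the dataset.
example = CMFeedRecord(
    news_text="Placeholder news text",
    news_link="https://example.com/news/placeholder",
    post_shares=0,
    post_reaction=0,
    comment="Placeholder comment",
    comment_like=0,
    comment_reaction_rank=1,
    comment_link=None,
    comment_rank=1,
)

row = asdict(example)  # plain dict, convenient for writing to CSV or JSON
```

Keeping `comment_link` optional reflects the "if available" wording; the remaining fields are treated as always present, which is also an assumption.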

Step 2: Extraction of News Content via Web Browser

  • Tools Used: Selenium WebDriver, controlled manually.
  • Procedure:
    • Manually operate a web browser script to visit each News_link.
    • Manually collect all images present on the news article page for each link visited.

Step 3: Complete Data Collection from Online Sources

  • Objective: Compile all collected data, including human feedback elements.
  • Data Aggregated:
    • Combine manually gathered Facebook data with images collected from news articles to complete the dataset.

Step 4: Sentiment Analysis Using Advanced NLP Models

  • Models Utilized:
    • FLAIR
    • SentimentR
    • RoBERTa
    • DistilBERT
  • Procedure:
    • Process collected comments through the four sentiment classification models.
    • Generate binary sentiment scores (0: negative, 1: positive) and corresponding probability scores for each comment.
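
To make the per-model output concrete, here is a minimal sketch of how one model's prediction might be normalized into a binary label plus a probability. It assumes each model's raw output has already been reduced to a label string and a confidence in that label; the names `ModelVote` and `to_vote`, the choice to store the positive-class probability, and the numeric values are all hypothetical and do not describe the authors' code or any library's output format.

```python
from typing import NamedTuple


class ModelVote(NamedTuple):
    model: str
    label: int                    # 0: negative, 1: positive
    positive_probability: float   # assumed convention: probability of the positive class


def to_vote(model: str, raw_label: str, confidence: float) -> ModelVote:
    """Normalize a (label string, confidence in that label) pair into a ModelVote."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    normalized = raw_label.strip().lower()
    if normalized.startswith("pos"):
        label = 1
    elif normalized.startswith("neg"):
        label = 0
    else:
        raise ValueError(f"unrecognized sentiment label: {raw_label!r}")
    # Convert confidence in the predicted label to probability of the positive class.
    positive_probability = confidence if label == 1 else 1.0 - confidence
    return ModelVote(model, label, positive_probability)


# Made-up confidences for one comment, one entry per model named above.
votes = [
    to_vote("FLAIR", "POSITIVE", 0.98),
    to_vote("SentimentR", "positive", 0.71),
    to_vote("RoBERTa", "NEGATIVE", 0.55),
    to_vote("DistilBERT", "POSITIVE", 0.90),
]
```

Storing the positive-class probability for every model puts all four outputs on the same scale, which simplifies any later combination step.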

Step 5: Calculate Combined Sentiment Scores

  • Objective: Derive a refined sentiment classification for each comment.
  • Procedure:
    • Use weighted addition to combine binary scores and probability scores from all models, forming a final Sentiment_class and Sentiment_score for each comment.
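
The source does not state the weights or the exact formula, so the following is only one plausible reading of "weighted addition". It assumes each model contributes a binary label and a positive-class probability, uses equal weights by default, and breaks label ties with the weighted probability. The function name `combine_sentiment` and the tie-breaking rule are assumptions.

```python
from typing import Optional, Sequence, Tuple


def combine_sentiment(
    votes: Sequence[Tuple[int, float]],
    weights: Optional[Sequence[float]] = None,
) -> Tuple[int, float]:
    """Return (sentiment_class, sentiment_score) from per-model (label, p_positive) pairs."""
    if not votes:
        raise ValueError("at least one model vote is required")
    if weights is None:
        weights = [1.0] * len(votes)  # equal weights: an assumption, not from the source
    if len(weights) != len(votes):
        raise ValueError("weights and votes must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Weighted share of models voting positive, and weighted mean positive probability.
    weighted_label = sum(w * label for w, (label, _) in zip(weights, votes)) / total
    sentiment_score = sum(w * p for w, (_, p) in zip(weights, votes)) / total

    if weighted_label > 0.5:
        sentiment_class = 1
    elif weighted_label < 0.5:
        sentiment_class = 0
    else:
        # Tie between labels: fall back to the weighted probability.
        sentiment_class = 1 if sentiment_score >= 0.5 else 0
    return sentiment_class, sentiment_score


# Illustrative values only.
sentiment_class, sentiment_score = combine_sentiment(
    [(1, 0.9), (1, 0.8), (0, 0.4), (1, 0.7)]
)
```

Passing explicit `weights` lets a more trusted model count for more; with the default, every model counts equally.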

Step 6: Implement Data Filtering Strategies

  • Objective: Enhance the reliability and accuracy of the dataset.
  • Filtering Methods:
    • Model Agreement Filtering: Only retain comments where at least three out of the four models concur on the sentiment.
    • Probability Range Safety Margin: Exclude comments with sentiment probabilities between 0.49 and 0.51, indicating ambiguous or uncertain sentiment determinations.
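
The two filters above can be sketched as a single predicate. This is a sketch under stated assumptions: the 0.49 to 0.51 band is treated as inclusive, the probability being checked is taken to be the combined sentiment probability for the comment, and the function name `keep_comment` is hypothetical. The source does not specify these details.

```python
from typing import Sequence


def keep_comment(
    labels: Sequence[int],
    sentiment_probability: float,
    min_agree: int = 3,
    low: float = 0.49,
    high: float = 0.51,
) -> bool:
    """Return True if a comment passes both filters described above."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Model agreement: the majority side must have at least `min_agree` models.
    if max(positives, negatives) < min_agree:
        return False
    # Safety margin: drop probabilities inside the ambiguous band (inclusive, by assumption).
    if low <= sentiment_probability <= high:
        return False
    return True


# Illustrative inputs only: four binary labels and a combined probability.
kept = keep_comment([1, 1, 1, 0], 0.80)        # three of four agree, clear probability
dropped_split = keep_comment([1, 1, 0, 0], 0.80)  # models split two against two
dropped_band = keep_comment([1, 1, 1, 1], 0.50)   # probability inside the margin
```

Exposing `min_agree`, `low`, and `high` as parameters makes it easy to check how sensitive the retained sample count is to each threshold.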

This approach adheres strictly to manual operations to comply with data collection regulations, ensuring a robust and reliable dataset for controllable multimodal feedback synthesis.

Data Samples and Further Details

Various parameters of the CMFeed dataset have been described in the following table.

Representative samples from the CMFeed dataset are shown in the following figure, where 'Post Likes' and 'Comment Likes' show the number of likes for the post and comment, respectively. 'Share' denotes post shares, and 'Senti-class' represents the comment's sentiment (1: positive, 0: negative).

Access to the CMFeed dataset can be obtained at https://zenodo.org/records/11409612.
