Implementation for the paper submitted to ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM).
Synthesizing Sentiment-Controlled Feedback For Multimodal Text and Image Data
Puneet Kumar, Sarthak Malik, Balasubramanian Raman, and Xiaobai Li.
The code files are currently private and will be made public after the acceptance/publication of the corresponding paper.
The Controllable Multimodal Feedback Synthesis (CMFeed) Dataset has been compiled by Puneet Kumar and Sarthak Malik under the supervision of Prof. Balasubramanian Raman and Prof. Xiaobai Li. It contains 61,734 samples from 3,646 posts compiled by crawling news articles from Sky News, NYDaily, FoxNews, BBC News, and BBC NW through Facebook posts. The dataset includes multiple images per sample, corresponding news text, post likes and shares, and human comments. The comments for each post have been sorted based on Facebook's 'most-relevant' criterion.
This data has been collected manually from news websites and corresponding Facebook posts in the public domain, adhering to Facebook's terms and conditions. Facebook prohibits the automatic scraping of its users' personal data. In compliance with this policy, we implement the following steps while constructing the CMFeed dataset:
- Manual Data Crawling: All data is collected manually from Facebook, strictly avoiding any automated scraping process. This is in line with the guidelines for Facebook data scraping.
- Public Data Collection: We collect only publicly available data, specifically data corresponding to freely accessible news articles, following the protocols for ethically scraping Facebook data.
The following steps outline the comprehensive process used for collecting and preparing the CMFeed dataset, ensuring the dataset is of high quality and suitable for controlled multimodal feedback synthesis.
- Objective: Manually extract essential data elements from publicly available Facebook news pages.
- Data Collected:
- News_text: The textual content of each news post.
- News_link: Direct URL to the full news article.
- Post_shares: Total number of shares the post received.
- Post_reaction: Count of reactions (like, love, etc.) on the post.
- Comment: Text content of top-ranked comments.
- Comment_like: Like count for each comment.
- Comment_reaction_rank: Ranking of comments based on reaction counts.
- Comment_link: Direct URL to specific comments, if available.
- Comment_rank: Ranking of comments based on overall engagement and relevance.
- Tools Used: Selenium WebDriver, controlled manually.
- Procedure:
- Manually operate a web browser script to visit each News_link.
- Manually collect all images present on the news article page for each link visited.
- Objective: Compile all collected data, including human feedback elements.
- Data Aggregated:
- Combine manually gathered Facebook data with images collected from news articles to complete the dataset.
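The aggregation step above can be pictured as a join on News_link. The following is only a minimal sketch under stated assumptions, not the authors' code (which is not yet public): the function name `attach_images`, the `Images` key, and the in-memory dictionary layout are all hypothetical; only the field name News_link comes from the list of collected data.

```python
from typing import Dict, List


def attach_images(posts: List[dict], images_by_link: Dict[str, List[str]]) -> List[dict]:
    """Return copies of the post records with their article images attached.

    `posts` holds the manually gathered Facebook records, each with a
    "News_link" field. `images_by_link` maps a news link to the image file
    paths collected from that article. Both layouts are assumptions.
    """
    merged = []
    for post in posts:
        record = dict(post)  # copy so the input records are left untouched
        # "Images" is a hypothetical field name; posts without images get [].
        record["Images"] = list(images_by_link.get(post["News_link"], []))
        merged.append(record)
    return merged
```

Keeping the images as a list per record reflects the fact that the dataset includes multiple images per sample.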
- Models Utilized:
- FLAIR
- SentimentR
- RoBERTa
- DistilBERT
- Procedure:
- Process collected comments through the four sentiment classification models.
- Generate binary sentiment scores (0: negative, 1: positive) and corresponding probability scores for each comment.
- Objective: Derive a refined sentiment classification for each comment.
- Procedure:
- Use weighted addition to combine binary scores and probability scores from all models, forming a final Sentiment_class and Sentiment_score for each comment.
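One way the weighted addition could look is sketched below. The exact weights and formula are not stated here, so everything specific is an assumption: equal model weights by default, a hypothetical `label_share` parameter that balances the binary votes against the probabilities, probabilities interpreted as the probability of the positive class, and a 0.5 threshold for the final class.

```python
from typing import Optional, Sequence, Tuple


def combine_sentiment(
    labels: Sequence[int],
    probs_positive: Sequence[float],
    weights: Optional[Sequence[float]] = None,
    label_share: float = 0.5,
) -> Tuple[int, float]:
    """Combine per-model outputs into (Sentiment_class, Sentiment_score).

    labels          -- binary prediction per model (0: negative, 1: positive)
    probs_positive  -- assumed probability of the positive class per model
    weights         -- per-model weights (assumption: equal if omitted)
    label_share     -- hypothetical share given to the binary votes
    """
    if not labels or len(labels) != len(probs_positive):
        raise ValueError("labels and probs_positive must be non-empty and equal in length")
    if weights is None:
        weights = [1.0] * len(labels)
    if len(weights) != len(labels):
        raise ValueError("one weight is required per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Weighted average of the binary votes and of the probabilities.
    vote = sum(w * l for w, l in zip(weights, labels)) / total
    prob = sum(w * p for w, p in zip(weights, probs_positive)) / total

    # Weighted addition of the two averages gives a score in [0, 1].
    score = label_share * vote + (1.0 - label_share) * prob
    sentiment_class = 1 if score >= 0.5 else 0
    return sentiment_class, score
```

For example, with four models voting `[1, 1, 0, 1]` and positive-class probabilities `[0.9, 0.8, 0.4, 0.7]`, this sketch yields class 1 with a score of about 0.725 under the default equal weights.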
- Objective: Enhance the reliability and accuracy of the dataset.
- Filtering Methods:
- Model Agreement Filtering: Only retain comments where at least three out of the four models concur on the sentiment.
- Probability Range Safety Margin: Exclude comments with sentiment probabilities between 0.49 and 0.51, indicating ambiguous or uncertain sentiment determinations.
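The two filters above can be expressed as a single predicate. This is a hedged sketch rather than the released implementation: the function name `keep_comment` is hypothetical, the safety margin is assumed to apply to the final Sentiment_score, and the 0.49 and 0.51 bounds are assumed to be inclusive.

```python
from typing import Sequence


def keep_comment(
    labels: Sequence[int],
    sentiment_score: float,
    min_agreement: int = 3,
    low: float = 0.49,
    high: float = 0.51,
) -> bool:
    """Return True if a comment passes both filtering rules.

    labels          -- binary prediction per model (0: negative, 1: positive)
    sentiment_score -- assumed to be the final combined score in [0, 1]
    """
    positives = sum(labels)
    # Size of the larger camp, whichever sentiment it supports.
    agreement = max(positives, len(labels) - positives)

    # Model agreement filtering: at least three of the four models concur.
    if agreement < min_agreement:
        return False

    # Probability range safety margin: drop ambiguous scores near 0.5.
    if low <= sentiment_score <= high:
        return False

    return True
```

With four models, a 2-2 split is rejected by the first rule regardless of the score, while a unanimous comment with a score of exactly 0.5 is rejected by the second.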
This approach adheres strictly to manual operations to comply with data collection regulations, ensuring a robust and reliable dataset for controllable multimodal feedback synthesis.
Various parameters of the CMFeed dataset have been described in the following table.
Representative samples from the CMFeed dataset are shown in the following figure, where 'Post Likes' and 'Comment Likes' show the number of likes for the post and comment, respectively, 'Share' denotes post shares, and Senti-class represents the comment's sentiment (1: positive, 0: negative).
Access to the CMFeed dataset can be obtained at https://zenodo.org/records/11409612.