Open-Brand: Open Brand Value Extraction from Product Descriptions

The dataset contains over 250k product brand-value annotations with more than 50k unique values across eight main categories of Amazon product profiles. For more details, refer to our ACL-ECNLP 2022 paper.

The dataset is in JSON Lines format, with the following schema:

{
    "asin": <product unique id>,
    "category": <category name>,
    "description": <product description text>,
    "brand": <brand value inside description text>,
    "tag": [list of NER tags for the description tokens, e.g. "O", "B-Brand", "I-Brand"]
}

The 'asin' product ID matches the one used in the All_Amazon_Meta.json file of the Amazon Review Data (2018). To obtain the full version of the dataset, see "Obtaining the Dataset" below.

Each JSON object describes one product whose description is tagged with the relevant brand. For example:

{
    "asin": "B07ZNDB9RG",
    "category": "Toys and Games",
    "description": "Wooden Stacking Board Games 54 Pieces for Kids Adult and Families, Gentle Monster Wooden Blocks Toys for Toddlers, Colored Building Blocks - 6 Colors 2 Dice",
    "brand": "Gentle Monster",
    "tag": ["O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "B-Brand", "I-Brand", "O", "O", "O", "O", "O",...]
}
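
The `tag` list can be lined up with the description to recover the brand span. The sketch below assumes the description is tokenized on whitespace, which is consistent with the example above (11 `O` tags precede `B-Brand`) but is not stated in this README. `extract_brand` is a hypothetical helper name, not part of the repo.

```python
def extract_brand(description, tags):
    """Return the first span marked B-Brand/I-Brand, or None if there is none.

    Assumption: tokens are obtained by splitting the description on whitespace.
    """
    tokens = description.split()
    span = []
    for token, tag in zip(tokens, tags):
        if tag == "B-Brand":
            if span:  # a second span starts: keep only the first one
                break
            span = [token]
        elif tag == "I-Brand" and span:
            span.append(token)
        elif span:  # first non-brand tag after the span ends it
            break
    return " ".join(span) if span else None


# Shortened record for illustration: the first 13 tokens of the example above.
description = ("Wooden Stacking Board Games 54 Pieces for Kids Adult and "
               "Families, Gentle Monster")
tags = ["O"] * 11 + ["B-Brand", "I-Brand"]
print(extract_brand(description, tags))  # Gentle Monster
```

Comparing the returned span with the record's `brand` field is a quick consistency check when loading the data.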

Data Description

The following table shows the number of samples per category in the dataset:

Category	Number of Samples
Automotive	66837
Cell Phones and Accessories	78564
Clothing, Shoes, and Jewelry	85068
Electronics	47870
Grocery and Gourmet Food	22397
Pet Supplies	10868
Sports and Outdoors	54214
Toys and Games	63304

Figure: distribution of the number of tokens that constitute a brand in the main dataset.
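
The brand-length distribution can also be recomputed from the JSON Lines data. This is a minimal sketch, assuming a brand's token count is the number of whitespace-separated words in the `brand` field; `brand_length_counts` is a hypothetical helper, and the sample records are made up for illustration.

```python
import json
from collections import Counter


def brand_length_counts(lines):
    """Count records by the number of whitespace-separated tokens in 'brand'."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        counts[len(record["brand"].split())] += 1
    return counts


# Made-up records for illustration only (not taken from the dataset).
sample = [
    '{"asin": "X1", "brand": "Gentle Monster"}',
    '{"asin": "X2", "brand": "Acme"}',
    '{"asin": "X3", "brand": "Acme Tools"}',
]
counts = brand_length_counts(sample)
for n_tokens in sorted(counts):
    print(n_tokens, counts[n_tokens])
```

Because the function accepts any iterable of lines, an open file handle on one of the produced `.jsonl` files can be passed in place of `sample`.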

Obtaining the Dataset

1. Clone the repo.
2. Gain access to the Amazon Review Data (2018) and download the All_Amazon_Meta.json file into the folder of this repo.
3. Run the script:
```
bash ./extract_data.sh
```

The above shell script runs the Python script extract_data.py, which uses an Apache Beam pipeline to process the data.

The full datasets are written to the datasets/ folder as JSON Lines files. The produced datasets are az_base_dataset.jsonl and az_new_cat_dataset.jsonl.

Reading the Data in DataFrames

A simple script to read any of the produced JSON Lines files in the datasets/ directory into a pandas DataFrame:

import json
import pandas as pd

def getDF(path):
    # Parse one JSON object per line, keyed by its line index
    rows = {}
    with open(path) as f:
        for i, line in enumerate(f):
            rows[i] = json.loads(line)
    # Each dictionary entry becomes one DataFrame row
    return pd.DataFrame.from_dict(rows, orient='index')

df = getDF('datasets/az_base_dataset.jsonl')
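
For large files, or when only one category is needed, the records can be streamed instead of loaded all at once. The following is a sketch using only the standard library; `iter_records` and its `category` parameter are hypothetical names introduced here, not part of the repo. pandas can also read JSON Lines directly with `pd.read_json(path, lines=True)`.

```python
import json


def iter_records(path, category=None):
    """Yield one record (dict) per non-empty line, optionally filtered by category."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            record = json.loads(line)
            if category is None or record.get("category") == category:
                yield record


# Example usage (assumes the file has been produced by the extraction step):
# toys = list(iter_records("datasets/az_base_dataset.jsonl", category="Toys and Games"))
```

Since the function is a generator, only one record is held in memory at a time unless the caller collects the results into a list.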

If you found this work useful or used the datasets, please cite it as follows:

@inproceedings{sabeh-etal-2022-openbrand,
    title = "{O}pen{B}rand: Open Brand Value Extraction from Product Descriptions",
    author = "Sabeh, Kassem  and
      Kacimi, Mouna  and
      Gamper, Johann",
    booktitle = "Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.ecnlp-1.19",
    doi = "10.18653/v1/2022.ecnlp-1.19",
    pages = "161--170",
}
