Duplicate text profiles from ddprofiler #76

Open
snowgy opened this issue May 10, 2024 · 0 comments · May be fixed by #77

snowgy commented May 10, 2024

The text profiles produced by ddprofiler contain duplicate column profiles, so dindex_builder needs extra time and disk space to build the full-text search index.

Reproduce this issue:

  1. Download the Chicago open data: https://uchicago.box.com/s/ecmb69h874qwedj19ebncvu0qvd4n97h
  2. Follow the quick start guide to index the data.
  3. Inspect the files under output_profiles_json/text.

For example, in 0.csv the month_name column of x2vd-qke7.csv is profiled twice:

"1507119095","demo","/Users/yuegong/Desktop/chicago_open_data_all_tbls/","x2vd-qke7.csv","month_name","JUNE MAY OCTOBER AUGUST JULY SEPTEMBER NOVEMBER APRIL"
"1507119095","demo","/Users/yuegong/Desktop/chicago_open_data_all_tbls/","x2vd-qke7.csv","month_name","JUNE MAY OCTOBER AUGUST JULY SEPTEMBER NOVEMBER APRIL"

Since dindex_builder reads the text profiles to build the full-text search index, each duplicate row is indexed again, which increases both indexing time and index size.
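
To check a text profile file for this, something like the following rough sketch may help. It is not part of ddprofiler or dindex_builder; `find_duplicate_profiles` is a hypothetical helper, and it assumes the first five fields of each row identify one column profile, which is only inferred from the sample rows above.

```python
import csv
import io
from collections import Counter


def find_duplicate_profiles(lines):
    """Count text-profile rows whose identifying fields repeat.

    Assumption: the first five fields of a row identify one column
    profile (the last two of them being source file and column name).
    This layout is guessed from the sample rows in this issue, not
    taken from ddprofiler documentation.
    """
    counts = Counter(tuple(row[:5]) for row in csv.reader(lines) if row)
    return {key: n for key, n in counts.items() if n > 1}


# The row quoted in this issue, repeated twice as in 0.csv.
row = (
    '"1507119095","demo",'
    '"/Users/yuegong/Desktop/chicago_open_data_all_tbls/",'
    '"x2vd-qke7.csv","month_name",'
    '"JUNE MAY OCTOBER AUGUST JULY SEPTEMBER NOVEMBER APRIL"'
)
sample = io.StringIO(row + "\n" + row + "\n")

duplicates = find_duplicate_profiles(sample)
for key, n in duplicates.items():
    # key[3] is the source file, key[4] the column name (assumed layout)
    print(f"{key[3]}:{key[4]} appears {n} times")
```

To run it against a real file, pass an open file handle instead of the in-memory sample, for example `open("output_profiles_json/text/0.csv", newline="")`.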

@snowgy snowgy linked a pull request May 10, 2024 that will close this issue