
New db to speed up full text queries and library updates #114

Open · wants to merge 13 commits into master
Conversation

dougy83 (Contributor) commented Feb 19, 2024

All the .json.gz and .json.stock files have been processed into three compressed tables/files, downloadable as a single 20 MB tar archive; these tables are copied directly into IndexedDB without decompression. The database update now takes ~150 ms plus the time to download the 20 MB file.
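The idea above (several independently compressed tables bundled into one downloadable archive) can be sketched in Python. This is a minimal illustration, not the PR's actual schema: the table names, record fields, and file names are all hypothetical.

```python
import gzip
import io
import json
import tarfile

def pack_tables(tables, archive_path):
    """Bundle several pre-compressed 'tables' into one tar archive, so a
    client can download a single file and store each compressed blob
    directly (e.g. into IndexedDB) without decompressing it first."""
    with tarfile.open(archive_path, "w") as tar:
        for name, records in tables.items():
            # one newline-delimited JSON document per table, gzipped whole
            payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
            blob = gzip.compress(payload)
            info = tarfile.TarInfo(name=f"{name}.jsonlines.gz")
            info.size = len(blob)
            tar.addfile(info, io.BytesIO(blob))

# Illustrative tables, not the PR's real layout
pack_tables({
    "components": [{"lcsc": "C1525", "mfr": "0603WAF1002T5E"}],
    "stock": [{"lcsc": "C1525", "stock": 100000}],
}, "db.tar")
```

Note the tar container itself is left uncompressed here, since each member is already gzipped.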

Full-text search queries on IndexedDB previously took around 15 seconds for "Select". The same queries on this new db take around 2 seconds, and act directly on the compressed tables. Note: the query itself takes ~1.5 seconds; an extra variable amount of time is spent yielding to the UI so that the query can be aborted.

The code to process the .json.gz/.stock.json files is in JavaScript, and takes around 15 seconds to complete.

I'm currently using gzip for compression. I tried lz4, which was 3x faster, until I changed the code to allow streaming decompression. It could be worth revisiting in the future, once lz4 is natively supported by browsers.
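For context on why streaming support matters here, a minimal Python sketch using zlib's incremental decompressor: records can be parsed as each compressed chunk arrives, instead of inflating the whole payload into memory first. The newline-delimited record format below is illustrative, not the PR's actual one.

```python
import gzip
import json
import zlib

def iter_records(gz_bytes, chunk_size=4096):
    """Yield JSON records from gzip data, decompressing chunk by chunk."""
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # 16 = expect gzip framing
    buf = b""
    for ofs in range(0, len(gz_bytes), chunk_size):
        buf += d.decompress(gz_bytes[ofs:ofs + chunk_size])
        # emit every complete line; keep the trailing partial line in buf
        *lines, buf = buf.split(b"\n")
        for line in lines:
            if line:
                yield json.loads(line)
    buf += d.flush()
    if buf.strip():
        yield json.loads(buf)

payload = b"\n".join(json.dumps({"i": i}).encode("utf-8") for i in range(1000))
records = list(iter_records(gzip.compress(payload)))
```

Browsers expose the same pattern natively for gzip/deflate via `DecompressionStream`, which is what makes gzip convenient here; lz4 has no equivalent built-in yet.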

dougy83 marked this pull request as ready for review February 23, 2024 14:10
yaqwsx (Owner) commented Feb 24, 2024

First, let me say a huge thank you for the work you are putting into this and the other related PR. I really like it and appreciate it.

I have, however, one suggestion to make: I suggest removing [generateJsonlinesDatabaseFiles.js](https://github.com/yaqwsx/jlcparts/pull/114/files#diff-4afa54d033975e4da9f76722851b166de044d8bca7ebfcd6d9221d0113600563), and also removing the old JSON files. Having them is just an extra burden and inefficiency, and overall messy IMHO (why keep something we don't need?). Also, having data processing in two languages is an extra maintenance burden.

We have a SQLite DB that stores all the components in a raw format. From that, we generate the per-category JSON files. We already have a library abstraction that hides the DB and lets you operate in "part" and "category" terms. The whole generator is:

```python
lib = PartLibraryDb(library)
Path(outdir).mkdir(parents=True, exist_ok=True)
clearDir(outdir)
total = lib.countCategories()
categoryIndex = {}
params = []
for (catName, subcategories) in lib.categories().items():
    for subcatName in subcategories:
        params.append(MapCategoryParams(
            libraryPath=library, outdir=outdir, ignoreoldstock=ignoreoldstock,
            catName=catName, subcatName=subcatName))
with multiprocessing.Pool(jobs or multiprocessing.cpu_count()) as pool:
    for i, result in enumerate(pool.imap_unordered(_map_category, params)):
        if result is None:
            continue
        catName, subcatName = result["catName"], result["subcatName"]
        print(f"{((i) / total * 100):.2f} % {catName}: {subcatName}")
        if catName not in categoryIndex:
            categoryIndex[catName] = {}
        assert subcatName not in categoryIndex[catName]
        categoryIndex[catName][subcatName] = {
            "sourcename": result["sourcename"],
            "datahash": result["datahash"],
            "stockhash": result["stockhash"]
        }
index = {
    "categories": categoryIndex,
    "created": datetime.datetime.now().astimezone().replace(microsecond=0).isoformat()
}
saveJson(index, os.path.join(outdir, "index.json"), hash=True)
```

I have two suggestions:

  • let's generate whatever format you need in Python by rewriting the datafile generation. This is the preferred option.
  • let's discard the original datatables and let your script operate on top of the SQLite (not so preferred).

What do you think?

dougy83 (Contributor, Author) commented Feb 24, 2024

> First, let me say a huge thank you for the work you are putting into this and the other related PR. I really like it and appreciate it.

Most welcome. I have been using your page quite often recently, and I love it, so thank you for creating it.

> I have, however, one suggestion to make: I suggest removing [generateJsonlinesDatabaseFiles.js](https://github.com/yaqwsx/jlcparts/pull/114/files#diff-4afa54d033975e4da9f76722851b166de044d8bca7ebfcd6d9221d0113600563), and also removing the old JSON files. Having them is just an extra burden and inefficiency, and overall messy IMHO (why keep something we don't need?). Also, having data processing in two languages is an extra maintenance burden.

If we remove the generateJsonlinesDatabaseFiles.js file, and put the processing in the python script, then there's no need for the original subcategory JSON files to be created. If they're not used by anything else, then of course, they can go.

I wrote it in JS as that's easier for me; I don't program in Python. I can add it to the Python script; it shouldn't be too complex.


> I have two suggestions:
>
>   • let's generate whatever format you need in Python by rewriting the datafile generation. This is the preferred option.
>   • let's discard the original datatables and let your script operate on top of the SQLite (not so preferred).
>
> What do you think?

I think the first option is best (i.e. change the output format of the generator), as it's a simple change and leverages your existing code, which has proven to work and handles a bunch of edge cases.

dougy83 (Contributor, Author) commented Feb 24, 2024

The Python script has been updated to generate the single output file, without creating the *.json.gz and *.stock.json files, and the JavaScript file has been nuked.
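A hedged sketch of what "generate the single output file" can look like on the Python side: instead of emitting one JSON file per subcategory, every subcategory's rows are streamed into one compressed jsonlines file. The function name, category names, and row fields below are illustrative; the PR's real generator works from the SQLite library abstraction.

```python
import gzip
import json

def write_single_output(categories, path):
    """Stream all categories' rows into one gzip-compressed jsonlines
    file, replacing the per-subcategory *.json.gz outputs."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for cat, rows in categories.items():
            for row in rows:
                # tag each record with its category so per-category
                # files are no longer needed
                out.write(json.dumps({"category": cat, **row}) + "\n")

write_single_output(
    {"Resistors": [{"lcsc": "C1525"}], "Capacitors": [{"lcsc": "C2012"}]},
    "all.jsonlines.gz",
)
```

Writing through `gzip.open` in text mode keeps memory usage flat: rows are compressed as they are written, so the generator never holds the whole dataset in memory.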

dougy83 (Contributor, Author) commented Mar 15, 2024

@yaqwsx Is there anything further I should do with this PR (or the others)?

yaqwsx (Owner) commented Mar 16, 2024

It's my turn; I'll review them ASAP. I just ran out of the time allocated for OSS project maintenance, as there was a lot of work on KiKit with the recent release of KiCAD v8.

dougy83 (Contributor, Author) commented Mar 16, 2024

Ok, no worries. I'm not trying to push you (I can use my test site), I just wasn't sure if I'd done it wrong.

I did notice that the auto-merge of the different PRs produces bad merges (they touch similar parts of the same files). If you can say which PRs you want to include, I can rebase them on each other sequentially and correct the bad merges.
