
New db to speed up full text queries and library updates #114

Open · wants to merge 13 commits into master
Conversation

dougy83 (Contributor) commented Feb 19, 2024

All the .json.gz and .json.stock files have been processed into three compressed tables/files, downloadable as a single 20 MB tar archive; these tables are copied directly into IndexedDB without decompression. The database update now takes ~150 ms plus the time to download the 20 MB file.
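The idea above (several independently compressed tables bundled into one downloadable archive) can be sketched in Python. This is a minimal illustration, not the PR's actual schema: the table names, record fields, and file names are all hypothetical.

```python
import gzip
import io
import json
import tarfile

def pack_tables(tables, archive_path):
    """Bundle several pre-compressed 'tables' into one tar archive, so a
    client can download a single file and store each compressed blob
    directly (e.g. into IndexedDB) without decompressing it first."""
    with tarfile.open(archive_path, "w") as tar:
        for name, records in tables.items():
            # one newline-delimited JSON document per table, gzipped whole
            payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
            blob = gzip.compress(payload)
            info = tarfile.TarInfo(name=f"{name}.jsonlines.gz")
            info.size = len(blob)
            tar.addfile(info, io.BytesIO(blob))

# Illustrative tables, not the PR's real layout
pack_tables({
    "components": [{"lcsc": "C1525", "mfr": "0603WAF1002T5E"}],
    "stock": [{"lcsc": "C1525", "stock": 100000}],
}, "db.tar")
```

Note the tar container itself is left uncompressed here, since each member is already gzipped.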

Full-text search queries on IndexedDB previously took around 15 seconds for "Select". The same queries on this new db take around 2 seconds, and act directly on the compressed tables. Note: the query itself takes ~1.5 seconds; an extra variable amount of time is spent yielding to the UI so that the query can be aborted.

The code to process the .json.gz/.stock.json files is in JavaScript, and takes around 15 seconds to complete.

I'm currently using gzip for compression. I tried lz4, which was 3x faster, until I changed the code to allow streaming decompression. It could be worth revisiting in the future, once lz4 is natively supported by browsers.
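For context on why streaming support matters here, a minimal Python sketch using zlib's incremental decompressor: records can be parsed as each compressed chunk arrives, instead of inflating the whole payload into memory first. The newline-delimited record format below is illustrative, not the PR's actual one.

```python
import gzip
import json
import zlib

def iter_records(gz_bytes, chunk_size=4096):
    """Yield JSON records from gzip data, decompressing chunk by chunk."""
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # 16 = expect gzip framing
    buf = b""
    for ofs in range(0, len(gz_bytes), chunk_size):
        buf += d.decompress(gz_bytes[ofs:ofs + chunk_size])
        # emit every complete line; keep the trailing partial line in buf
        *lines, buf = buf.split(b"\n")
        for line in lines:
            if line:
                yield json.loads(line)
    buf += d.flush()
    if buf.strip():
        yield json.loads(buf)

payload = b"\n".join(json.dumps({"i": i}).encode("utf-8") for i in range(1000))
records = list(iter_records(gzip.compress(payload)))
```

Browsers expose the same pattern natively for gzip/deflate via `DecompressionStream`, which is what makes gzip convenient here; lz4 has no equivalent built-in yet.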

dougy83 marked this pull request as ready for review February 23, 2024 14:10
yaqwsx (Owner) commented Feb 24, 2024

First, let me say a huge thank you for the work you are putting into this and the other related PR. I really like it and appreciate it.

I have, however, one suggestion to make: I suggest removing [generateJsonlinesDatabaseFiles.js](https://github.com/yaqwsx/jlcparts/pull/114/files#diff-4afa54d033975e4da9f76722851b166de044d8bca7ebfcd6d9221d0113600563), and also removing the old JSON files. Having them is just an extra burden and inefficiency, and overall messy IMHO (why keep something we don't need?). Also, having data processing in two languages is an extra maintenance burden.

We have a SQLite DB that stores all the components in a raw format. From that, we generate the per-category JSON files. We already have a library abstraction that hides the DB and lets you operate in "part" and "category" terms. The whole generator is:

```python
lib = PartLibraryDb(library)
Path(outdir).mkdir(parents=True, exist_ok=True)
clearDir(outdir)
total = lib.countCategories()
categoryIndex = {}
params = []
for (catName, subcategories) in lib.categories().items():
    for subcatName in subcategories:
        params.append(MapCategoryParams(
            libraryPath=library, outdir=outdir, ignoreoldstock=ignoreoldstock,
            catName=catName, subcatName=subcatName))
with multiprocessing.Pool(jobs or multiprocessing.cpu_count()) as pool:
    for i, result in enumerate(pool.imap_unordered(_map_category, params)):
        if result is None:
            continue
        catName, subcatName = result["catName"], result["subcatName"]
        print(f"{((i) / total * 100):.2f} % {catName}: {subcatName}")
        if catName not in categoryIndex:
            categoryIndex[catName] = {}
        assert subcatName not in categoryIndex[catName]
        categoryIndex[catName][subcatName] = {
            "sourcename": result["sourcename"],
            "datahash": result["datahash"],
            "stockhash": result["stockhash"]
        }
index = {
    "categories": categoryIndex,
    "created": datetime.datetime.now().astimezone().replace(microsecond=0).isoformat()
}
saveJson(index, os.path.join(outdir, "index.json"), hash=True)
```

I have two suggestions:

  • let's generate whatever format you need in Python by rewriting the datafile generation. This is the preferred option.
  • let's discard the original datatables and let your script operate on top of the SQLite (not so preferred).

What do you think?

dougy83 (Contributor, Author) commented Feb 24, 2024

> First, let me say a huge thank you for the work you are putting into this and the other related PR. I really like it and appreciate it.

Most welcome. I have been using your page quite often recently, and I love it, so thank you for creating it.

> I have, however, one suggestion to make: I suggest removing [generateJsonlinesDatabaseFiles.js](https://github.com/yaqwsx/jlcparts/pull/114/files#diff-4afa54d033975e4da9f76722851b166de044d8bca7ebfcd6d9221d0113600563), and also removing the old JSON files. Having them is just an extra burden and inefficiency, and overall messy IMHO (why keep something we don't need?). Also, having data processing in two languages is an extra maintenance burden.

If we remove the generateJsonlinesDatabaseFiles.js file, and put the processing in the python script, then there's no need for the original subcategory JSON files to be created. If they're not used by anything else, then of course, they can go.

I wrote it in JS as that's easier for me; I don't program in Python. I can add it to the Python script; it shouldn't be too complex.


> I have two suggestions:
>
>   • let's generate whatever format you need in Python by rewriting the datafile generation. This is the preferred option.
>   • let's discard the original datatables and let your script operate on top of the SQLite (not so preferred).
>
> What do you think?

I think the first option is best (i.e. change the output format of the generator), as it's a simple change and leverages your existing code, which has proven to work and handles a bunch of edge cases.

dougy83 (Contributor, Author) commented Feb 24, 2024

The Python script has been updated to generate the single output file, without creating the *.json.gz and *.stock.json files, and the JavaScript file has been nuked.
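A hedged sketch of what "generate the single output file" can look like on the Python side: instead of emitting one JSON file per subcategory, every subcategory's rows are streamed into one compressed jsonlines file. The function name, category names, and row fields below are illustrative; the PR's real generator works from the SQLite library abstraction.

```python
import gzip
import json

def write_single_output(categories, path):
    """Stream all categories' rows into one gzip-compressed jsonlines
    file, replacing the per-subcategory *.json.gz outputs."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for cat, rows in categories.items():
            for row in rows:
                # tag each record with its category so per-category
                # files are no longer needed
                out.write(json.dumps({"category": cat, **row}) + "\n")

write_single_output(
    {"Resistors": [{"lcsc": "C1525"}], "Capacitors": [{"lcsc": "C2012"}]},
    "all.jsonlines.gz",
)
```

Writing through `gzip.open` in text mode keeps memory usage flat: rows are compressed as they are written, so the generator never holds the whole dataset in memory.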

dougy83 (Contributor, Author) commented Mar 15, 2024

@yaqwsx Is there anything further I should do with this PR (or the others)?

yaqwsx (Owner) commented Mar 16, 2024

It's my turn; I'll review them ASAP. I just ran out of the time allocated for OSS project maintenance, as there was a lot of work on KiKit with the recent release of KiCAD v8.

dougy83 (Contributor, Author) commented Mar 16, 2024

Ok, no worries. I'm not trying to push you (I can use my test site), I just wasn't sure if I'd done it wrong.

I did notice that the auto-merge of the different PRs produces bad merges (they touch similar parts of the same files). If you can say which PRs you want to include, I can rebase them on each other sequentially and correct the bad merges.
