Added db.select_from_TABLE methods #1828
Conversation
I was going to not add the …
@Nick-Hall and other reviewers, I've been thinking for a very long time about how to add a select-style method to Gramps while keeping a Python interface. This PR represents the best that I can come up with. Note that the syntax for the … The strings are parsed by Python into an Abstract Syntax Tree (AST) that is then used to generate the SQL syntax. I wrote the Evaluator with different DB engines in mind, in case they use different syntax for JSON extraction, etc. The code is fairly minimal and low in complexity, to make it easy to maintain and extend. Let me know if you have concerns or ideas for improvement.
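(Not code from this PR: a minimal sketch, assuming a json_data column holding the JSON row and single-level attributes only, of how a Python comparison string can be parsed with the ast module and turned into an SQLite-style json_extract clause.)

```python
import ast

def to_sql(expr: str) -> str:
    """Convert a Python comparison like "person.gender == 1" into SQL."""
    tree = ast.parse(expr, mode="eval")
    return walk(tree.body)

def walk(node):
    if isinstance(node, ast.Compare):
        ops = {ast.Eq: "=", ast.NotEq: "!=", ast.Lt: "<", ast.Gt: ">"}
        left = walk(node.left)
        right = walk(node.comparators[0])
        return f"{left} {ops[type(node.ops[0])]} {right}"
    if isinstance(node, ast.Attribute):
        # person.gender -> extract the field from the stored JSON row;
        # other DB engines could emit different extraction syntax here.
        return f"json_extract(json_data, '$.{node.attr}')"
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise ValueError(f"unsupported expression: {ast.dump(node)}")

print(to_sql("person.gender == 1"))
# -> json_extract(json_data, '$.gender') = 1
```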
@DavidMStraub this could serve as a replacement in gramps-web for both gramps-ql and object-ql, since it is converted into SQL. (It doesn't yet allow everything that the others do, though.)
One thing that I realize this doesn't respect is filters and proxies. But I think that can be fixed. Some options: …
Other ideas?
@Nick-Hall, actually, I'm realizing that we have a bigger issue: if you have a proxy/filter in place, then you might not be able to access all of the items in the JSON data. That means that person_data.family_list != person_object.family_list if a family does not appear in the filter/proxy. It could be that if we have a filter or proxy, we must force the DataDict to generate the object through methods like …
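(A hypothetical illustration of the mismatch, assuming a proxy database that hides some families; the get_raw_* / get_*_from_handle access style follows the existing API, but the scenario itself is illustrative.)

```python
# Illustrative only: with a proxy/filter in place, the raw JSON data
# still lists every family handle, while the proxy-built object may
# have some removed, so the two views of family_list can disagree.
person_data = db.get_raw_person_data(handle)             # unfiltered JSON row
person_object = proxy_db.get_person_from_handle(handle)  # filtered object

print(person_data["family_list"])   # all families
print(person_object.family_list)    # possibly fewer families
```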
If I may ... I think it's really great that so much refactoring and improvement is happening, but I find it a bit strange that so many things are merged so quickly without (sorry, at least my impression) considering all the implications (I was triggered by the example with proxies and filters), while at the same time my simple PR, which does nothing but enable static type checking, has been open for half a year. Static type checking would make the refactoring less dangerous.
@DavidMStraub, nothing has been merged yet that has any effect on the implications I have raised above. The implication is for the things being considered for merging. It would be great if we had more developers (like yourself) who would be able to comment on such implications. So, no, things aren't being merged "too quickly" and without thinking about consequences. Working on what is next gives us insight into complex issues. So no need to get triggered by such a realization. Regarding type checking: yes, I would have merged that PR many months ago because I am very familiar with the benefits of typing, and realize there are no downsides. But also, the implication above is the realization that a "type" (e.g., …) In any event, we need to refactor this PR, and the filter refactor PR. And probably adjust the …
One of your optimisations is to keep the data in a …
@dsblank This PR reminds me of the db.collection.find method in MongoDB. It may be worth a quick look if you are unfamiliar with it. You may get some ideas. I like how you have made the query pythonic. This is better than previous SQL-like designs and the JSON queries of MongoDB.

@DavidMStraub We seem to have been discussing this on and off for about 7 or 8 years now, so I don't think that the progress is too fast. There have also been a couple of prototypes. The static type checking PR makes changes to 51 files. I tend to leave this type of change until fairly close to release in order to avoid potential conflicts when merging up fixes from the maintenance branch. Also, the smaller changes tend to be easier to fit in when I have time available. Your PR is on my schedule, though.

@stevenyoungs Yes. Proxies are mainly used in the report and export code. I don't mind if these are not optimised to use the new code, but we must make sure that they don't run significantly slower than at present. Some people already have to wait a long time for certain reports to run. I don't regard this PR as essential for the next release, but it may be worth continuing to investigate our options.
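(For readers unfamiliar with it, a small pymongo sketch of the db.collection.find style mentioned above; the database and field names here are hypothetical.)

```python
from pymongo import MongoClient

client = MongoClient()
db = client["genealogy"]  # hypothetical database

# MongoDB expresses the query as a JSON-style document rather than a
# Python expression string; 1 stands in for a "male" gender code here.
for doc in db.person.find({"gender": 1}):
    print(doc["handle"])
```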
#1839 will allow efficient get_raw_* functions in proxies.
Asked for comments on the gramps-dev mailing list. |
The mailing list thread is called "A DB select method that can be engine-optimized". I'll start the discussion if nobody replies, but I didn't want to influence people by posting my opinions first.
Off topic, but I think this also applies to the "Switch from pickled blobs to JSON data" change (#1786): https://github.com/gramps-project/gramps/pull/1786/files
That would be great.
This PR adds methods designed to be implemented in a low-level DB system, like SQL. The human-facing code is all Python, and gets parsed into SQL. All of the code that is converted into SQL is written as strings. This allows coders to write in the same syntax that is supported by the DataDict interface (minus the object-creation variation). For example, you could select all of the male people with:
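(Reconstructed sketch: the original snippet was lost in extraction, so the exact call signature is an assumption based on the description.)

```python
# Reconstructed example -- the exact signature in the PR may differ.
from gramps.gen.lib import Person

for person in db.select_from_person("person.gender == Person.MALE"):
    print(person.gramps_id)
```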
(Person is defined in the environment in which the strings are evaluated.)
By default, the method returns a Gramps object per row, but you can optionally select one attribute ("person.handle") or a list of attributes (["person.handle", "person.gramps_id"]) using the what parameter. All arguments are optional.
Further Examples:
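(The original examples were also lost in extraction; the sketches below are reconstructions based on the description above. Only the "what" parameter name comes from the text; the rest of the signature is an assumption.)

```python
# Hedged sketches of the kinds of calls the description implies.

# Select a single attribute per row instead of a full object:
handles = db.select_from_person(
    "person.gender == Person.MALE",
    what="person.handle",
)

# Select a list of attributes per row:
rows = db.select_from_person(
    "person.gender == Person.MALE",
    what=["person.handle", "person.gramps_id"],
)

# All arguments are optional; with none, select every person:
everyone = db.select_from_person()
```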