
Build a cache of Postgres system objects #185

Open
jmafc opened this issue Mar 2, 2018 · 7 comments

@jmafc
Member

jmafc commented Mar 2, 2018

Issue #175, the test_operclass.py tests that were marked xfail in the last phase of dealing with #176, and issue #183 (probably also #184) all point to deficiencies in Pyrseas' knowledge of PG system types and functions, such as integer and tsvector_concat. These objects reside in the pg_catalog schema, and Pyrseas has always avoided fetching objects from it.

I doubt that more than a few users CREATE OPERATOR CLASSes on something like the pg_catalog.integer type as shown in those xfail tests. Maybe a sophisticated extension like PostGIS does, but such extensions are the exception. OTOH, usage of tsvector functions is more likely, as seen in #175 and in some test_trigger.py tests that use tsvector_update_trigger. Issues #183 and #184 request display of attribute information that may go beyond the PG system types, since there may be user-defined record types (table row types).
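
For illustration, this is the sort of user-level DDL that makes an object depend on a pg_catalog function (the docs table and its columns are hypothetical):

    -- hypothetical table; the trigger references the built-in
    -- pg_catalog.tsvector_update_trigger function
    CREATE TABLE docs (id serial PRIMARY KEY, body text, tsv tsvector);
    CREATE TRIGGER docs_tsv_update BEFORE INSERT OR UPDATE ON docs
        FOR EACH ROW EXECUTE PROCEDURE
        tsvector_update_trigger(tsv, 'pg_catalog.english', body);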

The proposal is to retrieve very basic information, probably just oid and name, at initialization, so that it can be used in subsequent processing. The precedent for this is Post-Facto, but with one difference: whereas PF was connected live to both source and target databases, yamltodb only has access to the target's oids. If we limit the cache to system types, it is highly unlikely that the oid for tsvector_concat in one database will differ from that in the other.
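
Concretely, the initialization query could be something along these lines (a sketch only; which catalogs to scan, and whether to fetch anything beyond oid and name, are open questions):

    -- minimal catalog scan at initialization: oids and names only
    SELECT t.oid, t.typname AS name, 'type' AS kind
      FROM pg_type t JOIN pg_namespace n ON n.oid = t.typnamespace
     WHERE n.nspname = 'pg_catalog'
    UNION ALL
    SELECT p.oid, p.proname, 'function'
      FROM pg_proc p JOIN pg_namespace n ON n.oid = p.pronamespace
     WHERE n.nspname = 'pg_catalog';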

@jmafc
Member Author

jmafc commented Sep 24, 2022

@greggailly Regarding our discussion under #251.
If I were to attempt to implement this, I'd first grep through the files in pyrseas/dbobject for 'pg_catalog'. Most of the results will turn out to be like this:

    WHERE (nspname != 'pg_catalog' AND ...

The first step is to remove that first predicate from the WHERE clause, which will result in retrieving everything in the pg_catalog schema. Since you have a specific use case in the aggregate described in #251, you could just change the query starting at about line 99 of dbobject/dbtype.py. BTW, I've tried this and it doesn't fetch the row for 'anyarray'. The problem appears to be that the query includes AND typtype = 'b', but 'anyarray' is a pseudo-type, so that would have to be changed to AND typtype IN ('b', 'p').
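
As a before/after sketch (the actual query text in dbtype.py differs; this is just the shape of the change):

    -- before: excludes pg_catalog objects and accepts only base types
    WHERE nspname != 'pg_catalog' AND typtype = 'b'

    -- after: drop the schema exclusion and admit pseudo-types like 'anyarray'
    WHERE typtype IN ('b', 'p')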

Once this change is made, I would test it against an empty database to see what dbtoyaml produces as output. It may break elsewhere because I don't think we've catered for pseudo-types. 'anyarray' may also still be excluded because, aside from the SQL queries, there may be other places (in Python) where we exclude pg_catalog objects. Once those possible issues are resolved or worked around, I would test against your aggregate.

Hope this helps.

@benjamin-kirkbride
Contributor

@jmafc do we have any idea how big a lift this would be?

@jmafc
Member Author

jmafc commented Mar 23, 2023

"this"? If by "this" you mean what Daniele suggested (which you have essentially done) and fixing loose ends such as the specific problem of "misplaced" functions, I suspect it may not be too difficult to come up with some fix or general workaround. But (there's always one of those), there could be something more ragged. However, the problem of #165 may be easier to solve now than when the issue was first opened, because in the interim we have done away with treating schema public as a special case [although looking at CatDbConnection.connect, I don't recall why we do what we do there--it seems like we ought to not set search_path at all, and then presumably PG would have to return all qualified names].

@benjamin-kirkbride
Contributor

@jmafc I'm asking specifically about building a cache of all Postgres system objects; it sounded like there were a lot of gotchas there, and this issue has been open a long time. Sorry if I'm being dense and not understanding.

@jmafc
Member Author

jmafc commented Mar 23, 2023

No, I'm the one who needs to apologize. I thought your comment was under issue #244 so I thought "this" meant fixing that issue.

This, i.e., the cache, is probably like lifting 100 kg or more. Looking at one of my medium-sized databases where \d shows about 50 rows, pg_class has a little over 100 rows in the public and pg_toast schemas, which I believe is primarily due to indexes and varchar/text columns, respectively, but pg_catalog has over 250 rows. In pg_attribute, there are about 600 rows each in public and pg_toast, for an average of six attributes per relation, but there are nearly 1900 rows in pg_catalog. And reading in the pg_catalog rows probably isn't where the big lift is (although it will affect performance); the lift is then having to decide (a) in dbtoyaml, whether to include a particular object in the output, and (b) in yamltodb, how to reconcile these normally "hidden" objects.
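
For anyone wanting to reproduce those counts on their own database, queries along these lines will do:

    -- rows in pg_class, per schema
    SELECT n.nspname, count(*)
      FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
     GROUP BY n.nspname ORDER BY count(*) DESC;

    -- rows in pg_attribute, per schema of the owning relation
    SELECT n.nspname, count(*)
      FROM pg_attribute a
      JOIN pg_class c ON c.oid = a.attrelid
      JOIN pg_namespace n ON n.oid = c.relnamespace
     GROUP BY n.nspname ORDER BY count(*) DESC;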

@benjamin-kirkbride
Contributor

Is this something you have a clear vision for, or is the above description about as far into planning/design as it's gotten?

@jmafc
Member Author

jmafc commented Mar 24, 2023

I may have had a "vision" five years ago, but it seems rather blurry right now, not only due to the passage of time, but also due to other changes and the fact that I haven't really been thinking a lot about Pyrseas in the meantime. As far as I can recall, the idea was to read in all the catalogs and build a cache, either in memory or in a YAML file, so that, for example, when we found a pg_catalog.tsvector in a user object, we could link to it and output it or use it as a dependency, as we do with other objects.
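
To make the YAML-file variant concrete, the cache might look something like this (the shape is entirely hypothetical; 3614 is tsvector's actual, stable oid in pg_type):

    pg_catalog:
      types:
        tsvector:
          oid: 3614
        # ...more types, and similarly functions keyed by signature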

At this point, in fact, if I were to try to implement something, I'd start by building a test case for #175 and use TDD to implement the minimal stuff necessary to get the test to pass. Then, based on what I may have learned from that exercise, I would try to generalize.
