
Loader speed: Query and DDL optimization #195

Closed
alok87 opened this issue Apr 16, 2021 · 6 comments
Labels
p1 (priority 1, do it ASAP) · performance (Monitoring, Metrics, Logs, Benchmarks)

Comments


alok87 commented Apr 16, 2021

now

I0416 02:15:24.018678       1 load_processor.go:668] loader-28d402-ts.inventory.otpsms, batchId:1, startOffset:4097
I0416 02:20:23.229203       1 load_processor.go:724] loader-28d402-ts.inventory.otpsms, load staging

This was seen in the first batch processed after the loader pod was created; the pod is handling only one topic.

The bug appeared after the recent schema-call optimizations. The transition from processing start to load staging used to take milliseconds; now it takes minutes.

before

I0401 08:05:12.574673       1 load_processor.go:739] loader-28d402-ts.inventory.otpsms, batchId:1, size:16389: processing...
I0401 08:05:12.574702       1 load_processor.go:646] loader-28d402-ts.inventory.otpsms, batchId:1, startOffset:57150
I0401 08:05:13.119588       1 load_processor.go:701] loader-28d402-ts.inventory.otpsms, load staging
@alok87 alok87 added the bug label Apr 16, 2021
@alok87 alok87 changed the title Loader speed: Slowness between processing start and the start of load staging Loader speed: Slowness in querying information schema Apr 16, 2021
@alok87 alok87 changed the title Loader speed: Slowness in querying information schema Loader speed: Slowness between processing start and the start of load staging Apr 16, 2021

alok87 commented Apr 16, 2021

The slowness is because querying the table schema takes many minutes to complete:

tableSchema = `SELECT
  f.attname AS name,
  pg_catalog.format_type(f.atttypid,f.atttypmod) AS col_type,
  CASE
      WHEN f.atthasdef = 't' THEN d.adsrc
      ELSE ''
  END AS default_val,
  f.attnotnull AS not_null,
  p.contype IS NOT NULL AND p.contype = 'p' AS primary_key,
  f.attisdistkey AS dist_key,
  f.attsortkeyord AS sort_ord
FROM pg_attribute f
  JOIN pg_class c ON c.oid = f.attrelid
  LEFT JOIN pg_attrdef d ON d.adrelid = c.oid AND d.adnum = f.attnum
  LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
  LEFT JOIN pg_constraint p ON p.conrelid = c.oid AND f.attnum = ANY (p.conkey)
WHERE c.relkind = 'r'::char
    AND n.nspname = '%s'  -- Replace with schema name
    AND c.relname = '%s'  -- Replace with table name
    AND f.attnum > 0 ORDER BY f.attnum;`

https://github.com/practo/tipoca-stream/blob/master/redshiftsink/pkg/redshift/redshift.go#L56-L76

Please fix this.


alok87 commented Apr 29, 2021

The bottleneck for loader performance is the Redshift load itself. If thousands of queries are occupying Redshift with loads and reads, Redshift simply cannot load fast. Check the graph below: the dip happens when all loads are stopped and only one table is allowed to load at a time. #207 has more info.

[Screenshot: Redshift load graph, 2021-04-29]


justjkk commented Apr 29, 2021

Let's experiment with compression, and maybe with changing the data format in S3, as specified in https://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html


alok87 commented May 3, 2021

Compression is live. Now experimenting with:

  1. suggestions mentioned here https://segment.com/blog/speed-up-redshift-syncs/
  2. CSV instead of json
  3. PARQUET, AVRO?

@alok87 alok87 changed the title Loader speed: Slowness between processing start and the start of load staging Loader speed: Query and DDL optimization May 5, 2021
@alok87 alok87 added the performance label and removed the bug label May 5, 2021

alok87 commented May 5, 2021

This issue was for the DDL optimizations; overall loader speed is actually being tracked in #186

@alok87 alok87 added the p1 label May 5, 2021

alok87 commented May 14, 2021

Closing this with #204; fix in #223

@alok87 alok87 closed this as completed May 14, 2021