INT-2469 Add cache when using --step option #606

VDubber · 2022-02-07T19:53:20Z

The --use-dependencies-cache was replaced by "always on" cache when the --step flag is being used. If desired, it can be turned off with --no-cache. A file path to a separate cache can be supplied with --cache-path.

Updated

Added --use-dependencies-cache flag to the collect command to increase development speed on long-running steps.

A customer is looking for a way to "skip" a step(s) that is depended upon by the target step. This implementation uses existing data from the .j1-integration directory to populate the the .j1-cache if no filepath is specified.

Looking for final feedback.

Added --ignore-step-dependencies flag to the `collect` command to increase development speed on long-running steps.

VDubber · 2022-02-07T19:56:42Z

Running these changes within graph-signal-sciences produces the following output:

ndowmon

Sam, this is an awesome effort! I did uncover some really tricky behavior that I think we should definitely discuss - getting this one done is definitely more complex than I had originally thought. Let's set up some time to chat!

docs/integrations/development.md

packages/integration-sdk-runtime/src/execution/executeIntegration.ts

ndowmon · 2022-02-08T14:44:10Z

packages/integration-sdk-runtime/src/execution/executeIntegration.ts

+    if (options.useStorageForIgnoredStepDependencies) {
+      const loadedEntities: string[] = [];
+      for (const stepId of config?.dependentSteps ?? []) {
+        const path = getRootStorageAbsolutePath(`graph/${stepId}/entities`);


Is there any reason not to also load relationships from the same directory?

Probably a limit on my experience with large integrations. If a step (A) depends on a step (B) with relationships, does A ever query or rely on the relationships?
For the sake of speed I didn't ingest relationship from disk.
tl;dr - the reason of speed

Technically there is an interface to iterate relationships (jobState.iterateRelationships), though it isn't used much. Loading relationships would also give us guarantees that we aren't using duplicate keys. Again, not a huge issue, but probably worth doing.

Do you have a sense of how long loading these entities and relationships is taking?

One more Q: what's the user experience if there is no graph/${stepId}/entities directory? Can you post the message printed by the logger?

So if no entities or relationships are found the above is shown. And now that the code is using the metric magic, it looks like 6 mils to load a single entity from disk. The next step, which hits the API, is 220 mils.

This is what the results say. We could introduce a new status: cache_miss?

packages/integration-sdk-runtime/src/execution/executeIntegration.ts

Replaced --use-dependencies-cache flag to the `collect` command. Changes simplify execution and improve developer experience.

# Conflicts: # yarn.lock

…k in.

docs/integrations/development.md

ndowmon

So much better! Great work and thanks for the screenshots / tests.

I've approved with some comments, would love to see what you have to say about them but overall I'm good with this.

packages/integration-sdk-cli/src/commands/collect.ts

ndowmon · 2022-02-10T21:51:47Z

docs/integrations/development.md

+3. Execute collection command with `--step` and `--use-dependencies-cache` option specifying a 
+filepath to the previously created .j1-cache (most commonly `./`).


I'm thinking through the file system here... Seems that I can also do the following as many times as I want?

Execute collection command without the --use-dependencies-cache option to gather data in .j1-integration

Execute collection command with --step and --use-dependencies-cache option without specifying a filepath.
This will cause the .j1-integration data to populate the .j1-cache.

(repeat step two ad infinitum)

Is it true that I can continue to repeat step 2 over and over and it will only ever actually re-ingest the --step fetch-users step?

I have the same question as Nick here. If this is true, I suppose running step 2 is essentially "refreshing the cache". I was going to suggest we add an option to clear and refresh the cache but sounds like maybe this would handle the refreshing part anyway. I'm just not sure how intuitive that is vs a separate flag needed to refresh the cache in here and otherwise it will default to using the cache that already lives in the directory. Thoughts?

@ndowmon Yes. if you repeat step 2 it will copy over the .j1-integration data over and over and only execute the fetch-users step. If the .j1-integration data goes sideways, you'll end up with a cache that is busted. Hence the suggestion to specify the filepath - then it'll stop trying to create the cache.

@ceelias It isn't clear. I thought about a command as well but that seemed too much. I imagine caches being self-maintaining. Not sure if that's the correct approach. Right now the cache creation destroys the previous cache. We could do more of an update, maybe?.... That might be tricky to get right.

@ndowmon @ceelias
Crazy idea: what if we turn on caching by default whenever the --step flag is used? We could introduce a simple --no-cache flag and -cache-path flag. I think this make the cache helpful and out of the way until you need to deal with it.

The --step flag is already used to decrease developer wait time... Why not make it even better with the cache?

On the initial run, when there is no .j1-integration data, it doesn't utilize the cache...

--
I might be in cacheland for too long now and my mind is skewed.

🤔 hm, that is an interesting idea... although I do think you may be stuck in cacheland 😉

My concern with defaulting to --no-cache is that differences in upstream dependencies may (or probably should) cause the chosen --step to behave differently. I still think it's best to refresh the dependent steps before running the new step.

packages/integration-sdk-runtime/src/execution/dependencyGraph.ts

austinkelleher · 2022-02-11T00:49:51Z

packages/integration-sdk-cli/src/commands/collect.ts

+ */
+async function copyToCache() {
+  const graphDirectory = path.join(getRootStorageDirectory(), 'graph');
+  if (fs.ensureDir(graphDirectory)) {


This seems like it is an async function? See: https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/types/fs-extra/index.d.ts#L45

Can we install @types/fs-extra as a dev dep?

ceelias

Really solid work! Thanks for working through this! I left a fix on a typo and a few thought provoking comments

ceelias · 2022-02-11T15:00:56Z

docs/integrations/development.md


-#### `j1-integration sync`
+###### And example of the expected cache structure


Think this is a typo here.

Suggested change

###### And example of the expected cache structure

###### An example of the expected cache structure

ceelias · 2022-02-11T15:05:03Z

docs/integrations/development.md

+3. Execute collection command with `--step` and `--use-dependencies-cache` option specifying a 
+filepath to the previously created .j1-cache (most commonly `./`).


I have the same question as Nick here. If this is true, I suppose running step 2 is essentially "refreshing the cache". I was going to suggest we add an option to clear and refresh the cache but sounds like maybe this would handle the refreshing part anyway. I'm just not sure how intuitive that is vs a separate flag needed to refresh the cache in here and otherwise it will default to using the cache that already lives in the directory. Thoughts?

ceelias · 2022-02-11T15:20:40Z

packages/integration-sdk-cli/src/__tests__/cli.test.ts

+      },
+    });
+  });
+


I may have missed it but what happens in the case that the integration wasn't run yet and there is no data to copy over to the cache? Might be worth a test here.

If a step is tagged as using the cache but fails to load any entities/relationships, it marks that step as a failure. I'll see if I can get that in here.

I added a test and it caused the other tests to fail. I may come back to this at another time.

ceelias · 2022-02-11T15:21:54Z

packages/integration-sdk-cli/src/commands/collect.ts

@@ -30,15 +33,34 @@ export function collect() {
      collector,
      [],
    )
+    .option(
+      '-C, --use-dependencies-cache [filePath]',
+      'Loads cache for the dependencies required by the step(s) specified in --step option. Execution of these steps is skipped. Data found in .j1-integration is used if no filepath is provided.',


More so for curiousity but we have the ability to create multiple dependency graphs. Have you tried to run that use case/does your code account for that? Maybe something we can address in a later PR if not.

I have not. I'd love to see that in action!

Opened #619 to track

- Improved logging - Installed @types/fs-extra - Removed need for JUPITERONE_CACHE_DIRECTORY env var

ndowmon · 2022-02-14T17:52:55Z

docs/integrations/development.md

+###### An example of the expected cache structure
+
+```
+.j1-cache/


You'll probably want to go back to the integration-template project and update the .gitignore to include .j1-cache.

@VDubber We'll probably want to update this to reflect the new /graph nesting.

aiwilliams

Crazy idea: what if we turn on caching by default whenever the --step flag is used? We could introduce a simple --no-cache flag and -cache-path flag. I think this make the cache helpful and out of the way until you need to deal with it.

The --step flag is already used to decrease developer wait time... Why not make it even better with the cache?

On the initial run, when there is no .j1-integration data, it doesn't utilize the cache...

I think this is an excellent idea and should be where we go in the next iteration of this excellent feature.

aiwilliams · 2022-02-14T17:41:32Z

docs/integrations/development.md

+###### An example of the expected cache structure
+
+```
+.j1-cache/


Was .j1-integration-cache considered? I think it would be nice to see it right next to .j1-integration in file listings.

Agreed. @VDubber and I talked about this too.

Great minds with the same idea. I'll make it happen.

aiwilliams · 2022-02-14T17:43:42Z

...k-private-test-utils/__fixtures__/instanceWithDependentCachedSteps/src/validateInvocation.ts

@@ -0,0 +1,3 @@
+export default function validateInvocation() {
+  return 'loaded';


IIRC, validateInvocation function return values are not used. This may have been copied from other integration fixtures, but I think it would be good at some point to make none of them return something since those values are not meaningful, to avoid any confusion.

austinkelleher

Nice work @VDubber! I left some comments. I know that you still have a few changes that you're working on, but I supplied my initial review.

austinkelleher · 2022-02-14T19:08:13Z

docs/integrations/development.md

@@ -1143,14 +1144,54 @@ For convenience, steps can allow be provided as a comma delimited list.

 ex: `j1-integration collect --step step-fetch-users,step-fetch-groups`

-###### `--ignore-step-dependencies`


austinkelleher · 2022-02-14T19:10:11Z

docs/integrations/development.md

+integrations. When no filepath is specified, an attempt to create a cache is
+made by copying the contents of `./.j1-integrations/graph` directory to
+`./.j1-cache`. The structure of the cache follows a similar format as the
+.j1-integration data storage, as described [here](#data-collection).


Super minor, but we wrapped other instances of ".j1-integration" in backticks.

Suggested change

.j1-integration data storage, as described [here](#data-collection).

`.j1-integration` data storage, as described [here](#data-collection).

austinkelleher · 2022-02-14T19:11:49Z

docs/integrations/development.md

+###### An example of the expected cache structure
+
+```
+.j1-cache/


@VDubber We'll probably want to update this to reflect the new /graph nesting.

austinkelleher · 2022-02-15T13:46:52Z

packages/integration-sdk-cli/src/commands/collect.ts

+  const sourceGraphDirectory = path.join(getRootStorageDirectory(), 'graph');
+  const destinationGraphDirectory = path.join(getRootCacheDirectory(), 'graph');
+
+  if (fs.pathExistsSync(sourceGraphDirectory)) {


We should change this to be async.

Suggested change

if (fs.pathExistsSync(sourceGraphDirectory)) {

if (await fs.pathExists(sourceGraphDirectory)) {

austinkelleher · 2022-02-15T14:00:22Z

packages/integration-sdk-runtime/src/execution/dependencyGraph.ts

+      const { jobState, logger } = context;
+
+      let entitiesCount = 0;
+      await walkDirectory({


We should be able to replace this with iterateParsedGraphFiles which is exported from packages/integration-sdk-runtime/src/fileSystem.ts.

Cleanup suggestion...

There is a function called iterateParsedGraphFiles, which is exported from packages/integration-sdk-runtime/src/fileSystem.ts. I think we could expose two new functions:

iteratedParsedEntityGraphFiles

iteratedParsedRelationshipGraphFiles

Example impl:

export function iteratedParsedEntityGraphFiles( iteratee: (entities: Entity[]) => Promise<void>, graphPath?: string ) { return iterateParsedGraphFiles(async (data) => { if (data.entities) await iteratee(data.entities); }, graphPath); }

Example usage:

await iteratedParsedEntityGraphFiles( async (entities) => { entitiesCount += entities.length; await jobState.addEntities(entities); status = StepResultStatus.CACHED; }, entitiesPath );

austinkelleher · 2022-02-15T14:05:42Z

packages/integration-sdk-runtime/src/execution/dependencyGraph.ts

+
+      if (entitiesCount || relationshipCount) {
+        logger.info(
+          `Loaded ${entitiesCount} entities and  ${relationshipCount} relationship(s) from cache.`,


Minor:

It is a good practice to place all variables in the logger function's first arg instead of using string interpolation. String interpolated logs are slow to search.

Example of a fast log search term:

Loaded entities and relationship(s) from cache.

Example of a slow search term for which we will need to use the log search "contains" operator to find all instances of our log message. E.g. ~= "relationship(s) from cache."

Loaded 10 entities and 10 relationship(s) from cache.

Suggestion:

logger.info({ entitiesCount, relationshipCount }, 'Loaded entities and relationship(s) from cache.');

austinkelleher · 2022-02-15T14:07:22Z

packages/integration-sdk-runtime/src/execution/step.ts

+
+        if (dependenciesCache?.enabled) {
+          ctx.logger.info(
+            `Using dependencies cache found at ${dependenciesCache.filepath}`,


Same feedback about logging.

Suggested change

`Using dependencies cache found at ${dependenciesCache.filepath}`,

{ cacheFilePath: dependenciesCache.filePath }, 'Using dependencies cache',

austinkelleher · 2022-02-15T14:08:18Z

packages/integration-sdk-runtime/src/execution/step.ts

        for (const stepId of allStepIds) {
          const originalValue = originalEnabledRecord[stepId] ?? {};
          if (stepsToRun.includes(stepId)) {
            enabledRecord[stepId] = {
              ...originalValue,
              disabled: false,
+              ...(dependenciesCache?.enabled &&


Would you mind extracting this out into a function? I think it will be a bit easier for readers to understand.

austinkelleher · 2022-02-15T14:08:31Z

packages/integration-sdk-runtime/src/fileSystem.ts

@@ -17,12 +17,12 @@ import { readGraphObjectFile } from './storage/FileSystemGraphObjectStore/indice
 const brotliCompress = promisify(zlib.brotliCompress);
 const brotliDecompress = promisify(zlib.brotliDecompress);

-export const DEFAULT_CACHE_DIRECTORY_NAME = '.j1-integration';
+export const DEFAULT_STORAGE_DIRECTORY_NAME = '.j1-integration';


austinkelleher · 2022-02-15T14:10:19Z

docs/integrations/development.md

+###### An example of the expected cache structure
+
+```
+.j1-cache/


Agreed. @VDubber and I talked about this too.

Introduce --no-cache option and --cache-path option. If a cache is not available for a step, it executes it like normal.

- rename .j1-cache to .j1-integration-cache - Additional cleanup

…shipGraphFiles

INT-2469

506870f

Added --ignore-step-dependencies flag to the `collect` command to increase development speed on long-running steps.

VDubber requested review from ceelias and ndowmon February 7, 2022 19:53

ndowmon reviewed Feb 8, 2022

View reviewed changes

packages/integration-sdk-runtime/src/execution/executeIntegration.ts Outdated Show resolved Hide resolved

VDubber added 4 commits February 10, 2022 13:26

INT-2469

ae597dd

Replaced --use-dependencies-cache flag to the `collect` command. Changes simplify execution and improve developer experience.

Changes color of cached status

503af54

Merge branch 'main' into INT-2469-implement-ignore-step-dependencies

b130c2e

# Conflicts: # yarn.lock

Merged main and resolved conflicts with yarn.lock. Added fs-extra bac…

e45446d

…k in.

VDubber commented Feb 10, 2022

View reviewed changes

docs/integrations/development.md Show resolved Hide resolved

Self-review cleanup

493edb6

VDubber marked this pull request as ready for review February 10, 2022 21:03

VDubber requested a review from ndowmon February 10, 2022 21:03

VDubber changed the title ~~INT-2469 --ignore-step-dependencies flag~~ INT-2469 --use-dependencies-cache flag Feb 10, 2022

ndowmon previously approved these changes Feb 10, 2022

View reviewed changes

austinkelleher reviewed Feb 11, 2022

View reviewed changes

ceelias reviewed Feb 11, 2022

View reviewed changes

ndowmon mentioned this pull request Feb 11, 2022

Allow j1-integration collect --use-dependency-cache to support multiple dependency graphs #619

Open

Further refinements to dependencies cache.

1fc882f

- Improved logging - Installed @types/fs-extra - Removed need for JUPITERONE_CACHE_DIRECTORY env var

VDubber dismissed ndowmon’s stale review via 1fc882f February 11, 2022 21:40

VDubber requested review from austinkelleher, ceelias and ndowmon and removed request for ceelias February 11, 2022 21:51

VDubber added 3 commits February 14, 2022 08:55

Prettier, remove unnecessary flush.

f5c94e3

Included /graph inside the j1-cache to allow for future types of cache.

f670b57

Merge branch 'main' into INT-2469-implement-ignore-step-dependencies

fc3746e

ndowmon reviewed Feb 14, 2022

View reviewed changes

ndowmon previously approved these changes Feb 14, 2022

View reviewed changes

aiwilliams reviewed Feb 14, 2022

View reviewed changes

Align async/await approach.

b379bd7

austinkelleher reviewed Feb 15, 2022

View reviewed changes

Flip option from --use-dependencies-cache to default cache always on.

7b24e4c

Introduce --no-cache option and --cache-path option. If a cache is not available for a step, it executes it like normal.

VDubber dismissed ndowmon’s stale review via 7b24e4c February 15, 2022 17:15

VDubber added 3 commits February 15, 2022 10:36

Changes from review feedback

68e89b5

- rename .j1-cache to .j1-integration-cache - Additional cleanup

Fix failing test because of filepath issues.

50301aa

Improved to use iterateParsedEntityGraphFiles & iterateParsedRelation…

7902f78

…shipGraphFiles

VDubber requested review from aiwilliams and austinkelleher February 15, 2022 18:04

austinkelleher approved these changes Feb 15, 2022

View reviewed changes

VDubber changed the title ~~INT-2469 --use-dependencies-cache flag~~ INT-2469 Add cache when using --step option Feb 15, 2022

Merge branch 'main' into INT-2469-implement-ignore-step-dependencies

3c15357

VDubber merged commit 1280faf into main Feb 15, 2022

VDubber deleted the INT-2469-implement-ignore-step-dependencies branch February 15, 2022 22:56

VDubber linked an issue Feb 17, 2022 that may be closed by this pull request

Implement CLI flag --ignore-step-dependencies #602

Closed

		3. Execute collection command with `--step` and `--use-dependencies-cache` option specifying a
		filepath to the previously created .j1-cache (most commonly `./`).


		#### `j1-integration sync`
		###### And example of the expected cache structure

	###### And example of the expected cache structure
	###### An example of the expected cache structure

		@@ -0,0 +1,3 @@
		export default function validateInvocation() {
		return 'loaded';

		@@ -1143,14 +1144,54 @@ For convenience, steps can allow be provided as a comma delimited list.

		ex: `j1-integration collect --step step-fetch-users,step-fetch-groups`

		###### `--ignore-step-dependencies`

	.j1-integration data storage, as described [here](#data-collection).
	`.j1-integration` data storage, as described [here](#data-collection).

	if (fs.pathExistsSync(sourceGraphDirectory)) {
	if (await fs.pathExists(sourceGraphDirectory)) {

	`Using dependencies cache found at ${dependenciesCache.filepath}`,
	{ cacheFilePath: dependenciesCache.filePath }, 'Using dependencies cache',

INT-2469 Add cache when using --step option #606

INT-2469 Add cache when using --step option #606

Conversation

VDubber commented Feb 7, 2022 • edited Loading

Updated

VDubber commented Feb 7, 2022 • edited Loading

ndowmon left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ndowmon left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VDubber Feb 11, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ceelias left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aiwilliams left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

austinkelleher left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

austinkelleher Feb 15, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VDubber commented Feb 7, 2022 •

edited

Loading

VDubber commented Feb 7, 2022 •

edited

Loading

VDubber Feb 11, 2022 •

edited

Loading

austinkelleher Feb 15, 2022 •

edited

Loading