Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

INT-2469 Add cache when using --step option #606

Merged
merged 16 commits into from
Feb 15, 2022

Conversation

VDubber
Copy link
Contributor

@VDubber VDubber commented Feb 7, 2022

The --use-dependencies-cache was replaced by "always on" cache when the --step flag is being used. If desired, it can be turned off with --no-cache. A file path to a separate cache can be supplied with --cache-path.

Updated

Added --use-dependencies-cache flag to the collect command to increase development speed on long-running steps.

A customer is looking for a way to "skip" a step(s) that is depended upon by the target step. This implementation uses existing data from the .j1-integration directory to populate the the .j1-cache if no filepath is specified.

Looking for final feedback.

Added --ignore-step-dependencies flag to the `collect` command to increase development speed on long-running steps.
@VDubber VDubber requested review from ceelias and ndowmon February 7, 2022 19:53
@VDubber
Copy link
Contributor Author

VDubber commented Feb 7, 2022

Running these changes within graph-signal-sciences produces the following output:

image

image

Copy link
Contributor

@ndowmon ndowmon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sam, this is an awesome effort! I did uncover some really tricky behavior that I think we should definitely discuss - getting this one done is definitely more complex than I had originally thought. Let's set up some time to chat!

docs/integrations/development.md Outdated Show resolved Hide resolved
if (options.useStorageForIgnoredStepDependencies) {
const loadedEntities: string[] = [];
for (const stepId of config?.dependentSteps ?? []) {
const path = getRootStorageAbsolutePath(`graph/${stepId}/entities`);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason not to also load relationships from the same directory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a limit on my experience with large integrations. If a step (A) depends on a step (B) with relationships, does A ever query or rely on the relationships?
For the sake of speed I didn't ingest relationship from disk.
tl;dr - the reason of speed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically there is an interface to iterate relationships (jobState.iterateRelationships), though it isn't used much. Loading relationships would also give us guarantees that we aren't using duplicate keys. Again, not a huge issue, but probably worth doing.

Do you have a sense of how long loading these entities and relationships is taking?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more Q: what's the user experience if there is no graph/${stepId}/entities directory? Can you post the message printed by the logger?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

So if no entities or relationships are found the above is shown. And now that the code is using the metric magic, it looks like 6 mils to load a single entity from disk. The next step, which hits the API, is 220 mils.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

This is what the results say. We could introduce a new status: cache_miss?

Replaced --use-dependencies-cache flag to the `collect` command. Changes simplify execution and improve developer experience.
@VDubber VDubber marked this pull request as ready for review February 10, 2022 21:03
@VDubber VDubber requested a review from ndowmon February 10, 2022 21:03
@VDubber VDubber changed the title INT-2469 --ignore-step-dependencies flag INT-2469 --use-dependencies-cache flag Feb 10, 2022
ndowmon
ndowmon previously approved these changes Feb 10, 2022
Copy link
Contributor

@ndowmon ndowmon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So much better! Great work and thanks for the screenshots / tests.

I've approved with some comments, would love to see what you have to say about them but overall I'm good with this.

packages/integration-sdk-cli/src/commands/collect.ts Outdated Show resolved Hide resolved
packages/integration-sdk-cli/src/commands/collect.ts Outdated Show resolved Hide resolved
packages/integration-sdk-cli/src/commands/collect.ts Outdated Show resolved Hide resolved
Comment on lines 1176 to 1177
3. Execute collection command with `--step` and `--use-dependencies-cache` option specifying a
filepath to the previously created .j1-cache (most commonly `./`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking through the file system here... Seems that I can also do the following as many times as I want?

  1. Execute collection command without the --use-dependencies-cache option to gather data in .j1-integration
  2. Execute collection command with --step and --use-dependencies-cache option without specifying a filepath.
    This will cause the .j1-integration data to populate the .j1-cache.
  3. (repeat step two ad infinitum)

Is it true that I can continue to repeat step 2 over and over and it will only ever actually re-ingest the --step fetch-users step?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same question as Nick here. If this is true, I suppose running step 2 is essentially "refreshing the cache". I was going to suggest we add an option to clear and refresh the cache but sounds like maybe this would handle the refreshing part anyway. I'm just not sure how intuitive that is vs a separate flag needed to refresh the cache in here and otherwise it will default to using the cache that already lives in the directory. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ndowmon Yes. if you repeat step 2 it will copy over the .j1-integration data over and over and only execute the fetch-users step. If the .j1-integration data goes sideways, you'll end up with a cache that is busted. Hence the suggestion to specify the filepath - then it'll stop trying to create the cache.

@ceelias It isn't clear. I thought about a command as well but that seemed too much. I imagine caches being self-maintaining. Not sure if that's the correct approach. Right now the cache creation destroys the previous cache. We could do more of an update, maybe?.... That might be tricky to get right.

Copy link
Contributor Author

@VDubber VDubber Feb 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ndowmon @ceelias
Crazy idea: what if we turn on caching by default whenever the --step flag is used? We could introduce a simple --no-cache flag and -cache-path flag. I think this make the cache helpful and out of the way until you need to deal with it.

The --step flag is already used to decrease developer wait time... Why not make it even better with the cache?

On the initial run, when there is no .j1-integration data, it doesn't utilize the cache...

--
I might be in cacheland for too long now and my mind is skewed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 hm, that is an interesting idea... although I do think you may be stuck in cacheland 😉

My concern with defaulting to --no-cache is that differences in upstream dependencies may (or probably should) cause the chosen --step to behave differently. I still think it's best to refresh the dependent steps before running the new step.

*/
async function copyToCache() {
const graphDirectory = path.join(getRootStorageDirectory(), 'graph');
if (fs.ensureDir(graphDirectory)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it is an async function? See: https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/types/fs-extra/index.d.ts#L45

Can we install @types/fs-extra as a dev dep?

Copy link
Contributor

@ceelias ceelias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really solid work! Thanks for working through this! I left a fix on a typo and a few thought provoking comments


#### `j1-integration sync`
###### And example of the expected cache structure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think this is a typo here.

Suggested change
###### And example of the expected cache structure
###### An example of the expected cache structure

Comment on lines 1176 to 1177
3. Execute collection command with `--step` and `--use-dependencies-cache` option specifying a
filepath to the previously created .j1-cache (most commonly `./`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same question as Nick here. If this is true, I suppose running step 2 is essentially "refreshing the cache". I was going to suggest we add an option to clear and refresh the cache but sounds like maybe this would handle the refreshing part anyway. I'm just not sure how intuitive that is vs a separate flag needed to refresh the cache in here and otherwise it will default to using the cache that already lives in the directory. Thoughts?

},
});
});

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have missed it but what happens in the case that the integration wasn't run yet and there is no data to copy over to the cache? Might be worth a test here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a step is tagged as using the cache but fails to load any entities/relationships, it marks that step as a failure. I'll see if I can get that in here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a test and it caused the other tests to fail. I may come back to this at another time.

@@ -30,15 +33,34 @@ export function collect() {
collector,
[],
)
.option(
'-C, --use-dependencies-cache [filePath]',
'Loads cache for the dependencies required by the step(s) specified in --step option. Execution of these steps is skipped. Data found in .j1-integration is used if no filepath is provided.',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More so for curiousity but we have the ability to create multiple dependency graphs. Have you tried to run that use case/does your code account for that? Maybe something we can address in a later PR if not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not. I'd love to see that in action!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #619 to track

- Improved logging
- Installed @types/fs-extra
- Removed need for JUPITERONE_CACHE_DIRECTORY env var
@VDubber VDubber requested review from austinkelleher, ceelias and ndowmon and removed request for ceelias February 11, 2022 21:51
###### An example of the expected cache structure

```
.j1-cache/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll probably want to go back to the integration-template project and update the .gitignore to include .j1-cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VDubber We'll probably want to update this to reflect the new /graph nesting.

ndowmon
ndowmon previously approved these changes Feb 14, 2022
Copy link
Contributor

@aiwilliams aiwilliams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Crazy idea: what if we turn on caching by default whenever the --step flag is used? We could introduce a simple --no-cache flag and -cache-path flag. I think this make the cache helpful and out of the way until you need to deal with it.

The --step flag is already used to decrease developer wait time... Why not make it even better with the cache?

On the initial run, when there is no .j1-integration data, it doesn't utilize the cache...

I think this is an excellent idea and should be where we go in the next iteration of this excellent feature.

###### An example of the expected cache structure

```
.j1-cache/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was .j1-integration-cache considered? I think it would be nice to see it right next to .j1-integration in file listings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. @VDubber and I talked about this too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great minds with the same idea. I'll make it happen.

@@ -0,0 +1,3 @@
export default function validateInvocation() {
return 'loaded';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, validateInvocation function return values are not used. This may have been copied from other integration fixtures, but I think it would be good at some point to make none of them return something since those values are not meaningful, to avoid any confusion.

Copy link
Contributor

@austinkelleher austinkelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @VDubber! I left some comments. I know that you still have a few changes that you're working on, but I supplied my initial review.

@@ -1143,14 +1144,54 @@ For convenience, steps can allow be provided as a comma delimited list.

ex: `j1-integration collect --step step-fetch-users,step-fetch-groups`

###### `--ignore-step-dependencies`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

integrations. When no filepath is specified, an attempt to create a cache is
made by copying the contents of `./.j1-integrations/graph` directory to
`./.j1-cache`. The structure of the cache follows a similar format as the
.j1-integration data storage, as described [here](#data-collection).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super minor, but we wrapped other instances of ".j1-integration" in backticks.

Suggested change
.j1-integration data storage, as described [here](#data-collection).
`.j1-integration` data storage, as described [here](#data-collection).

###### An example of the expected cache structure

```
.j1-cache/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VDubber We'll probably want to update this to reflect the new /graph nesting.

const sourceGraphDirectory = path.join(getRootStorageDirectory(), 'graph');
const destinationGraphDirectory = path.join(getRootCacheDirectory(), 'graph');

if (fs.pathExistsSync(sourceGraphDirectory)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should change this to be async.

Suggested change
if (fs.pathExistsSync(sourceGraphDirectory)) {
if (await fs.pathExists(sourceGraphDirectory)) {

const { jobState, logger } = context;

let entitiesCount = 0;
await walkDirectory({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to replace this with iterateParsedGraphFiles which is exported from packages/integration-sdk-runtime/src/fileSystem.ts.

Cleanup suggestion...

There is a function called iterateParsedGraphFiles, which is exported from packages/integration-sdk-runtime/src/fileSystem.ts. I think we could expose two new functions:

  • iteratedParsedEntityGraphFiles
  • iteratedParsedRelationshipGraphFiles

Example impl:

export function iteratedParsedEntityGraphFiles(
  iteratee: (entities: Entity[]) => Promise<void>,
  graphPath?: string
) {
  return iterateParsedGraphFiles(async (data) => {
    if (data.entities) await iteratee(data.entities);
  }, graphPath);
}

Example usage:

await iteratedParsedEntityGraphFiles(
  async (entities) => {
    entitiesCount += entities.length;
    await jobState.addEntities(entities);
    status = StepResultStatus.CACHED;
  },
  entitiesPath
);


if (entitiesCount || relationshipCount) {
logger.info(
`Loaded ${entitiesCount} entities and ${relationshipCount} relationship(s) from cache.`,
Copy link
Contributor

@austinkelleher austinkelleher Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor:

It is a good practice to place all variables in the logger function's first arg instead of using string interpolation. String interpolated logs are slow to search.

Example of a fast log search term:

Loaded entities and relationship(s) from cache.

Example of a slow search term for which we will need to use the log search "contains" operator to find all instances of our log message. E.g. ~= "relationship(s) from cache."

Loaded 10 entities and 10 relationship(s) from cache.

Suggestion:

logger.info({
  entitiesCount,
  relationshipCount
}, 'Loaded entities and relationship(s) from cache.');


if (dependenciesCache?.enabled) {
ctx.logger.info(
`Using dependencies cache found at ${dependenciesCache.filepath}`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same feedback about logging.

Suggested change
`Using dependencies cache found at ${dependenciesCache.filepath}`,
{ cacheFilePath: dependenciesCache.filePath }, 'Using dependencies cache',

for (const stepId of allStepIds) {
const originalValue = originalEnabledRecord[stepId] ?? {};
if (stepsToRun.includes(stepId)) {
enabledRecord[stepId] = {
...originalValue,
disabled: false,
...(dependenciesCache?.enabled &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind extracting this out into a function? I think it will be a bit easier for readers to understand.

@@ -17,12 +17,12 @@ import { readGraphObjectFile } from './storage/FileSystemGraphObjectStore/indice
const brotliCompress = promisify(zlib.brotliCompress);
const brotliDecompress = promisify(zlib.brotliDecompress);

export const DEFAULT_CACHE_DIRECTORY_NAME = '.j1-integration';
export const DEFAULT_STORAGE_DIRECTORY_NAME = '.j1-integration';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

###### An example of the expected cache structure

```
.j1-cache/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. @VDubber and I talked about this too.

Introduce --no-cache option and --cache-path option.

If a cache is not available for a step, it executes it like normal.
@VDubber VDubber changed the title INT-2469 --use-dependencies-cache flag INT-2469 Add cache when using --step option Feb 15, 2022
@VDubber VDubber merged commit 1280faf into main Feb 15, 2022
@VDubber VDubber deleted the INT-2469-implement-ignore-step-dependencies branch February 15, 2022 22:56
@VDubber VDubber linked an issue Feb 17, 2022 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement CLI flag --ignore-step-dependencies
5 participants