INT-2469 Add cache when using --step option #606
Sam, this is an awesome effort! I did uncover some really tricky behavior that I think we should definitely discuss - getting this one done is definitely more complex than I had originally thought. Let's set up some time to chat!
packages/integration-sdk-runtime/src/execution/executeIntegration.ts
if (options.useStorageForIgnoredStepDependencies) {
  const loadedEntities: string[] = [];
  for (const stepId of config?.dependentSteps ?? []) {
    const path = getRootStorageAbsolutePath(`graph/${stepId}/entities`);
Is there any reason not to also load `relationships` from the same directory?
Probably a limit of my experience with large integrations. If a step (A) depends on a step (B) with relationships, does A ever query or rely on the relationships?
For the sake of speed, I didn't ingest relationships from disk.
tl;dr: speed was the reason.
Technically there is an interface to iterate relationships (`jobState.iterateRelationships`), though it isn't used much. Loading relationships would also give us guarantees that we aren't using duplicate keys. Again, not a huge issue, but probably worth doing.
Do you have a sense of how long loading these entities and relationships is taking?
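To illustrate the duplicate-key point above, here is a minimal sketch of why loading cached relationships alongside cached entities matters. The `GraphObject` shape and the `KeyRegistry` class are hypothetical names used only for illustration; they are not the SDK's actual types or API.

```typescript
// Hypothetical minimal shape; real graph objects carry many more fields.
interface GraphObject {
  _key: string;
}

// Hypothetical helper: remembers every _key seen so far, across both
// entities and relationships, and rejects a key that appears twice.
class KeyRegistry {
  private readonly seen = new Set<string>();

  register(objects: GraphObject[]): void {
    for (const { _key } of objects) {
      if (this.seen.has(_key)) {
        throw new Error(`Duplicate _key detected: ${_key}`);
      }
      this.seen.add(_key);
    }
  }
}
```

If only cached entities were registered, a step that re-creates a cached relationship key would go unnoticed; registering both kinds of object closes that gap.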
One more Q: what's the user experience if there is no `graph/${stepId}/entities` directory? Can you post the message printed by the logger?
# Conflicts: # yarn.lock
So much better! Great work and thanks for the screenshots / tests.
I've approved with some comments, would love to see what you have to say about them but overall I'm good with this.
docs/integrations/development.md
3. Execute collection command with `--step` and `--use-dependencies-cache` option specifying a
   filepath to the previously created .j1-cache (most commonly `./`).
I'm thinking through the file system here... Seems that I can also do the following as many times as I want?
1. Execute collection command without the `--use-dependencies-cache` option to gather data in .j1-integration
2. Execute collection command with `--step` and `--use-dependencies-cache` option without specifying a filepath. This will cause the .j1-integration data to populate the .j1-cache.
3. (repeat step two ad infinitum)
Is it true that I can continue to repeat step 2 over and over and it will only ever actually re-ingest the `--step fetch-users` step?
I have the same question as Nick here. If this is true, I suppose running step 2 is essentially "refreshing the cache". I was going to suggest we add an option to clear and refresh the cache but sounds like maybe this would handle the refreshing part anyway. I'm just not sure how intuitive that is vs a separate flag needed to refresh the cache in here and otherwise it will default to using the cache that already lives in the directory. Thoughts?
@ndowmon Yes. If you repeat step 2, it will copy the .j1-integration data over each time and only execute the fetch-users step. If the .j1-integration data goes sideways, you'll end up with a busted cache. Hence the suggestion to specify the filepath: then it'll stop trying to re-create the cache.
@ceelias It isn't clear. I thought about a command as well but that seemed too much. I imagine caches being self-maintaining. Not sure if that's the correct approach. Right now the cache creation destroys the previous cache. We could do more of an update, maybe? That might be tricky to get right.
@ndowmon @ceelias
Crazy idea: what if we turn on caching by default whenever the `--step` flag is used? We could introduce a simple `--no-cache` flag and a `--cache-path` flag. I think this makes the cache helpful and out of the way until you need to deal with it.
The `--step` flag is already used to decrease developer wait time... Why not make it even better with the cache?
On the initial run, when there is no .j1-integration data, it doesn't utilize the cache...
--
I might have been in cacheland for too long now and my mind is skewed.
🤔 hm, that is an interesting idea... although I do think you may be stuck in cacheland 😉
My concern with defaulting to `--no-cache` is that differences in upstream dependencies may (or probably should) cause the chosen `--step` to behave differently. I still think it's best to refresh the dependent steps before running the new step.
packages/integration-sdk-runtime/src/execution/dependencyGraph.ts
 */
async function copyToCache() {
  const graphDirectory = path.join(getRootStorageDirectory(), 'graph');
  if (fs.ensureDir(graphDirectory)) {
This seems like it is an async function? See: https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/types/fs-extra/index.d.ts#L45
Can we install `@types/fs-extra` as a dev dep?
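The concern above can be shown without touching the file system. In this minimal sketch, `directoryExists` is a hypothetical stand-in for any promise-returning helper (it is not an fs-extra function); it always resolves to `false` so the difference between the two checks is visible.

```typescript
// Hypothetical stand-in for a promise-returning directory helper.
// It resolves to false to simulate "the directory is not there".
async function directoryExists(_path: string): Promise<boolean> {
  return false;
}

// Mirrors using the un-awaited call as a condition.
async function checkWithoutAwait(path: string): Promise<boolean> {
  const pending: unknown = directoryExists(path);
  // A Promise is an object, so this is truthy whatever it resolves to.
  return Boolean(pending);
}

// Awaiting first means the condition reflects the resolved value.
async function checkWithAwait(path: string): Promise<boolean> {
  return Boolean(await directoryExists(path));
}
```

With type definitions installed, the compiler can flag a promise used directly as a condition, which is one reason the `@types/fs-extra` suggestion helps here.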
Really solid work! Thanks for working through this! I left a fix for a typo and a few thought-provoking comments.
docs/integrations/development.md
#### `j1-integration sync`
###### And example of the expected cache structure
Think this is a typo here.
- ###### And example of the expected cache structure
+ ###### An example of the expected cache structure
    },
  });
});
I may have missed it but what happens in the case that the integration wasn't run yet and there is no data to copy over to the cache? Might be worth a test here.
If a step is tagged as using the cache but fails to load any entities/relationships, it marks that step as a failure. I'll see if I can get that in here.
I added a test and it caused the other tests to fail. I may come back to this at another time.
@@ -30,15 +33,34 @@ export function collect() {
      collector,
      [],
    )
    .option(
      '-C, --use-dependencies-cache [filePath]',
      'Loads cache for the dependencies required by the step(s) specified in --step option. Execution of these steps is skipped. Data found in .j1-integration is used if no filepath is provided.',
More out of curiosity, but we have the ability to create multiple dependency graphs. Have you tried to run that use case, and does your code account for it? Maybe something we can address in a later PR if not.
I have not. I'd love to see that in action!
Opened #619 to track
- Improved logging
- Installed @types/fs-extra
- Removed need for JUPITERONE_CACHE_DIRECTORY env var
docs/integrations/development.md
###### An example of the expected cache structure

```
.j1-cache/
```
You'll probably want to go back to the `integration-template` project and update the `.gitignore` to include `.j1-cache`.
@VDubber We'll probably want to update this to reflect the new `/graph` nesting.
Crazy idea: what if we turn on caching by default whenever the `--step` flag is used? We could introduce a simple `--no-cache` flag and a `--cache-path` flag. I think this makes the cache helpful and out of the way until you need to deal with it.
The `--step` flag is already used to decrease developer wait time... Why not make it even better with the cache?
On the initial run, when there is no .j1-integration data, it doesn't utilize the cache...
I think this is an excellent idea and should be where we go in the next iteration of this excellent feature.
docs/integrations/development.md
###### An example of the expected cache structure

```
.j1-cache/
```
Was `.j1-integration-cache` considered? I think it would be nice to see it right next to `.j1-integration` in file listings.
Agreed. @VDubber and I talked about this too.
Great minds with the same idea. I'll make it happen.
@@ -0,0 +1,3 @@
export default function validateInvocation() {
  return 'loaded';
IIRC, `validateInvocation` function return values are not used. This may have been copied from other integration fixtures, but I think it would be good at some point to make none of them return something since those values are not meaningful, to avoid any confusion.
Nice work @VDubber! I left some comments. I know that you still have a few changes that you're working on, but I supplied my initial review.
@@ -1143,14 +1144,54 @@ For convenience, steps can allow be provided as a comma delimited list.

ex: `j1-integration collect --step step-fetch-users,step-fetch-groups`

###### `--ignore-step-dependencies`
👍
docs/integrations/development.md
integrations. When no filepath is specified, an attempt to create a cache is
made by copying the contents of `./.j1-integrations/graph` directory to
`./.j1-cache`. The structure of the cache follows a similar format as the
.j1-integration data storage, as described [here](#data-collection).
Super minor, but we wrapped other instances of ".j1-integration" in backticks.
- .j1-integration data storage, as described [here](#data-collection).
+ `.j1-integration` data storage, as described [here](#data-collection).
const sourceGraphDirectory = path.join(getRootStorageDirectory(), 'graph');
const destinationGraphDirectory = path.join(getRootCacheDirectory(), 'graph');

if (fs.pathExistsSync(sourceGraphDirectory)) {
We should change this to be async.
- if (fs.pathExistsSync(sourceGraphDirectory)) {
+ if (await fs.pathExists(sourceGraphDirectory)) {
const { jobState, logger } = context;

let entitiesCount = 0;
await walkDirectory({
We should be able to replace this with `iterateParsedGraphFiles`, which is exported from `packages/integration-sdk-runtime/src/fileSystem.ts`.
Cleanup suggestion...
There is a function called `iterateParsedGraphFiles`, which is exported from `packages/integration-sdk-runtime/src/fileSystem.ts`. I think we could expose two new functions:
- `iteratedParsedEntityGraphFiles`
- `iteratedParsedRelationshipGraphFiles`
Example impl:
export function iteratedParsedEntityGraphFiles(
  iteratee: (entities: Entity[]) => Promise<void>,
  graphPath?: string
) {
  return iterateParsedGraphFiles(async (data) => {
    if (data.entities) await iteratee(data.entities);
  }, graphPath);
}
Example usage:
await iteratedParsedEntityGraphFiles(
  async (entities) => {
    entitiesCount += entities.length;
    await jobState.addEntities(entities);
    status = StepResultStatus.CACHED;
  },
  entitiesPath
);
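The relationship counterpart named in the suggestion above could follow the same pattern. This is only a sketch under stated assumptions: `iterateParsedGraphFilesStub` and `IN_MEMORY_FILES` are hypothetical stand-ins that replace disk access, and the `ParsedGraphFile` shape (optional `entities` and `relationships` arrays) is assumed to mirror the `data.entities` check in the example impl.

```typescript
// Hypothetical minimal shapes, for illustration only.
interface Entity {
  _key: string;
}
interface Relationship {
  _key: string;
}
// Assumed shape of one parsed graph file.
interface ParsedGraphFile {
  entities?: Entity[];
  relationships?: Relationship[];
}

// Hypothetical in-memory data standing in for files on disk.
const IN_MEMORY_FILES: ParsedGraphFile[] = [
  { entities: [{ _key: 'user:1' }] },
  { relationships: [{ _key: 'group:1|has|user:1' }] },
  { relationships: [{ _key: 'group:1|has|user:2' }, { _key: 'group:2|has|user:1' }] },
];

// Stub for iterateParsedGraphFiles: visits each "file" in order.
async function iterateParsedGraphFilesStub(
  iteratee: (data: ParsedGraphFile) => Promise<void>,
): Promise<void> {
  for (const file of IN_MEMORY_FILES) {
    await iteratee(file);
  }
}

// Relationship-only wrapper, mirroring the entity wrapper in the comment.
async function iteratedParsedRelationshipGraphFiles(
  iteratee: (relationships: Relationship[]) => Promise<void>,
): Promise<void> {
  return iterateParsedGraphFilesStub(async (data) => {
    if (data.relationships) await iteratee(data.relationships);
  });
}
```

Files that contain only entities are skipped by the wrapper, so callers can count or load relationships without inspecting each file's contents themselves.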
if (entitiesCount || relationshipCount) {
  logger.info(
    `Loaded ${entitiesCount} entities and ${relationshipCount} relationship(s) from cache.`,
Minor:
It is a good practice to place all variables in the logger function's first arg instead of using string interpolation. String interpolated logs are slow to search.
Example of a fast log search term:
  Loaded entities and relationship(s) from cache.
Example of a slow search term, for which we will need to use the log search "contains" operator to find all instances of our log message (e.g. ~= "relationship(s) from cache."):
  Loaded 10 entities and 10 relationship(s) from cache.
Suggestion:
logger.info({
  entitiesCount,
  relationshipCount
}, 'Loaded entities and relationship(s) from cache.');
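The search argument above can be sketched with a tiny recording logger. `RecordingLogger` and `findByMessage` are hypothetical helpers written for this illustration; they only imitate the "fields object first, constant message second" call shape used in the suggestion and are not the SDK's logger.

```typescript
type LogFields = Record<string, unknown>;

interface LogRecord {
  fields: LogFields;
  msg: string;
}

// Hypothetical logger that stores records instead of printing them.
class RecordingLogger {
  readonly records: LogRecord[] = [];

  info(fields: LogFields, msg: string): void {
    this.records.push({ fields, msg });
  }
}

// Exact-match lookup, the "fast" search described in the comment.
function findByMessage(records: LogRecord[], msg: string): LogRecord[] {
  return records.filter((record) => record.msg === msg);
}

const logger = new RecordingLogger();
const MESSAGE = 'Loaded entities and relationship(s) from cache.';

// Two runs with different counts still share one constant message.
logger.info({ entitiesCount: 10, relationshipCount: 10 }, MESSAGE);
logger.info({ entitiesCount: 3, relationshipCount: 0 }, MESSAGE);
```

Because the counts live in the fields object, one exact-match query on the message finds every occurrence, and the counts remain available as structured values.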
if (dependenciesCache?.enabled) {
  ctx.logger.info(
    `Using dependencies cache found at ${dependenciesCache.filepath}`,
Same feedback about logging.
- `Using dependencies cache found at ${dependenciesCache.filepath}`,
+ { cacheFilePath: dependenciesCache.filePath }, 'Using dependencies cache',
for (const stepId of allStepIds) {
  const originalValue = originalEnabledRecord[stepId] ?? {};
  if (stepsToRun.includes(stepId)) {
    enabledRecord[stepId] = {
      ...originalValue,
      disabled: false,
      ...(dependenciesCache?.enabled &&
Would you mind extracting this out into a function? I think it will be a bit easier for readers to understand.
@@ -17,12 +17,12 @@ import { readGraphObjectFile } from './storage/FileSystemGraphObjectStore/indice
const brotliCompress = promisify(zlib.brotliCompress);
const brotliDecompress = promisify(zlib.brotliDecompress);

export const DEFAULT_CACHE_DIRECTORY_NAME = '.j1-integration';
export const DEFAULT_STORAGE_DIRECTORY_NAME = '.j1-integration';
👍
Introduce --no-cache option and --cache-path option. If a cache is not available for a step, it executes it like normal.
- rename .j1-cache to .j1-integration-cache
- Additional cleanup
The `--use-dependencies-cache` option was replaced by an "always on" cache when the `--step` flag is being used. If desired, it can be turned off with `--no-cache`. A file path to a separate cache can be supplied with `--cache-path`.
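One way to read the final behavior is as a small decision function. This is a sketch, not the merged implementation: `CollectFlags`, `CacheSettings`, and `resolveCacheSettings` are hypothetical names, and the rule "no `--step` means no cache" is an assumption drawn from the summary above.

```typescript
// Hypothetical parsed CLI flags for the collect command.
interface CollectFlags {
  step: string[]; // step ids passed via --step
  noCache?: boolean; // true when --no-cache is given
  cachePath?: string; // value of --cache-path, if any
}

// Hypothetical result shape, loosely following the `enabled` and
// `filepath` property names seen in the review snippets.
interface CacheSettings {
  enabled: boolean;
  filepath: string | undefined;
}

// Assumption: the cache is on exactly when --step is used
// and --no-cache is not.
function resolveCacheSettings(flags: CollectFlags): CacheSettings {
  const enabled = flags.step.length > 0 && flags.noCache !== true;
  return {
    enabled,
    // Only carry a path when the cache is actually in use.
    filepath: enabled ? flags.cachePath : undefined,
  };
}
```

For example, `resolveCacheSettings({ step: ['fetch-users'], cachePath: './other' })` yields an enabled cache pointing at `./other`, while adding `noCache: true` disables it regardless of the path.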