INT-2469 Add cache when using --step option #606

Merged
merged 16 commits into from
Feb 15, 2022
Changes from 5 commits
77 changes: 54 additions & 23 deletions docs/integrations/development.md
@@ -1051,7 +1051,7 @@ For convenience when developing locally, we will also look for a
Initially, the CLI will support a limited interface consisting of only three
commands: `collect`, `sync`, and `run`.

#### `j1-integration collect`
#### Command `j1-integration collect`

`j1-integration collect` will run the js framework locally to _only_ perform
data collection. The `collect` command is designed to work closely with the
@@ -1100,15 +1100,15 @@ integration expects to see set.

##### Options

###### `--module` or `-m`
###### Option `--module` or `-m`

If you prefer not to place your integration configuration in one of the
supported file paths, you can optionally specify the `--module` or `-m` option
and provide a path to your integration file.

ex: `j1-integration collect --module path/to/my/integration.ts`

###### `--instance` or `-i`
###### Option `--instance` or `-i`

If you are working with an existing integration instance and would prefer to
leverage the configuration field values from that instance, you can optionally
@@ -1120,15 +1120,15 @@ input some credentials or provide an `--api-key` option.

ex: `j1-integration collect --instance <integration instance id>`

###### `--api-key` or `-k`
###### Option `--api-key` or `-k`

For developers that have an API key or prefer to not input credentials, an
`--api-key` option can be specified to access the synchronization API.

ex:
`j1-integration collect --instance <integration instance id> --api-key <my api key>`

###### `--step` or `-s`
###### Option `--step` or `-s`

For larger integrations, a full collection run may take a long time. To help
address this, a `--step` option can be provided to selectively run a step along
@@ -1143,14 +1143,45 @@ For convenience, steps can also be provided as a comma delimited list.

ex: `j1-integration collect --step step-fetch-users,step-fetch-groups`
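As an illustration of the comma delimited form, here is a minimal sketch of an accumulator that splits each `--step` value into individual step ids. The function name `collectSteps` and the extra id `step-fetch-roles` are hypothetical, and the assumption that `--step` may be repeated is not confirmed by this document; the SDK's own collector may behave differently.

```typescript
// Hypothetical accumulator in the style of a CLI option collector.
// Each --step value may hold one id or a comma delimited list of ids.
function collectSteps(value: string, previous: string[]): string[] {
  const ids = value
    .split(',')
    .map((id) => id.trim())
    .filter((id) => id.length > 0); // drop empty fragments such as "a,,b"
  return [...previous, ...ids];
}

// Simulates: --step step-fetch-users,step-fetch-groups --step step-fetch-roles
// (assumes the option can be passed more than once).
const rawValues = ['step-fetch-users,step-fetch-groups', 'step-fetch-roles'];
const selectedSteps = rawValues.reduce(
  (acc, raw) => collectSteps(raw, acc),
  [] as string[],
);

console.log(selectedSteps);
```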

###### `--ignore-step-dependencies`
Contributor:

👍

###### Option `--use-dependencies-cache` or `-C`

If you only want to run a single step or an explicit list of steps without
invoking the dependencies of those steps, you can do so via the
`--ignore-step-dependencies` flag. This is useful for speeding up testing by
utilizing the data that has already been collected and stored on disk.
Allows the steps that the specified step(s) depend on to skip execution and instead load previously captured results from disk.
The intent is to speed up development of new integrations.
When no filepath is specified, an attempt is made to create the cache by
copying the contents of the `./.j1-integration/graph` directory to `./.j1-cache`.
The structure of the cache follows a format similar to the `.j1-integration` data storage, as described [here](#data-collection).

#### `j1-integration sync`
###### And example of the expected cache structure
Contributor:

Think this is a typo here.

Suggested change
###### And example of the expected cache structure
###### An example of the expected cache structure

```
.j1-cache/
  /step-fetch-accounts
    /entities/
      11fa25fb-dfbf-43b8-a6e1-017ad369fe98.json
  /step-fetch-users
    /entities
      9cb7bee4-c037-4041-83b7-d532488f26a3.json
      96992893-898d-4cda-8129-4695b0323642.json
    /relationships
      8fcc6865-817d-4952-ac53-8248b357b5d8.json
```

Review thread on the `.j1-cache/` line:

Contributor:

Was .j1-integration-cache considered? I think it would be nice to see it right next to .j1-integration in file listings.

Contributor:

Agreed. @VDubber and I talked about this too.

Contributor Author:

Great minds with the same idea. I'll make it happen.

Contributor:

You'll probably want to go back to the integration-template project and update the .gitignore to include .j1-cache.

Contributor:

@VDubber We'll probably want to update this to reflect the new /graph nesting.

ex: `j1-integration collect --step fetch-users --use-dependencies-cache` - Builds & uses cache from .j1-integration
ex: `j1-integration collect --step fetch-users --use-dependencies-cache ./` - Uses .j1-cache found in the root of the project
ex: `j1-integration collect --step fetch-users --use-dependencies-cache ./path-to-cache` - Uses .j1-cache found in the path specified
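The three forms above map to three values of the parsed option: absent, present without a filepath, and present with a filepath. The following sketch shows one way to classify them. The `CacheMode` type and `resolveCacheMode` function are hypothetical names introduced for illustration; only the `true` versus string distinction is taken from the `collect.ts` change in this pull request.

```typescript
// Hypothetical classification of the parsed --use-dependencies-cache value.
type CacheMode =
  | { kind: 'disabled' } // option not provided
  | { kind: 'seed-from-storage' } // no filepath: build .j1-cache from .j1-integration
  | { kind: 'use-existing'; baseDirectory: string }; // filepath: reuse .j1-cache found there

function resolveCacheMode(
  useDependenciesCache: string | boolean | undefined,
): CacheMode {
  if (typeof useDependenciesCache === 'string') {
    return { kind: 'use-existing', baseDirectory: useDependenciesCache };
  }
  if (useDependenciesCache === true) {
    return { kind: 'seed-from-storage' };
  }
  return { kind: 'disabled' };
}

console.log(resolveCacheMode(true));
console.log(resolveCacheMode('./path-to-cache'));
```

Keeping the classification in one pure function makes the "no filepath rebuilds the cache" behavior explicit, which is the point the reviewers discuss in this thread.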

A common use pattern:
1. Execute collection command _without_ the `--use-dependencies-cache` option to gather data in .j1-integration
2. Execute collection command with `--step` and `--use-dependencies-cache` option _without_ specifying a filepath.
This will cause the .j1-integration data to populate the .j1-cache.
3. Execute collection command with `--step` and `--use-dependencies-cache` option specifying a
filepath to the previously created .j1-cache (most commonly `./`).
Contributor:

I'm thinking through the file system here... Seems that I can also do the following as many times as I want?

  1. Execute collection command without the --use-dependencies-cache option to gather data in .j1-integration
  2. Execute collection command with --step and --use-dependencies-cache option without specifying a filepath.
    This will cause the .j1-integration data to populate the .j1-cache.
  3. (repeat step two ad infinitum)

Is it true that I can continue to repeat step 2 over and over and it will only ever actually re-ingest the --step fetch-users step?

Contributor:

I have the same question as Nick here. If this is true, I suppose running step 2 is essentially "refreshing the cache". I was going to suggest we add an option to clear and refresh the cache but sounds like maybe this would handle the refreshing part anyway. I'm just not sure how intuitive that is vs a separate flag needed to refresh the cache in here and otherwise it will default to using the cache that already lives in the directory. Thoughts?

Contributor Author:

@ndowmon Yes. if you repeat step 2 it will copy over the .j1-integration data over and over and only execute the fetch-users step. If the .j1-integration data goes sideways, you'll end up with a cache that is busted. Hence the suggestion to specify the filepath - then it'll stop trying to create the cache.

@ceelias It isn't clear. I thought about a command as well but that seemed too much. I imagine caches being self-maintaining. Not sure if that's the correct approach. Right now the cache creation destroys the previous cache. We could do more of an update, maybe?.... That might be tricky to get right.

Contributor Author (VDubber, Feb 11, 2022):

@ndowmon @ceelias
Crazy idea: what if we turn on caching by default whenever the --step flag is used? We could introduce a simple --no-cache flag and -cache-path flag. I think this makes the cache helpful and out of the way until you need to deal with it.

The --step flag is already used to decrease developer wait time... Why not make it even better with the cache?

On the initial run, when there is no .j1-integration data, it doesn't utilize the cache...

--
I might be in cacheland for too long now and my mind is skewed.

Contributor:

🤔 hm, that is an interesting idea... although I do think you may be stuck in cacheland 😉

My concern with defaulting to --no-cache is that differences in upstream dependencies may (or probably should) cause the chosen --step to behave differently. I still think it's best to refresh the dependent steps before running the new step.



###### Option `--disable-schema-validation` or `-V`

Disables schema validation.

#### Command `j1-integration sync`

The `sync` command will validate that data placed in the `.j1-integration/graph`
directory has been formatted correctly and later format the data to allow for
@@ -1202,35 +1233,35 @@ framework for developing integrations, let us know!

##### Options

###### `--module` or `-m`
###### Option `--module` or `-m`

Much like the `collect` command, you can optionally specify a `--module` or
`-m` option to specify the path to the integration configuration file.

###### `--instance` or `-i`
###### Option `--instance` or `-i`

For the `sync` command, an integration instance must be specified to know which
integration instance data the collected data should be associated with.

ex:
`j1-integration sync --instance <integration instance id> --api-key <my api key>`

###### `--api-key` or `-k`
###### Option `--api-key` or `-k`

Like the `collect` command, an API key can be optionally passed in to use for
synchronization.

ex:
`j1-integration sync --instance <integration instance id> --api-key <my api key>`

###### `--tail` or `-t`
###### Option `--tail` or `-t`

If provided, this option polls the integration job and displays the status of
the job run. The polling will stop once the job is marked as complete.

ex: `j1-integration sync --instance <integration instance id> --tail`
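The polling behavior described for `--tail` can be sketched as a simple loop. Everything named here is an assumption made for illustration: the `JobStatus` values, the `tailJob` function, and the injected `fetchStatus` callback do not come from the SDK, and the real implementation may poll a different endpoint or use a different stop condition.

```typescript
// Hypothetical job states; the real synchronization API may use other values.
type JobStatus = 'IN_PROGRESS' | 'COMPLETE';

// Polls until the job reports completion, reporting each observed status.
async function tailJob(
  fetchStatus: () => Promise<JobStatus>,
  onStatus: (status: JobStatus) => void,
  intervalMs = 1000,
): Promise<void> {
  for (;;) {
    const status = await fetchStatus();
    onStatus(status);
    if (status === 'COMPLETE') {
      return; // stop polling once the job is marked as complete
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Usage with a fake status source standing in for the job API.
const fakeStatuses: JobStatus[] = ['IN_PROGRESS', 'IN_PROGRESS', 'COMPLETE'];
const observed: JobStatus[] = [];
const tailDone = tailJob(
  async () => fakeStatuses.shift() ?? 'COMPLETE',
  (status) => {
    observed.push(status);
  },
  0,
);
```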

#### `j1-integration run`
#### Command `j1-integration run`

The `j1-integration run` command combines the functionality of the `collect` and
`sync` commands, essentially running the commands back to back.
@@ -1250,7 +1281,7 @@ the
`https://api.us.jupiterone.io/synchronization/:integrationInstanceId/jobs/:jobId/events`
API.

#### `j1-integration visualize`
#### Command `j1-integration visualize`

The `j1-integration visualize` command reads JSON files from the
`.j1-integration/graph` directory and generates a visualization of the data
@@ -1291,7 +1322,7 @@ files:
}
```

#### `j1-integration visualize-types`
#### Command `j1-integration visualize-types`

```
Usage: j1-integration visualize-types [options]
@@ -1308,7 +1339,7 @@ Options:
`j1-integration visualize-types` generates a [visjs](http://www.visjs.org) graph
based on the metadata defined in each step.

#### `j1-integration document`
#### Command `j1-integration document`

```
Usage: j1-integration document [options]
@@ -1325,7 +1356,7 @@ Options:
on the metadata defined in each step. Documentation for an integration is stored
in the `{integration-project-dir}/docs/jupiterone.md` file by default.

#### `j1-integration validate-question-file`
#### Command `j1-integration validate-question-file`

JupiterOne managed question files live under an integration's `/jupiterone`
directory. For example `/jupiterone/questions.yaml`. The
@@ -1349,15 +1380,15 @@ Options:

##### More commands and options

###### `j1-integration plan`
###### Command `j1-integration plan`

We hope to make it easy for developers to understand how an integration collects
data and the order in which it performs work.

We hope to support a `j1-integration plan` command to display the dependency
graph of the steps and types required for a successful integration run.

###### `j1-integration sync --dry-run`
###### Command `j1-integration sync --dry-run`

A developer may want to have a better understanding of how synchronization of
collected data may affect their JupiterOne graph. We plan to support a
@@ -1368,7 +1399,7 @@ This dry run function will give metrics about how many creates, updates, and
deletes will be performed, categorized by the entity and relationship `_type`
field.

###### `j1-integration generate`
###### Command `j1-integration generate`

A project generator might be helpful for getting new integration developers up
and running quickly. For our own integration developers, it would provide a
1 change: 1 addition & 0 deletions packages/integration-sdk-cli/package.json
@@ -26,6 +26,7 @@
"chalk": "^4",
"commander": "^5.0.0",
"globby": "^11.0.0",
"fs-extra": "^10.0.0",
"js-yaml": "^4.1.0",
"json-diff": "^0.5.4",
"lodash": "^4.17.19",
64 changes: 64 additions & 0 deletions packages/integration-sdk-cli/src/__tests__/cli.test.ts
@@ -48,6 +48,70 @@ describe('collect', () => {
loadProjectStructure('instanceWithDependentSteps');
});

  test('option --use-dependencies-cache requires option --step', async () => {
    await expect(
      createCli().parseAsync([
        'node',
        'j1-integration',
        'collect',
        '--use-dependencies-cache',
      ]),
    ).rejects.toThrowError(
      'Invalid option: Option --use-dependencies-cache requires option --step to also be specified.',
    );
  });

  test('option --use-dependencies-cache limits steps to those specified', async () => {
    loadProjectStructure('instanceWithDependentIgnoredSteps');

    await createCli().parseAsync([
      'node',
      'j1-integration',
      'collect',
      '--step',
      'fetch-groups',
      '--use-dependencies-cache',
    ]);

    expect(log.displayExecutionResults).toHaveBeenCalledTimes(1);
    expect(log.displayExecutionResults).toHaveBeenCalledWith({
      integrationStepResults: [
        {
          id: 'fetch-accounts',
          name: 'Fetch Accounts',
          declaredTypes: ['my_account'],
          dependsOn: undefined,
          partialTypes: [],
          encounteredTypes: [],
          status: StepResultStatus.CACHED,
        },
        {
          id: 'fetch-groups',
          dependsOn: ['fetch-accounts'],
          name: 'Fetch Groups',
          declaredTypes: ['my_groups'],
          partialTypes: [],
          encounteredTypes: [],
          status: StepResultStatus.SUCCESS,
        },
        {
          id: 'fetch-users',
          name: 'Fetch Users',
          declaredTypes: ['my_user'],
          dependsOn: undefined,
          partialTypes: [],
          encounteredTypes: [],
          status: StepResultStatus.DISABLED,
        },
      ],
      metadata: {
        partialDatasets: {
          types: [],
        },
      },
    });
  });
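The expected result in the test above (dependency is cached, requested step runs, unrelated step is disabled) can be reproduced with a small planning function. This is a sketch under stated assumptions, not the SDK's algorithm: `StepDefinition`, `PlannedStatus`, and `planStepStatuses` are hypothetical names, and the string labels stand in for the `StepResultStatus` members seen in the test.

```typescript
// Hypothetical minimal step shape; real steps carry more fields.
interface StepDefinition {
  id: string;
  dependsOn?: string[];
}

// Hypothetical labels standing in for StepResultStatus members.
type PlannedStatus = 'run' | 'cached' | 'disabled';

function planStepStatuses(
  steps: StepDefinition[],
  requested: string[],
): Map<string, PlannedStatus> {
  const byId = new Map(
    steps.map((step): [string, StepDefinition] => [step.id, step]),
  );

  // Collect every transitive dependency of the requested steps.
  const dependencies = new Set<string>();
  const visit = (id: string): void => {
    for (const dep of byId.get(id)?.dependsOn ?? []) {
      if (!dependencies.has(dep)) {
        dependencies.add(dep);
        visit(dep);
      }
    }
  };
  requested.forEach(visit);

  const plan = new Map<string, PlannedStatus>();
  for (const step of steps) {
    if (requested.includes(step.id)) {
      plan.set(step.id, 'run'); // explicitly requested steps always execute
    } else if (dependencies.has(step.id)) {
      plan.set(step.id, 'cached'); // dependencies load results from the cache
    } else {
      plan.set(step.id, 'disabled'); // everything else is skipped
    }
  }
  return plan;
}

const plan = planStepStatuses(
  [
    { id: 'fetch-accounts' },
    { id: 'fetch-groups', dependsOn: ['fetch-accounts'] },
    { id: 'fetch-users' },
  ],
  ['fetch-groups'],
);
console.log(Array.from(plan.entries()));
```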

Contributor:

I may have missed it but what happens in the case that the integration wasn't run yet and there is no data to copy over to the cache? Might be worth a test here.

Contributor Author:

If a step is tagged as using the cache but fails to load any entities/relationships, it marks that step as a failure. I'll see if I can get that in here.

Contributor Author:

I added a test and it caused the other tests to fail. I may come back to this at another time.

test('loads the integration, executes it, and logs the result', async () => {
await createCli().parseAsync(['node', 'j1-integration', 'collect']);

55 changes: 55 additions & 0 deletions packages/integration-sdk-cli/src/commands/collect.ts
@@ -1,9 +1,12 @@
import { createCommand } from 'commander';
import path from 'path';
import fs from 'fs-extra';

import {
executeIntegrationLocally,
FileSystemGraphObjectStore,
getRootCacheDirectory,
getRootStorageDirectory,
prepareLocalStepCollection,
} from '@jupiterone/integration-sdk-runtime';

@@ -30,15 +33,34 @@ export function collect() {
collector,
[],
)
.option(
'-C, --use-dependencies-cache [filePath]',
'Loads cache for the dependencies required by the step(s) specified in --step option. Execution of these steps is skipped. Data found in .j1-integration is used if no filepath is provided.',
Contributor:

More so for curiosity, but we have the ability to create multiple dependency graphs. Have you tried to run that use case/does your code account for that? Maybe something we can address in a later PR if not.

Contributor Author:

I have not. I'd love to see that in action!

Contributor:

Opened #619 to track

)
.option('-V, --disable-schema-validation', 'disable schema validation')
.action(async (options) => {
if (options.useDependenciesCache && options.step.length === 0) {
throw new Error(
'Invalid option: Option --use-dependencies-cache requires option --step to also be specified.',
);
}

// Point `fileSystem.ts` functions to expected location relative to
// integration project path.
process.env.JUPITERONE_INTEGRATION_STORAGE_DIRECTORY = path.resolve(
options.projectPath,
'.j1-integration',
);

if (
typeof options.useDependenciesCache === 'string' ||
options.useDependenciesCache instanceof String
) {
setupCacheDirectory(options.useDependenciesCache);
} else if (options.useDependenciesCache === true) {
await copyToCache();
}

const config = prepareLocalStepCollection(
await loadConfig(path.join(options.projectPath, 'src')),
options,
@@ -60,10 +82,43 @@ export function collect() {
},
{
enableSchemaValidation,
useDependenciesCache: options.useDependenciesCache,
graphObjectStore,
},
);

log.displayExecutionResults(results);
});
}

/**
 * Sets environment variable JUPITERONE_CACHE_DIRECTORY
 * to be used in reading & writing to cache.
 * @param useDependenciesCache
 */
function setupCacheDirectory(useDependenciesCache) {
  process.env.JUPITERONE_CACHE_DIRECTORY = path.resolve(
    useDependenciesCache,
    '.j1-cache',
  );

  log.info(
    `Set dependencies cache location to ${process.env.JUPITERONE_CACHE_DIRECTORY}`,
  );
}

/**
 * When no filepath is specified, the .j1-integration directory
 * is copied to .j1-cache
 */
async function copyToCache() {
  const graphDirectory = path.join(getRootStorageDirectory(), 'graph');
  if (fs.ensureDir(graphDirectory)) {

Contributor:

This seems like it is an async function? See: https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/types/fs-extra/index.d.ts#L45

Can we install @types/fs-extra as a dev dep?

    await fs.emptyDir(getRootCacheDirectory());
    await fs.copy(graphDirectory, getRootCacheDirectory()).catch((error) => {
      log.error(`Failed to seed .j1-cache from .j1-integration`);
      log.error(error);
    });
    log.info(`Copied graph data in .j1-integration to .j1-cache`);
  }
}
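The reviewer's point about `ensureDir` returning a promise can be illustrated with a standalone sketch. It deliberately swaps in Node's built-in `fs` promises API instead of `fs-extra`, and the names `pathExists`, `seedCache`, and `demo` are hypothetical. It assumes the intent is "seed the cache only when graph data already exists", which the pull request does not state explicitly.

```typescript
import { promises as fs } from 'fs';
import * as os from 'os';
import * as path from 'path';

// Resolves to true only when the path can be accessed.
async function pathExists(target: string): Promise<boolean> {
  try {
    await fs.access(target);
    return true;
  } catch {
    return false;
  }
}

// Replaces cacheDirectory with a copy of <storageDirectory>/graph.
// Returns false (and leaves the cache untouched) when no graph data exists.
async function seedCache(
  storageDirectory: string,
  cacheDirectory: string,
): Promise<boolean> {
  const graphDirectory = path.join(storageDirectory, 'graph');
  if (!(await pathExists(graphDirectory))) {
    return false; // awaited check, so a missing directory is detected
  }
  await fs.rm(cacheDirectory, { recursive: true, force: true });
  await fs.cp(graphDirectory, cacheDirectory, { recursive: true });
  return true;
}

// Exercises both branches inside a throwaway temp directory.
async function demo(): Promise<{
  seededWhenMissing: boolean;
  seededWhenPresent: boolean;
  cachedFileExists: boolean;
}> {
  const root = await fs.mkdtemp(path.join(os.tmpdir(), 'j1-cache-sketch-'));
  const storage = path.join(root, '.j1-integration');
  const cache = path.join(root, '.j1-cache');

  const seededWhenMissing = await seedCache(storage, cache);

  const entities = path.join(storage, 'graph', 'step-fetch-accounts', 'entities');
  await fs.mkdir(entities, { recursive: true });
  await fs.writeFile(path.join(entities, 'example.json'), '[]');

  const seededWhenPresent = await seedCache(storage, cache);
  const cachedFileExists = await pathExists(
    path.join(cache, 'step-fetch-accounts', 'entities', 'example.json'),
  );

  await fs.rm(root, { recursive: true, force: true });
  return { seededWhenMissing, seededWhenPresent, cachedFileExists };
}
```

Unlike the code under review, this sketch lets a failed copy reject instead of logging and continuing, so a success message would never follow a failed seed.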
1 change: 1 addition & 0 deletions packages/integration-sdk-cli/src/log.ts
@@ -100,6 +100,7 @@ function logStepStatus(stepResult: IntegrationStepResult) {
function getStepStatusText(status: StepResultStatus) {
switch (status) {
case StepResultStatus.SUCCESS:
case StepResultStatus.CACHED:
return chalk.green(status);
case StepResultStatus.FAILURE:
return chalk.red(status);
1 change: 1 addition & 0 deletions packages/integration-sdk-core/src/types/config.ts
@@ -36,6 +36,7 @@ export interface InvocationConfig<
> {
validateInvocation?: InvocationValidationFunction<TExecutionContext>;
getStepStartStates?: GetStepStartStatesFunction<TExecutionContext>;
dependentSteps?: string[];
integrationSteps: Step<TStepExecutionContext>[];
normalizeGraphObjectKey?: KeyNormalizationFunction;
beforeAddEntity?: BeforeAddEntityHookFunction<TExecutionContext>;