Update Userguide documentation for v2.3 updates (#1100)
### Feature or Bugfix
- Documentation

### Detail
- Updates to data.all Userguide to reflect changes for v2.3 release

### Relates
- N/A

### Security
N/A
Please answer the questions below briefly where applicable, or write `N/A`.
Based on [OWASP Top 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this includes
  fetching data from storage outside the application (e.g. a database, an S3 bucket)?
  - Is the input sanitized?
  - What precautions are you taking before deserializing the data you consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires authorization?
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use standard, proven implementations?
  - Are the keys used controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.

---------

Co-authored-by: dlpzx <[email protected]>
noah-paige and dlpzx authored Mar 13, 2024
1 parent 91dd9fd commit 25a8a58
Showing 13 changed files with 104 additions and 69 deletions.
Binary file modified UserGuide.pdf
50 changes: 30 additions & 20 deletions documentation/userguide/docs/datasets.md
@@ -91,24 +91,23 @@ On left pane choose **Datasets**, then click on the **Create** button. Fill the
| Confidentiality | Level of confidentiality: Unclassified, Official or Secret | Yes | Yes | Secret
| Topics | Topics that can later be used in the Catalog | Yes, at least 1 | Yes | Finance
| Tags | Tags that can later be used in the Catalog | Yes, at least 1 | Yes | deleteme, ds
+ | Auto Approval | Whether shares for this dataset need approval from dataset owners/stewards | Yes (default `Disabled`) | Yes | Disabled, Enabled

## :material-import: **Import a dataset**


- If you already have data stored on Amazon S3 buckets in your data.all environment, data.all got you covered with the import feature. In addition to
+ If you already have data stored on Amazon S3 buckets in your data.all environment, data.all has got you covered with the import feature. In addition to
the fields of a newly created dataset you have to specify the S3 bucket and optionally a Glue database and a KMS key Alias. If the Glue database
is left empty, data.all will create a Glue database pointing at the S3 Bucket. As for the KMS key Alias, data.all assumes that if nothing is specified
- the S3 Bucket is encrypted with SSE-S3 encryption.
+ the S3 Bucket is encrypted with SSE-S3 encryption. Data.all performs a validation check to ensure the KMS Key Alias provided (if any) is the one that encrypts the S3 Bucket specified.
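As an illustration of what that validation involves, the sketch below checks that a KMS key alias resolves to the key configured as the bucket's default encryption key. This is illustrative only; the function name and logic are not data.all's actual implementation:

```
import boto3

def kms_alias_matches_bucket(bucket_name, kms_alias):
    """Illustrative check: does alias/<kms_alias> resolve to the KMS key
    configured as the bucket's default encryption key?"""
    s3 = boto3.client("s3")
    kms = boto3.client("kms")
    enc = s3.get_bucket_encryption(Bucket=bucket_name)
    rule = enc["ServerSideEncryptionConfiguration"]["Rules"][0]
    sse = rule["ApplyServerSideEncryptionByDefault"]
    if sse["SSEAlgorithm"] != "aws:kms":
        # Bucket uses SSE-S3: no KMS key alias should be supplied on import
        return kms_alias is None
    bucket_key = sse.get("KMSMasterKeyID", "")
    key_meta = kms.describe_key(KeyId=f"alias/{kms_alias}")["KeyMetadata"]
    # KMSMasterKeyID may hold either the key ID or the full key ARN
    return bucket_key in (key_meta["KeyId"], key_meta["Arn"])
```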

!!! danger "Imported KMS key and S3 Bucket policies requirements"
    The data.all pivot role will handle data sharing on the imported Bucket and KMS key (if imported). Make sure that
    the resource policies allow the pivot role to manage them. For the KMS key policy, explicit permissions are needed. See an example below.


### KMS key policy
- In the KMS key policy we need to grant explicit permission to the pivot role. Note that this block is needed even if
- permissions for the principal `"AWS": "arn:aws:iam::111122223333:root"` are given.
+ In the KMS key policy we need to grant explicit permission to the pivot role. At a minimum, the following permissions are needed for the pivotRole:

```
{
@@ -126,29 +126,45 @@ permissions for the principal "AWS": "arn:aws:iam::111122223333:root" are given.
"kms:ReEncrypt*",
"kms:TagResource",
"kms:UntagResource"
'kms:DescribeKey'
],
"Resource": "*"
}
```
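Putting it together, a full policy statement for the pivot role could look like the sketch below. The statement ID, account ID, and pivot role name are placeholders, and any actions not shown in the snippet above are assumptions to be checked against the data.all documentation for your version:

```
{
    "Sid": "Enable Pivot Role Permissions",
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/dataallPivotRole"
    },
    "Action": [
        "kms:Decrypt",
        "kms:Encrypt",
        "kms:GenerateDataKey*",
        "kms:GetKeyPolicy",
        "kms:PutKeyPolicy",
        "kms:ReEncrypt*",
        "kms:TagResource",
        "kms:UntagResource",
        "kms:DescribeKey"
    ],
    "Resource": "*"
}
```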

- !!!success "Update imported Datasets"
-     Imported keys are an addition of the v1.6.0 release. Any previously imported bucket will have a KMS Key Alias set to `Undefined`.
-     If that is the case and you want to update the Dataset and import a KMS key Alias, data.all lets you edit the Dataset in the
-     **Edit** window.




![import_dataset](pictures/datasets/import_dataset.png#zoom#shadow)

| Field | Description | Required | Editable |Example
|------------------------|-------------------------------------------------------------------------------------------------|----------|----------|-------------
| Amazon S3 bucket name | Name of the S3 bucket you want to import | Yes | No |DOC-EXAMPLE-BUCKET
- | Amazon KMS key Alias | Alias of the KMS key used to encrypt the S3 Bucket (do not include alias/<ALIAS>, just <ALIAS>) | Yes | No |somealias
+ | Amazon KMS key Alias | Alias of the KMS key used to encrypt the S3 Bucket (do not include alias/<ALIAS>, just <ALIAS>) | No | No |somealias
| AWS Glue database name | Name of the Glue database that you want to import | No | No |anyDatabase

+ !!!success "Update imported Datasets"
+     Imported keys are an addition of the v1.6.0 release. Any previously imported bucket will have a KMS Key Alias set to `Undefined`.
+     If that is the case and you want to update the Dataset and import a KMS key Alias, data.all lets you edit the Dataset in the
+     **Edit** window.

![import_dataset](pictures/datasets/import_dataset.png#zoom#shadow)
### (Going Further) Support for Datasets with Externally-Managed Glue Catalog

If the dataset you are trying to import relates to a Glue Database that is managed in a separate account, data.all's import dataset feature can also handle importing and sharing these types of datasets. Assuming the following pre-requisites are complete:

- There exists an AWS Account (i.e. the Catalog Account) which:
    - Is onboarded as a data.all environment (e.g. Env A)
    - Contains the Glue Database with a Location URI (the S3 path from the Dataset Producer Account) AND Tables
    - Has a resource tag `owner_account_id=<PRODUCER_ACCOUNT_ID>` on the Glue Database
    - Has a Data Lake Location registered in LakeFormation, with the role used to register it having permissions on the S3 Bucket from the Dataset Producer Account
    - Has a Resource Link created on the Glue Database to grant the Dataset Producer Account permissions on the Database and Tables

- There exists another AWS Account (i.e. the Dataset Producer Account) which:
    - Is onboarded as a data.all environment (e.g. Env B)
    - Contains the S3 Bucket that holds the data (used as the S3 path in the Catalog Account)

The data.all producer, a member of the Env B team(s), would import the dataset by specifying the S3 bucket name as the bucket that exists in the Dataset Producer Account and the Glue database name as the name of the Glue DB resource link in the Dataset Producer Account.

This dataset will then be properly imported and can be discovered and shared the same way as any other dataset in data.all.
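As a rough sketch of the setup above (account IDs, database names, and the region are placeholders, and each call is assumed to run with credentials for the indicated account), the tagging and resource-link steps could look like:

```
import boto3

# In the Catalog Account (Env A): tag the Glue database so data.all can
# resolve which account owns the underlying data
catalog_glue = boto3.client("glue", region_name="eu-west-1")
catalog_glue.tag_resource(
    ResourceArn="arn:aws:glue:eu-west-1:CATALOG_ACCOUNT_ID:database/source_db",
    TagsToAdd={"owner_account_id": "PRODUCER_ACCOUNT_ID"},
)

# In the Dataset Producer Account (Env B): a resource link pointing at the
# catalog database; its name is what the dataset import references
producer_glue = boto3.client("glue", region_name="eu-west-1")
producer_glue.create_database(
    DatabaseInput={
        "Name": "source_db_link",
        "TargetDatabase": {
            "CatalogId": "CATALOG_ACCOUNT_ID",
            "DatabaseName": "source_db",
        },
    }
)
```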

## :material-card-search-outline: **Navigate dataset tabs**

@@ -182,7 +197,7 @@ Moreover, AWS information related to the resources created by the dataset CloudF
AWS Account, Dataset S3 bucket, Glue database, IAM role and KMS Alias.

You can also assume this IAM role to access the S3 bucket in the AWS console by clicking on the **S3 bucket** button.
- Alternatively, click on **AWS Credentials** to obtain programmatic access to the S3 bucket.
+ Alternatively, click on **AWS Credentials** to obtain programmatic access to the S3 bucket (only available if `modules.dataset.features.aws_actions` is set to `True` in the `config.json` used for deployment of data.all).
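Assuming the path referenced above, the corresponding fragment of `config.json` might look like the following sketch (surrounding keys omitted; note that JSON itself uses lowercase `true`):

```
{
    "modules": {
        "dataset": {
            "features": {
                "aws_actions": true
            }
        }
    }
}
```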

![overview](pictures/datasets/dataset_overview.png#zoom#shadow)

@@ -242,11 +257,6 @@ Catalog and then click on Synchronize as we did in step 3.
- Or directly, <a href="https://github.com/aws-samples/aws-glue-samples/tree/master/utilities/Hive_metastore_migration">migrating from Hive Metastore.</a>
- There are more for sure :)

- !!! abstract "Use data.all pipelines to register Glue tables"
-     data.all pipelines can be used to transform your data including the creation of Glue tables using
-     <a href="https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-write"> AWS Glue pyspark extension </a>.
-     Visit the <a href="pipelines.html">pipelines section</a> for more details.


### Folders

