From 75bc6f402dd3ceafb94742c4cd1097251004e7f9 Mon Sep 17 00:00:00 2001
From: Taylor Cox
Date: Mon, 14 Jan 2019 17:23:01 -0800
Subject: [PATCH] Introduce Metastore Migration Script and Tests to HDI Github

This changelist adds the necessary code and resources to launch the HDInsight
metastore migration script, as well as the metastore migration script tests.
A readme is also included.
---
 HiveMetastoreMigration/MigrateMetastore.sh    | 826 +++++++++++++++++
 .../MigrateMetastoreTests.py                  | 869 ++++++++++++++++++
 HiveMetastoreMigration/readme.md              | 130 +++
 .../test-resources/mockup-mandatory-arguments |  10 +
 .../test-resources/mockup-metastore-uris      |  85 ++
 5 files changed, 1920 insertions(+)
 create mode 100644 HiveMetastoreMigration/MigrateMetastore.sh
 create mode 100644 HiveMetastoreMigration/MigrateMetastoreTests.py
 create mode 100644 HiveMetastoreMigration/readme.md
 create mode 100644 HiveMetastoreMigration/test-resources/mockup-mandatory-arguments
 create mode 100644 HiveMetastoreMigration/test-resources/mockup-metastore-uris

diff --git a/HiveMetastoreMigration/MigrateMetastore.sh b/HiveMetastoreMigration/MigrateMetastore.sh
new file mode 100644
index 0000000..db1027a
--- /dev/null
+++ b/HiveMetastoreMigration/MigrateMetastore.sh
@@ -0,0 +1,826 @@
+#! /bin/bash -eux
+# Migrate locations captured in a Hive Metastore from A to B
+
+###############################################################################################################################################################################
+###############################################################################################################################################################################
+# Functions and globals.
+###############################################################################################################################################################################
+###############################################################################################################################################################################
+
+InfoString=$(cat <<-EOF
+
+`basename "$0"`: A script to edit the contents of a Hive metastore. Use this tool to bulk edit
+the storage type, account name, container name, or directory of metastore entries. Multiple
+attributes and multiple entries may be changed at one time.
+
+This script is able to accomplish tasks such as:
+
+"Move all wasb locations to wasbs",
+"Move the contents of storage accounts a, b, and c into container X of storage account Y",
+"In containers a or b found in storage accounts x, y or z, move the path of all tables found
+in /hive/warehouse to /warehouse/managed"
+
+While this command must be executed against an HDInsight cluster, the cluster need not be
+connected to the metastore in order to make changes to it.
+
+Usage: sudo -E bash `basename "$0"`
+
+
+Mandatory Arguments. Each one of these arguments must be provided:
+
+-u|--metastoreuser       The username credential used to access the Hive metastore.
+
+-p|--metastorepassword   Hive metastore password credential.
+
+-d|--metastoredatabase   The name of the metastore database itself.
+
+-s|--metastoreserver     The name of the SQL server that contains the Hive metastore.
+                         Provide only the name of the server: there is no need to
+                         provide the complete SQL database endpoint.
+
+-t|--target              A comma-separated list of target metastore tables to be
+                         migrated. This flag decides which metastore table will be
+                         affected. Valid entries are: SKEWED_COL_VALUE_LOC_MAP, DBS,
+                         SDS, FUNC_RU, or ALL to select all four tables. To move table
+                         locations, use SDS. To move database locations, use DBS.
If
+                         you are not sure what you are trying to do, it is not
+                         recommended to use this script.
+
+-q|--queryclient         The command this script will use to execute SQL queries.
+                         Currently supported clients are beeline and sqlcmd.
+
+-ts|--typesrc            A comma-separated list of the storage types corresponding to
+                         the entries to be migrated. Example: abfs,wasb. Valid entries
+                         are: wasb, wasbs, abfs, abfss, adl. Use the character '*' with
+                         quotes to select all storage types. If --typesrc includes '*' or
+                         adl, then --adlaccounts must also be set.
+
+-cs|--containersrc       A comma-separated list of the containers corresponding to the
+                         entries to be migrated. Example: c1,c2,c3. Use the character '*'
+                         with quotes to select all containers. No more than 10 containers
+                         may be manually selected in one execution.
+
+-as|--accountsrc         A comma-separated list of the account names corresponding to
+                         the entries to be migrated. Example: a1,a2,a3. Provide only
+                         the names of the accounts: there is no need to provide the
+                         complete account endpoint. Use the character '*' with quotes
+                         to select all account names. No more than 10 accounts may be
+                         manually selected in one execution. If adl is the only entry
+                         in --typesrc, this flag can be skipped.
+
+-ps|--pathsrc            A comma-separated list of the paths corresponding to the
+                         entries to be migrated. Example:
+                         warehouse,hive/tables,ext/tables/hive. Provide only the
+                         names of the paths: there is no need to provide the '/'
+                         character before and after the path. Use the character '*'
+                         with quotes to select all paths. No more than 10 paths may
+                         be manually selected in one execution.
+
+
+Optional Arguments. Any combination of these arguments is a valid input.
+Note: leaving all inputs blank will result in no change:
+
+-adls|--adlaccounts      A comma-separated list of the Azure Data Lake Storage Gen 1
+                         account names corresponding to the entries to be migrated.
+                         This flag operates identically to --accountsrc.
ADL accounts
+                         must be specified manually since ADL accounts can have the same
+                         name as other storage accounts. Use the character '*' with quotes
+                         to select all ADL account names. No more than 10 ADL accounts may be
+                         manually selected in one execution.
+
+-e|--environment         The name of the Azure Environment this script is to be executed on.
+                         Options for this flag are china, germany, usgov and default. If this
+                         flag is omitted, the default option will be used. Note that Azure
+                         Data Lake as a source or destination type is only supported when using
+                         the default Azure environment. If --typesrc is set to '*' and --environment
+                         is not default or blank, ADL accounts will be ignored.
+
+-td|--typedest           A string corresponding to the storage type that all matches
+                         will be moved to. Valid entries are: wasb, wasbs, abfs, abfss,
+                         adl. If this value is left blank or omitted, no change will
+                         be made to the storage type.
+
+-ad|--accountdest        A string corresponding to the account that all matches will
+                         be moved to. If this value is left blank or omitted, no
+                         change will be made to the account. If --accountdest is set,
+                         --typedest must also be set.
+
+-cd|--containerdest      A string corresponding to the container that all matches will
+                         be moved to. If this value is left blank or omitted, no change
+                         will be made to the container.
+
+-pd|--pathdest           A string corresponding to the path that all matches will be
+                         moved to. If this value is left blank or omitted, no change
+                         will be made to the path.
+
+-l|--liverun             This argument is a flag, not a parameter. If --liverun is used,
+                         the flag is not to be accompanied by a value. --liverun executes
+                         this script 'live', meaning that the specified metastore will be
+                         written to as specified by the other parameters passed. Omit this
+                         flag to launch a dry run of the script (default).
+
+-h|--help                Display this message.
+
+The 'source' flags work together such that a regex-like table location is built.
Example: +[abfs,wasb]://[c1,c2,c3]@[a1,a2,a3]/[p1,p2,p3]/. Every location that matches the pattern +formed by the source flags will be converted into the location constructed by the destination +flags. The destination flags specify which location attributes (type, account, container, path) +to change, and what to change those attributes to. This script does NOT require Ambari credentials. + +Note: It is strongly recommended to redirect stdout to a file when executing this script, as there +is a large amount of text that will be logged (especially in cases where the target metastore is at-scale) + +EOF +) + +# Exit codes for easier testing +EXIT_SQL_FAIL=25 +EXIT_BAD_ARGS=50 +EXIT_NO_CHANGE=75 +EXIT_DRY_RUN=100 + +# Arg Batch Limit (Does not apply to wildcards. This is only to upper-bound the complexity of generating the WHERE clause later) +MAX_ARG_COUNT=10 + +# Template to generate the expanded WHERE clause +WhereClauseString='*://*@*/*/%' # Type://Container@Account/Directory/Table +# Template to execute replacements against URI attributes stored as columns in a temp table +UpdateTemplate="update LocationUpdate set * = ('#'); " +# String to sequence of filled-in update templates +UpdateCommands='' + +# Map for Endpoints and Environments so different storage accounts and different azure clouds are supported +# ADL v1 is a special case: Only supported in public cloud, and its endpoint is '.azuredatalakestore.net'. There is no '.core.' 
+# See https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-connectivity-from-vnets + +declare -A EndpointMap +EndpointMap[wasb]=blob +EndpointMap[wasbs]=blob +EndpointMap[abfs]=dfs +EndpointMap[abfss]=dfs +EndpointMap[adl]=azuredatalakestore + +# Storage domain endpoints, from https://docs.microsoft.com/en-us/azure/storage/common/storage-powershell-independent-clouds +declare -A EnvironmentMap +EnvironmentMap[china]=cn +EnvironmentMap[default]=net +EnvironmentMap[germany]=de +EnvironmentMap[usgov]=net +EnvironmentMap[adl]=net + +# Storage domain strings excluding final domain, from same doc +declare -A StorageDomainMap +StorageDomainMap[china]=.core.chinacloudapi +StorageDomainMap[default]=.core.windows +StorageDomainMap[germany]=.core.cloudapi +StorageDomainMap[usgov]=.core.usgovcloudapi +# ADL has no domain map entry because the domain for ADL is blank: azuredatalakestore.net + +# These triples store the relevant metastore table and the columns of interest: +# (table_name, uri_column, primary_key) +SDSColumns=(SDS location sd_id) +DBSColumns=(DBS db_location_uri db_id) +FUNCRUColumns=(FUNC_RU resource_uri func_id) +SKEWEDCOLVALUELOCMAPColumns=(SKEWED_COL_VALUE_LOC_MAP location sd_id) + +# Store position arguments to be fed into parameters +POSITIONAL=() +# This will be one of the triples specified above depending on the target set (SDS, DBS, FUNC_RU, SKEWED_COL_VALUE_LOC_MAP) +TargetAttrs=() +# Expanded WHERE clause string with the Type attribute filled in +WhereClauseStringsTypeSet=() +# Ditto for Container also +WhereClauseStringsTypeContainerSet=() +# Ditto for Account also +WhereClauseStringsTypeContainerAccountSet=() +# Ditto for Path also. This will include every WHERE clause URIs must match. example: WHERE location like X, Y, Z, ... 
WhereClauseStringsTypeContainerAccountPathSet=()
+
+# Supported SQL Clients and the commands to execute migration commands with them
+# Both templates take the same printf arguments: server, database, username, password, script file
+declare -A sqlClientMap
+sqlClientMap[beeline]="beeline --outputformat=csv2 -u 'jdbc:sqlserver://%s.database.windows.net;database=%s' -n '%s' -p '%s' -f '%s'"
+sqlClientMap[sqlcmd]="sqlcmd -s\",\" -W -S %s.database.windows.net -d %s -U %s -P %s -i %s"
+
+launchSqlCommand()
+{
+    scriptFile=$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 1)
+    scriptFile="$scriptFile-migrationcommand"
+    touch ${scriptFile}
+    echo ${6} >> ${scriptFile}
+
+    command=$(printf "${sqlClientMap["$1"]}" "$2" "$3" "$4" "$5" "${scriptFile}")
+    echo ${command}
+    # Capture the exit status explicitly: the script runs with -e, so a bare
+    # failing eval would exit before the status could be reported, and $? would
+    # otherwise be reset by the test itself.
+    exitCode=0
+    eval "${command}" || exitCode=$?
+    if [ ${exitCode} -ne 0 ]
+    then
+        echo "SQL command failed with exit code ${exitCode}"
+        echo "Command executed was ${command}"
+        echo "Closing."
+        exit ${EXIT_SQL_FAIL}
+    fi
+    rm -f ${scriptFile}
+}
+
+usage()
+{
+    echo "$InfoString"
+    echo "Unexpected Argument: $1"
+    exit ${EXIT_BAD_ARGS}
+}
+
+constructCloudEndpointFilter()
+{
+    CloudEndptWhereClause="( ${1} like ('%${3}.${4}%') "
+    # String comparison: -eq is for integers and would always fail here
+    if [ "${2}" = "default" ]
+    then
+        CloudEndptWhereClause="$CloudEndptWhereClause or ${1} like ('%azuredatalakestore.${4}%') "
+    fi
+    CloudEndptWhereClause="$CloudEndptWhereClause )"
+    echo "$CloudEndptWhereClause"
+}
+
+constructURIExtractString()
+{
+    # Take in the target attribute triple (table_name, uri_column, primary_key) and the endpoint string
+    AttributeStore=($1)
+
+    EndpointValue=${2}
+    ExtractionString=$(cat <<-EOF
+
+insert into LocationUpdate
+(
+    FK_ID,
+    SCHEMATYPE,
+    CONTAINER,
+    ACCOUNT,
+    ENPT,
+    ROOTPATH
+)
+select ${AttributeStore[2]},
+substring
+(
+    ${AttributeStore[1]},
+    0,
+    CHARINDEX('://', ${AttributeStore[1]})
+),
+substring
+(
+    ${AttributeStore[1]},
+    CHARINDEX('://', ${AttributeStore[1]}) + len('://'),
+    CHARINDEX('@', ${AttributeStore[1]}) - ( CHARINDEX('://', ${AttributeStore[1]}) + len('://') )
+),
+substring
+(
+    ${AttributeStore[1]},
+    CHARINDEX('@',
${AttributeStore[1]}) + len('@'), + CHARINDEX('.', ${AttributeStore[1]}) - ( CHARINDEX('@', ${AttributeStore[1]}) + len('@') ) +), +substring +( + ${AttributeStore[1]}, + CHARINDEX('.', ${AttributeStore[1]}) + len('.'), + CHARINDEX('.${EndpointValue}', ${AttributeStore[1]}) - ( CHARINDEX('.', ${AttributeStore[1]}) + len('.') ) +), +substring +( + ${AttributeStore[1]}, + CHARINDEX('.${EndpointValue}', ${AttributeStore[1]}) + len('.${EndpointValue}'), + (len(${AttributeStore[1]}) - CHARINDEX('/', reverse(${AttributeStore[1]})) + 1) - (CHARINDEX('.${EndpointValue}', ${AttributeStore[1]}) + len('.${EndpointValue}') - 1) +) + +from ${AttributeStore[0]} WHERE +${3} and (${4}); + +EOF +) + + echo "$ExtractionString" +} + +constructMigrationImpactDisplayString() +{ + AttributeStore=($1) + MigrationImpactString=$(cat <<-EOF + +select LocationUpdate.FK_ID as ID, ${AttributeStore[1]} as oldLocation, +( + SCHEMATYPE + + '://' + + CONTAINER + + '@' + + ACCOUNT + + '.' + + ENPT + + '.${EnvironmentMap[${2}]}' + + ROOTPATH + + right( ${AttributeStore[1]}, charindex('/', reverse(${AttributeStore[1]}) ) -1) +) as newLocation +from ${AttributeStore[0]}, LocationUpdate +where ${AttributeStore[2]} = LocationUpdate.FK_ID; + +EOF +) + + echo "$MigrationImpactString" +} + +constructFinalMigrationString() +{ + AttributeStore=($1) + FinalMigrationString=$(cat <<-EOF + +update ${TargetAttrs[0]} set ${TargetAttrs[1]} = +( + SCHEMATYPE + + '://' + + CONTAINER + + '@' + + ACCOUNT + + '.' 
+ + ENPT + + '.${EnvironmentMap[${2}]}' + + ROOTPATH + + right( ${TargetAttrs[1]}, charindex('/', reverse(${TargetAttrs[1]}) ) -1) +) +from ${TargetAttrs[0]}, LocationUpdate +where ${TargetAttrs[2]} = LocationUpdate.FK_ID; + +EOF +) + + echo "$FinalMigrationString" +} + +constructMigrationTableCreateString() +{ + AttributeStore=($1) + MigrationTableCreateString=$(cat <<-EOF + +drop table if exists LocationUpdate; +create table LocationUpdate +( + SCHEMATYPE nvarchar(1024), + CONTAINER nvarchar(1024), + ACCOUNT nvarchar(1024), + ENPT nvarchar(1024), + ROOTPATH nvarchar(1024), + FK_ID bigint +); + +EOF +) + + echo "$MigrationTableCreateString" +} + +constructInvalidArgumentString() +{ + echo "${1} is not a valid entry for ${2}." +} +############################################################################################################################################################################### +############################################################################################################################################################################### +# Read inputs. 
+############################################################################################################################################################################### +############################################################################################################################################################################### + +while [ $# -gt 0 ] +do + key="$1" + + case $key in + -u|--metastoreuser) + Username="$2" + shift + shift + ;; + -p|--metastorepassword) + Password="$2" + shift + shift + ;; + -s|--metastoreserver) + Server="$2" + shift + shift + ;; + -d|--metastoredatabase) + Database="$2" + shift + shift + ;; + -q|--queryclient) + Client="$2" + shift + shift + ;; + -t|--target) + Target="$2" + shift + shift + ;; + # These have to be parsed for the separating comma + -ts|--typesrc) + TypeSrc="$2" + shift + shift + ;; + -as|--accountsrc) + AccountSrc="$2" + shift + shift + ;; + -cs|--containersrc) + ContainerSrc="$2" + shift + shift + ;; + -ps|--pathsrc) + RootpathSrc="$2" + shift + shift + ;; + -td|--typedest) + TypeDest="$2" + shift + shift + ;; + -adls|--adlaccounts) + ADLAccountSrc="$2" + shift + shift + ;; + -e|--environment) + AzureEnv="$2" + shift + shift + ;; + -ad|--accountdest) + AccountDest="$2" + shift + shift + ;; + -cd|--containerdest) + ContainerDest="$2" + shift + shift + ;; + -pd|--pathdest) + RootpathDest="$2" + shift + shift + ;; + -l|--liverun) + Liverun=true + shift + ;; + -h|--help) + usage "Help flag entered. Closing." + ;; + + *) + usage "$1 is an unsupported flag. Closing." 
+ ;; +esac +done +set +o nounset +set -- "${POSITIONAL[@]}" + +############################################################################################################################################################################### +############################################################################################################################################################################### +# Now we have all the inputs. Run parameter validation. +############################################################################################################################################################################### +############################################################################################################################################################################### + +# Translate wildcards from argument '*' to character % for SQL parity +SrcFlags=("TypeSrc" "AccountSrc" "ADLAccountSrc" "ContainerSrc" "RootpathSrc") + +for i in "${!SrcFlags[@]}"; do + + if [ ! -z "${!SrcFlags[$i]}" -a "${!SrcFlags[$i]}" = "*" ] + then + eval "${SrcFlags[$i]}=%" + fi +done + +# Check Mandatory Flags Present +argsMissing=false +ArgumentNames=("Client" "Username" "Password" "Server" "Database" "Target" "TypeSrc" "ContainerSrc" "RootpathSrc") + +for i in "${!ArgumentNames[@]}"; do + + if [ -z "${!ArgumentNames[$i]}" ] + then + echo "${ArgumentNames[$i]} missing from command line arguments." + argsMissing=true + fi +done + +if [ "${argsMissing}" = true ] +then + usage "At least one mandatory argument missing." +fi + +# Check ADL list not left empty if ADL or % in typesrc +if [ -z "${ADLAccountSrc}" ] && [ "${TypeSrc}" = "%" -o ! -z "$(echo "${TypeSrc}" | grep -i adl)" ] +then + usage "ADL cannot be specified as a source type without also specifying ADL account names to be moved. Use '%' to move all ADL accounts." 
+fi + +# Parse Inputs +IFS=',' read -r -a TypeSrcVals <<< "$TypeSrc" +IFS=',' read -r -a AccountSrcVals <<< "$AccountSrc" +IFS=',' read -r -a ContainerSrcVals <<< "$ContainerSrc" +IFS=',' read -r -a RootpathSrcVals <<< "$RootpathSrc" +IFS=',' read -r -a ADLAccountSrcVals <<< "$ADLAccountSrc" +IFS=',' read -r -a TargetVals <<< "$Target" + +# Make sure that AccountSrc is present unless the only TypeSrc is ADL +if [ "${#TypeSrcVals[@]}" -gt 1 ] || [ ! "${TypeSrcVals[0]}" = "adl" ] +then + if [ -z "${AccountSrc}" ] + then + usage "Account types other than ADL set but AccountSrc missing from command line arguments." + fi +fi + +# Make sure no invalid target tables are specified +for entry in "${TargetVals[@]}" +do + if [ ! "${entry,,}" = "dbs" ] && [ ! "${entry,,}" = "sds" ] && [ ! "${entry,,}" = "func_ru" ] && [ ! "${entry,,}" = "skewed_col_value_loc_map" ] + then + usage "$(constructInvalidArgumentString "${entry}" "--target" ) Targets must be one or more of DBS, SDS, FUNC_RU or SKEWED_COL_VALUE_LOC_MAP" + fi +done + +# Make sure that there are not too many input values +if [ "${#AccountSrcVals[@]}" -gt "$MAX_ARG_COUNT" ] || [ "${#ContainerSrcVals[@]}" -gt "$MAX_ARG_COUNT" ] || [ "${#RootpathSrcVals[@]}" -gt "$MAX_ARG_COUNT" ] || [ "${#ADLAccountSrcVals[@]}" -gt "$MAX_ARG_COUNT" ] +then + usage "Too many entries specified. Closing." +fi + +# Make sure types are valid for source +for item in "${TypeSrcVals[@]}" +do + if [ -z "${EndpointMap[${item}]}" -a ! "${item}" = "%" ] + then + usage "$(constructInvalidArgumentString "${TypeSrc}" "--typesrc" ) Storage types must be one or more of $(echo ${!EndpointMap[@]})." + fi +done + +if [ -z "${AzureEnv}" ] +then + AzureEnv="default" +fi + +# Make sure environment is valid +if [ -z "${EnvironmentMap[${AzureEnv}]}" ] +then + usage "$(constructInvalidArgumentString "${AzureEnv}" "--environment" ) Azure Environment must be one of $(echo ${!EnvironmentMap[@]})." +fi + +# Make sure target type is valid +if [ ! 
-z "${TypeDest}" ] +then + if [ -z "${EndpointMap[${TypeDest}]}" ] + then + usage "$(constructInvalidArgumentString "${TypeDest}" "--typedest" ). Storage types must be one or more of $(echo ${!EndpointMap[@]})." + fi +fi + +# Check Query Client is supported +if [ -z "${sqlClientMap[${Client}]}" ] +then + usage "$(constructInvalidArgumentString "${Client}" "--queryclient" ) Query Client must be one of $(echo ${!sqlClientMap[@]})." +fi + +# Check account name not missing type +if [ -z "${TypeDest}" -a ! -z "${AccountDest}" ] +then + usage "Destination account cannot be specified without destination account type" +fi + +# Make sure ADL not specified when using a non-default cloud +if [ ! "${AzureEnv}" = "default" ] +then + if [ "${TypeDest}" = "adl" ] + then + usage "Cannot include Azure Data Lake as the target type when working in non-default cloud: ${AzureEnv}." + fi + + # Is ADL in the array of source types? + if [ ! -z "$(echo "${TypeSrcVals[@]}" | grep -i adl)" ] || [ "${TypeSrcVals[0]}" = "%" ] + then + usage "Cannot include Azure Data Lake as a source type when working in non-default cloud: ${AzureEnv}" + fi +fi + +# If all dests are empty, exit +if [ -z "${TypeDest}" -a -z "${AccountDest}" -a -z "${ContainerDest}" -a -z "${RootpathDest}" ] +then + echo "At least one of --typedest, --accountdest, --containerdest or --pathdest must be set." + echo "No destination attributes set. Nothing to do." + exit ${EXIT_NO_CHANGE} +fi + +############################################################################################################################################################################### +############################################################################################################################################################################### +# Parameters are now validated. 
Here the WHERE clause is constructed for finding URIs that will be migrated +############################################################################################################################################################################### +############################################################################################################################################################################### + + +if [ ! -z "${TypeDest}" ] +then + # set the correct protocol prefix that matches the storage type + UpdateCommands="$UpdateCommands $(sed "s/*/SCHEMATYPE/g; s/#/${TypeDest}/g" <<< $UpdateTemplate)" + # set the correct endpoint that matches the storage type, excluding the final domain + AzureEnvDest="${AzureEnv}" + if [ "${TypeDest}" = "adl" ] + then + AzureEnvDest=adl + fi + + UpdateCommands="$UpdateCommands $(sed "s/*/ENPT/g; s/#/${EndpointMap[${TypeDest}]}${StorageDomainMap[${AzureEnvDest}]}/g;" <<< $UpdateTemplate)" +fi + +for item in "${TypeSrcVals[@]}" +do + WhereClauseStringsTypeSet+=("$(sed "s/*/$item/1" <<< $WhereClauseString)") +done + +if [ ! -z "${ContainerDest}" ] +then + # ://%@ -> ://dest_container@ + UpdateCommands="$UpdateCommands $(sed "s/*/CONTAINER/g; s/#/${ContainerDest}/g" <<< $UpdateTemplate)" +fi + +for TypeSetTemplate in "${WhereClauseStringsTypeSet[@]}" +do + for item in "${ContainerSrcVals[@]}" + do + WhereClauseStringsTypeContainerSet+=("$(sed "s/*/$item/1" <<< $TypeSetTemplate)") + done +done + +if [ ! -z "${AccountDest}" ] +then + # @%. -> @dest_account. 
+ UpdateCommands="$UpdateCommands $(sed "s/*/ACCOUNT/g; s/#/${AccountDest}/g" <<< $UpdateTemplate)" +fi + +for TypeContainerSetTemplate in "${WhereClauseStringsTypeContainerSet[@]}" +do + if [ -z "$(echo "${TypeContainerSetTemplate}" | grep -i "adl://")" ] + then # This is a non-ADL template + for item in "${AccountSrcVals[@]}" + do + WhereClauseStringsTypeContainerAccountSet+=("$(sed "s/*/$item\.%\.${EnvironmentMap[${AzureEnv}]}/1" <<< $TypeContainerSetTemplate)") + done + else + for item in "${ADLAccountSrcVals[@]}" # Use a different list of source entries and a different endpoint style if the template is ADL + do + WhereClauseStringsTypeContainerAccountSet+=("$(sed "s/*/$item\.%\.${EnvironmentMap[adl]}/1" <<< $TypeContainerSetTemplate)") + done + fi +done + +if [ ! -z "${RootpathDest}" ] +then + # Path update is a little different: Replace longest left-hand subsequence. Root path migration that preserves hierarchy + if [ ! "${RootpathSrcVals[0]}" = "%" ] + then + SrcPathValuesStr=$( printf "('/%s/'), " "${RootpathSrcVals[@]}" | sed 's/,.$//' ) + PathReplaceCommand=$(cat <<-EOF + drop table if exists srcpaths; + create table srcpaths (srcpathname nvarchar(1024)); + insert into srcpaths values $SrcPathValuesStr; + + update LocationUpdate set ROOTPATH = Replace + ( + ROOTPATH, + ( + select ISNULL( (select top 1 srcpathname from srcpaths where ROOTPATH like (srcpathname + '%') order by srcpathname desc), '') + ), + '/$RootpathDest/' + ); + drop table srcpaths; +EOF +) + UpdateCommands="$UpdateCommands $PathReplaceCommand" + else + UpdateCommands="$UpdateCommands $(sed "s/*/ROOTPATH/g; s/#/\/${RootpathDest}\//g" <<< $UpdateTemplate)" + fi +fi + +for TypeContainerAccountSetTemplate in "${WhereClauseStringsTypeContainerAccountSet[@]}" +do + for item in "${RootpathSrcVals[@]}" + do + WhereClauseStringsTypeContainerAccountPathSet+=("$(sed "s:*:$item:1" <<< $TypeContainerAccountSetTemplate)") + done +done + 
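Illustrative aside (not part of the patch): the loops above expand the source flags into a cross product of SQL LIKE patterns, filling the `*://*@*/*/%` template one attribute at a time. A minimal Python sketch of that expansion follows; the helper name `expand_where_patterns` and its arguments are invented for this illustration.

```python
from itertools import product

# Template from the script: Type://Container@Account/Directory/Table
TEMPLATE = "*://*@*/*/%"

def expand_where_patterns(types, containers, accounts, paths, tld="net"):
    """Fill the template one attribute at a time, mirroring the script's
    nested loops over TypeSrcVals, ContainerSrcVals, AccountSrcVals and
    RootpathSrcVals."""
    patterns = []
    for t, c, a, p in product(types, containers, accounts, paths):
        pattern = TEMPLATE
        pattern = pattern.replace("*", t, 1)               # storage type
        pattern = pattern.replace("*", c, 1)               # container
        pattern = pattern.replace("*", f"{a}.%.{tld}", 1)  # account with wildcarded endpoint
        pattern = pattern.replace("*", p, 1)               # root path
        patterns.append(pattern)
    return patterns

# The script then ORs every pattern into a single WHERE clause:
pats = expand_where_patterns(["wasb", "abfs"], ["c1"], ["a1"], ["hive/warehouse"])
where = " OR ".join(f"location like ('{p}')" for p in pats)
```

With the inputs shown, each pattern has the shape `wasb://c1@a1.%.net/hive/warehouse/%`, matching the templates the shell loops emit.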
+############################################################################################################################################################################### +############################################################################################################################################################################### +# Launch SQL commands +# 1. Create workbench table for migration +# 2. Insert items into workbench table that match expanded WHERE clause. The matches are parsed for their attributes, which are inserted into the appropriate column. +# 3. Update column entries in the workbench table. +# 3b. The path entry is updated by replacing the longest (from the root) source path found in the entry. That way the hierarchy of subfolders is preserved. +# 4. Update the original table (sds, dbs, etc) by putting the location string back together and matching the ID column +############################################################################################################################################################################### +############################################################################################################################################################################### + +echo Username = "${Username}" +echo Password = "${Password}" +echo Server = "${Server}" +echo Database = "${Database}" +echo Azure Environment = "${AzureEnv}" +echo Running Live\? 
"${Liverun}" + +echo Target\(s\) = "${TargetVals}" +echo Source Type\(s\) = "${TypeSrc}" +echo Account\(s\) = "${AccountSrc}" +echo Container\(s\) = "${ContainerSrc}" +echo Source Rootpath\(s\) = "${RootpathSrc}" + +echo Final Type \(if any\) = "${TypeDest}" +echo Final Account \(if any\) = "${AccountDest}" +echo Final Container \(if any\) = "${ContainerDest}" +echo Final Rootpath \(if any\) = "${RootpathDest}" + +# Expand out Targets if "ALL" was used +if [ "${TargetVals[0],,}" = "all" ] +then + TargetVals=(DBS SDS FUNC_RU SKEWED_COL_VALUE_LOC_MAP) +fi + +# Launch the sequence of SQL commands via beeline for each target table specified +for entry in "${TargetVals[@]}" +do + if [ "${entry,,}" = "dbs" ] + then + TargetAttrs=("${DBSColumns[@]}") + elif [ "${entry,,}" = "sds" ] + then + TargetAttrs=("${SDSColumns[@]}") + elif [ "${entry,,}" = "func_ru" ] + then + TargetAttrs=("${FUNCRUColumns[@]}") + elif [ "${entry,,}" = "skewed_col_value_loc_map" ] + then + TargetAttrs=("${SKEWEDCOLVALUELOCMAPColumns[@]}") + fi + + EndpointClause=$(constructCloudEndpointFilter "${TargetAttrs[1]}" "${AzureEnv}" "${StorageDomainMap[${AzureEnv}]}" "${EnvironmentMap[${AzureEnv}]}") + + WhereClauseString=$(printf " OR ${TargetAttrs[1]} like ('%s')" "${WhereClauseStringsTypeContainerAccountPathSet[@]}" | cut -c 5-) + MigrationTableCreateString=$(constructMigrationTableCreateString "$(echo ${TargetAttrs[@]})" ) + ExtractionString=$(constructURIExtractString "$(echo ${TargetAttrs[@]})" "${EnvironmentMap[${AzureEnv}]}" "${EndpointClause}" "${WhereClauseString}" ) + DisplayMigrationImpactString=$(constructMigrationImpactDisplayString "$(echo ${TargetAttrs[@]})" "${AzureEnv}" ) + ExecuteFinalMigrationString=$(constructFinalMigrationString "$(echo ${TargetAttrs[@]})" "${AzureEnv}" ) + CleanupMigrationTableString="drop table LocationUpdate;" + + echo "Generated Migration SQL Script:" + + echo "$MigrationTableCreateString" + echo "$ExtractionString" + echo "$UpdateCommands" + echo 
"$DisplayMigrationImpactString"
+    echo "$ExecuteFinalMigrationString"
+    echo "$CleanupMigrationTableString"
+
+    echo "Launching Migration SQL Commands:"
+
+    echo "Creating temporary migration table..."
+    launchSqlCommand $Client $Server $Database $Username $Password "${MigrationTableCreateString}"
+
+    echo "Extracting matching URIs to temporary table..."
+    launchSqlCommand $Client $Server $Database $Username $Password "${ExtractionString}"
+
+    echo "Modifying temporary table per migration parameters..."
+    launchSqlCommand $Client $Server $Database $Username $Password "${UpdateCommands}"
+
+    echo "Affected entries of ${TargetAttrs[0]} will have the following values pre- and post-migration..."
+    launchSqlCommand $Client $Server $Database $Username $Password "${DisplayMigrationImpactString}"
+    echo ""
+
+    if [ "$Liverun" = true ]
+    then
+        echo "Script execution type set to LIVE. Writing migration results to table: ${TargetAttrs[0]}..."
+        launchSqlCommand $Client $Server $Database $Username $Password "${ExecuteFinalMigrationString}"
+    else
+        echo "LIVE flag not set. Migration execution skipped."
+    fi
+
+    echo "Deleting temporary migration table..."
+    launchSqlCommand $Client $Server $Database $Username $Password "${CleanupMigrationTableString}"
+done
+
+echo "Migration script complete!"
+if [ ! "$Liverun" = true ]
+then
+    echo "Closing with dry run exit code"
+    exit ${EXIT_DRY_RUN}
+fi
diff --git a/HiveMetastoreMigration/MigrateMetastoreTests.py b/HiveMetastoreMigration/MigrateMetastoreTests.py
new file mode 100644
index 0000000..b728aa0
--- /dev/null
+++ b/HiveMetastoreMigration/MigrateMetastoreTests.py
@@ -0,0 +1,869 @@
+#!/usr/bin/env python3
+# A python script that tests MigrateMetastore.sh
+
+"""
+This python script is meant to be used alongside MigrateMetastore.sh for validation and development. Therefore, this script should be executed from the headnode of an HDInsight cluster.
+The script takes in a SQL DB to run the tests against.
Some tables will be created and deleted on the database, and some sample data will be read from and written to those tables. + +The test suite uses pyodbc, a setup guide for which can be found here: https://www.microsoft.com/en-us/sql-server/developer-get-started/python/ubuntu/ +The test suite also uses python 3.6, which can be installed if needed: https://askubuntu.com/questions/865554/how-do-i-install-python-3-6-using-apt-get + +Arguments: +- Relative path to migration script +- SQL server +- SQL database +- SQL username +- SQL password +""" + +############################################################################################################################################################################### +############################################################################################################################################################################### +# Imports, Variables, Classes, Functions +############################################################################################################################################################################### +############################################################################################################################################################################### + +from collections import namedtuple +from pyodbc import connect +from os import path +from optparse import OptionParser +from re import escape +from re import search +from re import sub +from select import poll +from subprocess import PIPE +from subprocess import Popen +from sys import argv +from time import sleep + +TOTAL_REQ_ARGS = 7 + +# Exit codes for testing taken from script source +EXIT_SUCCESS = 0 +EXIT_BEELINE_FAIL = 25 +EXIT_BAD_ARGS = 50 +EXIT_NO_CHANGE = 75 +EXIT_DRY_RUN = 100 + +MAX_SRC_ARGS = 10 + +ASSETS_DIR = "test-resources" +ARG_FILE = "mockup-mandatory-arguments" +URI_FILE = "mockup-metastore-uris" + +STR_TYPE = "varchar(4000)" +BIGINT_TYPE = "bigint" +SET_PRMK = 
"primary key" + +INPUT_TEST_STR = "InputTests" +EXEC_TEST_STR = "ExecTests" + +ARG_WILDCARD = "*" + +MOCKUP_FUNC_RU = "FUNC_RU" +MOCKUP_SDS = "SDS" # This value must correspond to the entry in resources/mockup-mandatory-arguments +MOCKUP_DBS = "DBS" +MOCKUP_SKEWED_COL_VALUE_LOC_MAP = "SKEWED_COL_VALUE_LOC_MAP" + +FlagsMap = {} +FlagsMap["--metastoreserver"] = "Server" +FlagsMap["--metastoredatabase"] = "Database" +FlagsMap["--metastoreuser"] = "Username" +FlagsMap["--metastorepassword"] = "Password" +FlagsMap["--typesrc"] = "TypeSrc" +FlagsMap["--containersrc"] = "ContainerSrc" +FlagsMap["--accountsrc"] = "AccountSrc" +FlagsMap["--pathsrc"] = "RootpathSrc" +FlagsMap["--target"] = "Target" +FlagsMap["--queryclient"] = "Client" + +Match = namedtuple('Match', ['index', 'entry', 'transformedentry']) + +class MigrationScriptCommandFailedException(Exception): + pass + +def dctToList(dct): + dctlist = [] + for key in dct: + temp = [key,dct[key]] + dctlist.extend(temp) + return dctlist + +def getArgs(): + parser = OptionParser() + + parser.add_option("-m", "--migrationScriptPath", dest = "scriptPath", help = "Relative path to the metastore migration script") + + parser.add_option("-t", "--testSuites", dest = "suites", help = "A comma-separated list of the test suites to run: {0}, {1}, or All".format(INPUT_TEST_STR, EXEC_TEST_STR)) + + parser.add_option("-s", "--server", dest = "server", help = "Address of the SQL server to be used for tests. Not used if only testing behavior of inputs.") + + parser.add_option("-d", "--database", dest = "database", help = "Name of the database within the server to use for tests. Not used if only testing behavior of inputs.") + + parser.add_option("-u", "--username", dest = "username", help = "Username to log into the test server. Not used if only testing behavior of inputs.") + + parser.add_option("-p", "--password", dest = "password", help = "Password to log into the test server. 
Not used if only testing behavior of inputs.") + + parser.add_option("-r", "--driver", dest = "driver", help = "The name of the ODBC driver to use to connect to the server. Not used if only testing behavior of inputs.") + + parser.add_option("-c", "--cleanup", action = "store_true", dest = "cleanupOnExit", default = False, + help = "Set this flag to make sure tables created by the tests are dropped. Tables will also be dropped in case of test failure.") + + (options, args) = parser.parse_args() + + # Validate + assert(options.scriptPath is not None), "--migrationScriptPath must be specified. --help for more instructions." + assert(options.suites is not None), "--testSuites must be specified. --help for more instructions." + + if options.suites.lower() == 'all': + options.suites = "{0},{1}".format(INPUT_TEST_STR, EXEC_TEST_STR) + + dbParams = ['server', 'database', 'username', 'password', 'driver' ] + assert( not( EXEC_TEST_STR in options.suites and any( options.__dict__[item] is None for item in dbParams ) ) ), \ + "The following arguments are missing since {0} is included in tests to run: {1}".format(EXEC_TEST_STR, ["--"+item for item in dbParams if options.__dict__[item] is None]) + + return options + +def listEqUnordered(l1, l2): + sameLen = (len(l1) == len(l2) ) + sameItems = (sorted(l1) == sorted(l2)) + return sameLen and sameItems + +def rootpathInURI(path, uri): + entryAfterScheme = uri[uri.find("://") + len("://") : len(uri)] + uriPathComponent = entryAfterScheme[entryAfterScheme.find("/") : len(entryAfterScheme)] + return uriPathComponent.startswith("/"+path+"/") + +def runCommand(cmd, args=None, switches=None): + fault = False + stdout = "" + stderr = "" + exitCode = None + execution = [cmd] + if args: + # Args is a dict of {argname: value} + execution.extend(dctToList(args)) + if switches: + # Switches is an array of flags that do not require corresponding values + execution.extend(switches) + try: + process = Popen( + execution, + stderr = PIPE, + 
stdout = PIPE, + encoding = 'utf8' # Communicating with the script requires utf-8, which requires python 3.6 + ) + + stdout, stderr = process.communicate() + exitCode = process.returncode + except Exception as exc: + print("Caught exception when executing {0}: {1}".format(cmd, exc)) + fault = True + + if fault: + raise MigrationScriptCommandFailedException( + "\n\nError executing {0} with arguments: '{1}'.\nStdout:\n{2}Stderr:\n{3}\nCommand gave exit code {4}. Closing.".format(cmd, args, stdout, stderr, exitCode) + ) + + return exitCode, stdout + +############################################################################################################################################################################### +############################################################################################################################################################################### +# Shared test code +############################################################################################################################################################################### +############################################################################################################################################################################### + +class MigrationScriptTestSuite: + + def __init__( + self, + Name, + ScriptPath, + Server = None, + DB = None, + User = None, + Pass = None, + Driver = None, + CleanupOnExit = None + ): + self.TablesCreated = False + for entry, value in locals().items(): + setattr(self, entry, value) + + self.BaseArguments = dict(item.split() for item in open(path.join(ASSETS_DIR, ARG_FILE), "r").read().splitlines()) + + def __enter__(self): + return self + + def __exit__(self, exc_type, exc_value, traceback): + exitMsg="Succeeded" if exc_type == None else "Failed" + print("Migration Script test suite: {0}. 
result: {1}!".format(self.Name, exitMsg)) + + def getScriptArgumentsWithOverrides(self, argOverrides=None, argDeletions=None): + # Use BaseArguments as start point and add/edit/delete as needed. Both args are arrays of 2-tuples + cmdArgs = dict(self.BaseArguments) + + if argOverrides: + for override in argOverrides: + flag, newValue = override + cmdArgs[flag] = newValue + + if argDeletions: + for deletion in argDeletions: + if deletion in cmdArgs: + del cmdArgs[deletion] + + return cmdArgs + + def checkExitCodeEqual(self, args, expectedCode, actualCode, output): + assert(expectedCode == actualCode), "{0} with arguments(s) {1} expected exit code {2} but gave exit code {3}. Output was '{4}'.".format(self.ScriptPath, args, expectedCode, actualCode, output) + + def checkOutputContains(self, args, targetOutput, actualOutput): + assert(targetOutput in actualOutput), "{0} with arguments(s) {1} did not contain '{2}'' in the stdout. Output was {3}.".format(self.ScriptPath, args, targetOutput, actualOutput) + + def checkOutputNotContains(self, args, targetOutput, actualOutput): + assert(targetOutput not in actualOutput), "{0} with arguments(s) {1} unexpectedly contained '{2}' in the stdout. Output was {3}.".format(self.ScriptPath, args, targetOutput, actualOutput) + + def checkOutputEqual(self, args, targetOutput, actualOutput): + assert(targetOutput == actualOutput), "{0} with arguments(s) {1} did not produce output equal to '{2}'. 
Output was {3}.".format(self.ScriptPath, args, targetOutput, actualOutput) + +############################################################################################################################################################################### +############################################################################################################################################################################### +# Test code for input validation +############################################################################################################################################################################### +############################################################################################################################################################################### + +class MigrationScriptInputTestSuite(MigrationScriptTestSuite): + + def testMissingMandatoryArgsCausesExit(self): + print("Testing script behavior against missing arguments...") + + genericArgMissingStr = "At least one mandatory argument missing" + specificArgMissingStr = "{0} missing from command line arguments." 
+ + # No args + print("Testing script behavior with no arguments...") + scriptNoArgsExitCode, scriptNoArgsStdout = runCommand(self.ScriptPath) + self.checkExitCodeEqual(None, EXIT_BAD_ARGS, scriptNoArgsExitCode, scriptNoArgsStdout) + self.checkOutputContains(None, genericArgMissingStr, scriptNoArgsStdout) + + # Test one missing arg at a time + for entry in self.BaseArguments: + missingStr = FlagsMap[entry] + print("Testing script behavior when missing argument {0}...".format(missingStr)) + + testArgs = self.getScriptArgumentsWithOverrides(None, [entry]) + + scriptMissingArgExitCode, scriptMissingArgStdout = runCommand(self.ScriptPath, testArgs) + + self.checkExitCodeEqual(testArgs, EXIT_BAD_ARGS, scriptMissingArgExitCode, scriptMissingArgStdout) + self.checkOutputContains(testArgs, specificArgMissingStr.format(missingStr), scriptMissingArgStdout) + + for entry in FlagsMap: + if FlagsMap[entry] != missingStr: + self.checkOutputNotContains(testArgs, specificArgMissingStr.format(FlagsMap[entry]), scriptMissingArgStdout) + + def testInvalidArgsCausesExit(self): + print("Testing script behavior against invalid arguments...") + + invalidArgStr = "{0} is not a valid entry for {1}" + + invalidTargetArg = ("--target", "invalidtarget") + invalidEnvArg = ("--environment", "invalidenv") + invalidSrcStorageTypeArg = ("--typesrc", "invalidstoragetypesrc") + invalidDestStorageTypeArg = ("--typedest", "invalidstoragetypedest") + invalidClientArg = ("--queryclient", "invalidqueryclient") + + # Check behavior with wrong arguments + for entry in [invalidTargetArg, invalidEnvArg, invalidSrcStorageTypeArg, invalidDestStorageTypeArg, invalidClientArg]: + invalidatedArgs = self.getScriptArgumentsWithOverrides([entry]) + print("Testing script behavior when argument {0} is invalid...".format(entry[0])) + + scriptInvalidArgExitCode, scriptInvalidArgStdout = runCommand(self.ScriptPath, invalidatedArgs) + + self.checkExitCodeEqual(invalidatedArgs, EXIT_BAD_ARGS, scriptInvalidArgExitCode, 
scriptInvalidArgStdout) + self.checkOutputContains(invalidatedArgs, invalidArgStr.format(entry[1], entry[0]), scriptInvalidArgStdout) + + def testInvalidCombinationsCausesExit(self): + print("Testing script behavior against invalid argument combinations...") + + # Non adl missing accountsrc + print("Testing script behavior when type source includes non-adl but no non-adl accounts are specified...") + wasbSrcTypeArg = ("--typesrc", "wasb") + removeAccountList = "--accountsrc" + + testArgs = self.getScriptArgumentsWithOverrides([wasbSrcTypeArg], [removeAccountList]) + nonAdlMissingAcctSrcExitCode, nonAdlMissingAcctSrcStdout = runCommand(self.ScriptPath, testArgs) + + self.checkExitCodeEqual(testArgs, EXIT_BAD_ARGS, nonAdlMissingAcctSrcExitCode, nonAdlMissingAcctSrcStdout) + self.checkOutputContains(testArgs, "Account types other than ADL set but AccountSrc missing from command line arguments.", nonAdlMissingAcctSrcStdout) + + # adl missing adl accounts + print("Testing script behavior when type source includes adl but no adl accounts are specified...") + adlSrcTypeArg = ("--typesrc", "adl") + removeAccountList = "--adlaccounts" + + testArgs = self.getScriptArgumentsWithOverrides([adlSrcTypeArg], [removeAccountList]) + adlMissingAdlAcctSrcExitCode, AdlMissingAdlAcctSrcStdout = runCommand(self.ScriptPath, testArgs) + + self.checkExitCodeEqual(testArgs, EXIT_BAD_ARGS, adlMissingAdlAcctSrcExitCode, AdlMissingAdlAcctSrcStdout) + self.checkOutputContains(testArgs, "ADL cannot be specified as a source type without also specifying ADL account names to be moved", AdlMissingAdlAcctSrcStdout) + + # adl and non default cloud + cloudOptions = ["china", "germany", "usgov"] + + # adl as src + print("Testing script behavior when type source includes adl but non default Azure Cloud is specified...") + adlSrcTypeArg = ("--typesrc", "adl") + adlAccountList = ("--adlaccounts", "foo") + for option in cloudOptions: + environmentOverrideArg = ("--environment", option) + testArgs = 
self.getScriptArgumentsWithOverrides([adlSrcTypeArg, adlAccountList, environmentOverrideArg]) + adlNotSupportedExitCode, adlNotSupportedStdout = runCommand(self.ScriptPath, testArgs) + + self.checkExitCodeEqual(testArgs, EXIT_BAD_ARGS, adlNotSupportedExitCode, adlNotSupportedStdout) + self.checkOutputContains(testArgs, "Cannot include Azure Data Lake as a source type when working in non-default cloud: {0}".format(option), adlNotSupportedStdout) + + # adl as target + print("Testing script behavior when type target is adl but non default Azure Cloud is specified...") + adlDestTypeArg = ("--typedest", "adl") + for option in cloudOptions: + environmentOverrideArg = ("--environment", option) + testArgs = self.getScriptArgumentsWithOverrides([adlDestTypeArg, environmentOverrideArg]) + adlNotSupportedExitCode, adlNotSupportedStdout = runCommand(self.ScriptPath, testArgs) + + self.checkExitCodeEqual(testArgs, EXIT_BAD_ARGS, adlNotSupportedExitCode, adlNotSupportedStdout) + self.checkOutputContains(testArgs, "Cannot include Azure Data Lake as the target type when working in non-default cloud: {0}".format(option), adlNotSupportedStdout) + + # account dest without account type + print("Testing script behavior when destination account is set but account type is missing...") + accountDestArg = ("--accountdest", "foo") + removeAccountTypeDest = ("--typedest") + + testArgs = self.getScriptArgumentsWithOverrides([accountDestArg], [removeAccountTypeDest]) + accountTypeMissingExitCode, accountTypeMissingAcctSrcStdout = runCommand(self.ScriptPath, testArgs) + + self.checkExitCodeEqual(testArgs, EXIT_BAD_ARGS, accountTypeMissingExitCode, accountTypeMissingAcctSrcStdout) + self.checkOutputContains(testArgs, "Destination account cannot be specified without destination account type", accountTypeMissingAcctSrcStdout) + + def testTooManyArgsCausesExit(self): + print("Testing script behavior when too many source values are set...") + + tooManyArgsStr = "Too many entries specified" + 
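Each over-limit case below joins MAX_SRC_ARGS + 1 values into a single comma-separated flag value, which should trip the script's "Too many entries specified" check. A minimal sketch of that construction, assuming the MAX_SRC_ARGS constant defined at the top of the test script:

```python
MAX_SRC_ARGS = 10  # mirrors the constant defined earlier in this test script

# Build a source list with one more entry than the script allows,
# i.e. eleven comma-separated values: "0,1,2,...,10".
over_limit = ",".join(str(i) for i in range(MAX_SRC_ARGS + 1))

print(len(over_limit.split(",")))
# 11
```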
typeSrcArg = ("--typesrc", "wasb,adl") + + accountSrcArg = ("--accountsrc") + adlAccountSrcArg = ("--adlaccounts") + containerSrcArg = ("--containersrc") + pathSrcArg = ("--pathsrc") + + srcArgs = [accountSrcArg, adlAccountSrcArg, containerSrcArg, pathSrcArg] + for optionStr in srcArgs: + print("Testing script behavior when too many source values are set for {0}...".format(optionStr)) + option = [optionStr, (",".join([str(i) for i in range(MAX_SRC_ARGS+1)]))] + overrides = [typeSrcArg, option] + + for value in [item for item in srcArgs if item != optionStr]: + overrides.append([value, ARG_WILDCARD]) + + testArgs = self.getScriptArgumentsWithOverrides(overrides) + + tooManyArgsExitCode, tooManyArgsStdout = runCommand(self.ScriptPath, testArgs) + self.checkExitCodeEqual(testArgs, EXIT_BAD_ARGS, tooManyArgsExitCode, tooManyArgsStdout) + self.checkOutputContains(testArgs, tooManyArgsStr, tooManyArgsStdout) + + def testUnsupportedArgsCausesExit(self): + print("Testing script behavior when an unsupported argument is specified...") + + testArgs = self.getScriptArgumentsWithOverrides([("--unsupportedTest", "argument")]) + + unsupportedArgExitCode, unsupportedArgStdout = runCommand(self.ScriptPath, testArgs) + self.checkExitCodeEqual(testArgs, EXIT_BAD_ARGS, unsupportedArgExitCode, unsupportedArgStdout) + self.checkOutputContains(testArgs, "unsupportedTest is an unsupported flag.", unsupportedArgStdout) + + def testNoDestinationArgsCausesNoAction(self): + print("Testing script behavior when there is no migration to do...") + + testArgs = self.getScriptArgumentsWithOverrides() + noMigrationExitCode, noMigrationStdout = runCommand(self.ScriptPath, testArgs) + self.checkExitCodeEqual(testArgs, EXIT_NO_CHANGE, noMigrationExitCode, noMigrationStdout) + self.checkOutputContains(testArgs, "No destination attributes set. 
Nothing to do.", noMigrationStdout) + +############################################################################################################################################################################### +############################################################################################################################################################################### +# Tests for query validation +############################################################################################################################################################################### +############################################################################################################################################################################### + +class MigrationScriptExecutionTestSuite(MigrationScriptTestSuite): + def __init__( + self, + Name, + ScriptPath, + Server, + DB, + User, + Pass, + Driver, + CleanupOnExit + ): + super().__init__(Name, ScriptPath, Server, DB, User, Pass, Driver, CleanupOnExit) + + self.TableSchema = {} + self.TableSchema[MOCKUP_FUNC_RU] = [("RESOURCE_URI", STR_TYPE), ("FUNC_ID", BIGINT_TYPE + ' ' + SET_PRMK)] + self.TableSchema[MOCKUP_DBS] = [("DB_LOCATION_URI", STR_TYPE), ("DB_ID", BIGINT_TYPE + ' ' + SET_PRMK)] + self.TableSchema[MOCKUP_SDS] = [("LOCATION", STR_TYPE), ("SD_ID", BIGINT_TYPE + ' ' + SET_PRMK)] + self.TableSchema[MOCKUP_SKEWED_COL_VALUE_LOC_MAP] = [("LOCATION", STR_TYPE), ("SD_ID", BIGINT_TYPE + ' ' + SET_PRMK)] + + # Make sure we are not about to test on a database with the tables already present + with self.getODBCConnection().cursor() as crsr: + for table in self.TableSchema: + self.checkTableNotExists(table, crsr) + + liveServer = ("--metastoreserver", self.Server) + liveDB = ("--metastoredatabase", self.DB) + liveUser = ("--metastoreuser", self.User) + livePassword = ("--metastorepassword", self.Pass) + # Override the default arguments so the queries run live + self.BaseArguments = 
self.getScriptArgumentsWithOverrides([liveServer, liveDB, liveUser, livePassword]) + self.SampleOverrides = [ + ("--accountsrc", "gopher"), + ("--adlaccounts", "gopher"), + ("--containersrc", "bravo"), + ("--pathsrc", ARG_WILDCARD), + ("--typesrc", "abfs,abfss,wasb,wasbs"), + ("--accountdest", "echo"), + ("--typedest", "wasbs"), + ] + + self.SampleData = open(path.join(ASSETS_DIR, URI_FILE), "r").read().splitlines() + + def __exit__(self, exc_type, exc_value, traceback): + if self.TablesCreated and self.CleanupOnExit: # Don't clean up the tables if we didn't make them + with self.getODBCConnection().cursor() as crsr: + self.dropURITables(crsr) + + super().__exit__(exc_type, exc_value, traceback) + + def checkMigrationScriptResult(self, testArgs, testExitCode, testStdout, expectedExitCode, expectedOutput, resultBefore, resultAfter): + self.checkExitCodeEqual(testArgs, expectedExitCode, testExitCode, testStdout) + self.checkOutputContains(testArgs, expectedOutput, testStdout) + self.checkURIResultMatch(testArgs, resultBefore, resultAfter, testStdout) + + def checkTableNotExists(self, table, crsr): + errorMsg = "Table {0} already exists in database {1} on server {2}. Tests will be run against this table so use a database where {0} does not exist.".format(table, self.DB, self.Server) + assert(not self.tableExistsQuery(table, crsr)), errorMsg + + def checkTableUnchanged(self, tableRows): + errorMsg = "Query result was unexpectedly different from the original sample data. Query result was: {0}.".format(tableRows) + assert(listEqUnordered(tableRows, self.SampleData)), errorMsg + + def checkURIResultMatch(self, args, resultBefore, resultAfter, stdout): + errorMsg = "{0} with arguments: {1} unexpectedly did not produce the correct migration result. \ + URIs matching post-migration expression before execution were: {2} and after execution were: {3}. 
\ + Stdout was: {4}.".format( + self.ScriptPath, + args, + resultBefore, + resultAfter, + stdout + ) + assert(listEqUnordered(resultBefore, resultAfter)), errorMsg + + def createTable(self, table, columns, crsr): + queryStr = "create table {0} ({1});".format(table, ', '.join(colname + ' ' + coltype for colname, coltype in columns)) + print("Creating table {0} with query: {1}".format(table, queryStr)); + crsr.execute(queryStr) + + def dropTable(self, tableName, crsr): + print("Dropping table {0}.".format(tableName)) + queryStr = "drop table if exists {0}".format(tableName) + crsr.execute(queryStr) + + def dropURITables(self, crsr): + for tableName in self.TableSchema: + self.dropTable(tableName, crsr) + self.TablesCreated = False + + def getODBCConnection(self): + return connect('DRIVER={' + self.Driver + '};SERVER=' + self.Server + '.database.windows.net;PORT=1433;DATABASE=' + self.DB + ';UID=' + self.User + ';PWD=' + self.Pass) + + def getAllURIsFromTable(self, tableName=MOCKUP_SDS): + with self.getODBCConnection().cursor() as crsr: + queryStr = "select {0} as uri from {1}".format( + self.TableSchema[tableName][0][0], + tableName + ) + print("Gathering all URIs in {0}...".format(tableName)) + queryResult = crsr.execute(queryStr) + finalResult = ([] if queryResult.rowcount == 0 else [entry.uri for entry in queryResult.fetchall()]) + + return finalResult + + def getParameterizedURI(self, replacementKeys, args=None, cloudOverride=None): + if args is None: + args = dict(self.BaseArguments) + + genericURI = "{0}://{1}@{2}.%.{4}/{3}/%".format( + *[args[item] if item in args else '%' for item in replacementKeys ], cloudOverride or "%" + ) + return genericURI + + def getMatchingURIsFromTableWithID(self, tableName, genericURI): + with self.getODBCConnection().cursor() as crsr: + queryStr = "select {0}, {1} from {2} where {0} like ?".format( + self.TableSchema[tableName][0][0], + self.TableSchema[tableName][1][0], + tableName + ) + print("Gathering all entries in {0} that 
match query: {1}".format(tableName, queryStr).replace('?', genericURI)) + queryResult = crsr.execute(queryStr, genericURI) + finalResult = ([] if queryResult.rowcount == 0 else [tuple(entry) for entry in queryResult.fetchall()]) + + return finalResult + + def getTransformedURI(self, uri, args, cloudEndpoint): + newuri = uri + + if "--typedest" in args: + newuri = sub("(" + args["--typesrc"].replace(",", "|").replace(ARG_WILDCARD, ".*") + ")://", escape(args["--typedest"])+"://", newuri) + + if args["--typedest"] in "wasbs": + newuri = sub( escape(".") + "[a-z0-9.]+" + escape("/"), ".blob." + cloudEndpoint + "/", newuri) + elif args["--typedest"] in "abfss": + newuri = sub( escape(".") + "[a-z0-9.]+" + escape("/"), ".dfs." + cloudEndpoint + "/", newuri) + else: # adl + newuri = sub( escape(".") + "[a-z0-9.]+" + escape("/"), ".azuredatalakestore.net/", newuri) + + if "--containerdest" in args: + newuri = sub("://(" + args["--containersrc"].replace(",", "|").replace(ARG_WILDCARD, "[a-z0-9]+") + ")@", "://" + escape(args["--containerdest"]) + "@", newuri) + + if "--accountdest" in args: + newuri = sub("@(" + args["--accountsrc"].replace(",", "|").replace(ARG_WILDCARD, "[a-z0-9]+") + ")" + escape("."), "@" + escape(args["--accountdest"]) + ".", newuri) + + if "--pathdest" in args: + pathsubstr = newuri.split(".") + pathRegex = "/(" + args["--pathsrc"].replace(",", "|").replace(ARG_WILDCARD, ".*") + ")/" + + # Make sure the match is a prefix + if search(pathRegex, pathsubstr[-1][pathsubstr[-1].find('/'):len(pathsubstr[-1])]).start() == 0: + pathsubstr[-1] = sub( "/(" + args["--pathsrc"].replace(",", "|").replace(ARG_WILDCARD, ".*") + ")/", "/" + escape(args["--pathdest"]) + "/", pathsubstr[-1]) + + newuri = ".".join(pathsubstr) + + return newuri + + def getURITransformationOutput(self, args, cloudEndpoint): + matches = [] + srcFlags = ["--typesrc", "--containersrc", "--accountsrc", "--pathsrc"] + + # get all entries in sample data that match source uri + srcMatchURI = 
(self.getParameterizedURI(["--typesrc", "--containersrc", "--accountsrc", "--pathsrc"], args, cloudEndpoint)).replace("%", "*") + + for index, entry in enumerate(self.SampleData): + if( + (args["--typesrc"] == ARG_WILDCARD or any(acctype+"://" in entry for acctype in args["--typesrc"].split(','))) and + (args["--containersrc"] == ARG_WILDCARD or any("://"+container+"@" in entry for container in args["--containersrc"].split(','))) and + (args["--accountsrc"] == ARG_WILDCARD or any("@"+acc+"." in entry for acc in args["--accountsrc"].split(','))) and + (args["--pathsrc"] == ARG_WILDCARD or any(rootpathInURI(path, entry) for path in args["--pathsrc"].split(','))) and + cloudEndpoint in entry + ): + # Match found + matches.append( Match(index+1, entry, self.getTransformedURI(entry, args, cloudEndpoint)) ) + + return matches + + def insertIntoURITableWithIDs(self, table, uricol, idcol, data, crsr): + insertionString = ', '.join('('+str(index+1)+', \''+item+'\')' for index, item in enumerate(data)) + queryStr = "insert into {0} ({1}, {2}) values {3};".format(table, idcol, uricol, insertionString) + crsr.execute(queryStr) + + def loadDataIntoTables(self, crsr): + for table in self.TableSchema: + self.checkTableNotExists(table, crsr) + self.createTable(table, self.TableSchema[table], crsr) + self.insertIntoURITableWithIDs(table, self.TableSchema[table][0][0], self.TableSchema[table][1][0], self.SampleData, crsr) + self.TablesCreated = True + + def loadTables(self): + print("Dropping test tables and re-loading...") + with self.getODBCConnection().cursor() as crsr: + self.dropURITables(crsr) + self.loadDataIntoTables(crsr) + + def runMigrationTest(self, testArgs, tableName=MOCKUP_SDS, azureEnv="default", cloudEndpoint="core.windows.net"): + migrationResultURI = self.getParameterizedURI(["--typedest", "--containerdest", "--accountdest", "--pathdest"], testArgs, cloudEndpoint) + + # URIs that already matched the migration result before the migration runs + 
matchingURIsBeforeDryExec = self.getMatchingURIsFromTableWithID(tableName, migrationResultURI) + # A list of triples for URIs that match the src flags: index, uri, transformeduri + expectedURITransformationOutput = self.getURITransformationOutput(testArgs, cloudEndpoint) + # Migration source flag matches in the form of the output the script produces + expectedOutputResult = "\n".join([",".join([str(x) for x in item]) for item in expectedURITransformationOutput]) + + # Dry run: Check exit code, output match and no db change + print("Executing dry run...") + testExitCode, testStdout = runCommand(self.ScriptPath, testArgs) + # Since this was a dry run, table should not have any new URIs matching the migration result + matchingURIsAfterDryRunExec = self.getMatchingURIsFromTableWithID(tableName, migrationResultURI) + self.checkMigrationScriptResult(testArgs, testExitCode, testStdout, EXIT_DRY_RUN, expectedOutputResult, matchingURIsBeforeDryExec, matchingURIsAfterDryRunExec) + + # Live run: Check exit code, output match and exists db change + print("Executing live...") + testExitCode, testStdout = runCommand(self.ScriptPath, testArgs, ["--liverun"]) + matchingURIsAfterLiveRunExec = self.getMatchingURIsFromTableWithID(tableName, migrationResultURI) + # Since this was a live run, table should have the matching URIs from before exec, as well as some new ones as predicted by getURITransformationOutput() + matchingURIsBeforeLiveExec = matchingURIsBeforeDryExec + [(match.transformedentry, match.index) for match in expectedURITransformationOutput] + self.checkMigrationScriptResult(testArgs, testExitCode, testStdout, EXIT_SUCCESS, expectedOutputResult, matchingURIsBeforeLiveExec, matchingURIsAfterLiveRunExec) + + def tableExistsQuery(self, table, crsr): + existsQueryResult = crsr.execute("select OBJECT_ID('{0}') as result".format(table)); + return existsQueryResult.fetchone().result != None + + def testSampleMigrationInAllClouds(self): + self.loadTables() + cloudOptions = {} + 
cloudOptions["default"] = "core.windows.net" + cloudOptions["china"] = "core.chinacloudapi.cn" + cloudOptions["germany"] = "core.cloudapi.de" + cloudOptions["usgov"] = "core.usgovcloudapi.net" + + for option in cloudOptions: + print("Testing script behavior when executing a standard migration in Azure environment: {0}...".format(option)) + testArgs = self.getScriptArgumentsWithOverrides([("--environment", option)] + self.SampleOverrides) + self.runMigrationTest(testArgs, MOCKUP_SDS, option, cloudOptions[option]) + + def testSampleMigrationInAllTables(self): + self.loadTables() + tables = [MOCKUP_SDS, MOCKUP_DBS, MOCKUP_FUNC_RU, MOCKUP_SKEWED_COL_VALUE_LOC_MAP] + + for table in tables: + print("Testing script behavior when executing a standard migration against metastore table: {0}...".format(table)) + testArgs = self.getScriptArgumentsWithOverrides([("--target", table)] + self.SampleOverrides) + self.runMigrationTest(testArgs, table, "default", "core.windows.net") + + def testSampleMigrationInAllSupportedClients(self): + clientArgs = ["beeline", "sqlcmd"] + + for client in clientArgs: + print("Testing script behavior when executing a standard migration using query client: {0}...".format(client)) + self.loadTables() + testArgs = self.getScriptArgumentsWithOverrides([("--queryclient", client)] + self.SampleOverrides) + self.runMigrationTest(testArgs, MOCKUP_SDS, "default", "core.windows.net") + + def testAdlMigrations(self): + adlAccounts = ["gopher", "echo"] + adlAccountsAsStr = ",".join(adlAccounts) + + argOverrides = [ + ("--containersrc", ARG_WILDCARD), + ("--pathsrc", ARG_WILDCARD), + ] + + adlSrcArgs = [ + ("--typesrc", "adl"), + ("--adlaccounts", adlAccountsAsStr) + ] + + adlDestArgs = [ + ("--accountdest", "newadlacct"), + ("--typedest", "adl") + ] + + nonAdlSrcArgs = [ + ("--typesrc", "wasb"), + ("--accountsrc", "echo") + ] + + nonAdlDestArgs = [ + ("--accountdest", "newwasbacct"), + ("--typedest", "wasb") + ] + + print("Testing script behavior when 
executing migrations involving Azure Data Lake accounts...") + + print("Testing script behavior for ADL to ADL migration...") + self.loadTables() + testArgs = self.getScriptArgumentsWithOverrides(argOverrides + adlSrcArgs + adlDestArgs) + self.runMigrationTest(testArgs) + + print("Testing script behavior for ADL to non ADL migration...") + self.loadTables() + testArgs = self.getScriptArgumentsWithOverrides(argOverrides + adlSrcArgs + nonAdlDestArgs) + self.runMigrationTest(testArgs) + + print("Testing script behavior for non ADL to ADL migration...") + self.loadTables() + testArgs = self.getScriptArgumentsWithOverrides(argOverrides + nonAdlSrcArgs + adlDestArgs) + self.runMigrationTest(testArgs, cloudEndpoint=".azuredatalakestore.net") + + def testContainerPathPatternMatching(self): + argOverrides = [ + ("--containersrc", ARG_WILDCARD), + ("--typesrc", ARG_WILDCARD), + ("--accountsrc", ARG_WILDCARD), + ("--adlaccounts", ARG_WILDCARD), + ("--accountdest", "newwasbacct"), + ("--typedest", "wasb") + ] + + """ + warehouse/hivetables should move + warehouse/hive should not move + hive should not move + managed/table should still end with hive + warehouse,managed/tables/hive should do a partial replace and a full replace + """ + print("Testing script behavior when executing migrations involving various possible container paths...") + pathOptions = ["warehouse/hivetables", "warehouse/hive", "hive", "managed/tables", "warehouse,managed/tables/hive", "managed/tables/hive,managed/tables"] + pathResult = "resultpath" + for option in pathOptions: + self.loadTables() + print("Testing script behavior when replacing path(s) {0} with path {1}".format(option, pathResult)) + pathArgs = [("--pathsrc", option), ("--pathdest", pathResult)] + testArgs = self.getScriptArgumentsWithOverrides(argOverrides + pathArgs) + self.runMigrationTest(testArgs) + + def testNonMatchingURIsUnchanged(self): + self.loadTables() + argOverrides = [ + ("--containersrc", "water"), + ("--typesrc", "abfss"), + 
("--accountsrc", "xylophone"), + ("--adlaccounts", "yellow"), + ("--accountdest", "zebra"), + ("--typedest", "wasb") + ] + + print("Testing script behavior when no migration matches are expected...") + testArgs = self.getScriptArgumentsWithOverrides(argOverrides) + self.runMigrationTest(testArgs) + + def testAllMigrationAspects(self): + self.loadTables() + argOverrides = self.SampleOverrides + [ + ("--containerdest", "newctr"), + ("--pathdest", "newpath") + ] + + print("Testing script behavior when all possible migration parameters are specified...") + testArgs = self.getScriptArgumentsWithOverrides(argOverrides) + self.runMigrationTest(testArgs) + + def testSQLFailureCausesNoChange(self): + self.loadTables() + print("Testing script behavior when there is an unexpected failure during SQL execution.") + + execution = [self.ScriptPath] + dctToList(self.getScriptArgumentsWithOverrides(self.SampleOverrides)) + ["--liverun"] + migrationProcess = Popen( + execution, + stderr = PIPE, + stdout = PIPE, + encoding = 'utf8' # Communicating with the script requires utf-8, which requires python 3.6 + ) + + migrationScriptPoll = poll() + migrationScriptPoll.register(migrationProcess.stdout) + while migrationProcess.returncode is None: + if migrationScriptPoll.poll(0.5): + stdout = migrationProcess.stdout.readline() + if "Writing migration results to table" in stdout: + migrationProcess.kill() # Kill the process during the write + migrationProcess.communicate() + + with self.getODBCConnection().cursor() as crsr: + crsr.execute("drop table if exists locationupdate;") + + # Select the URIs from MOCKUP_SDS to make sure no changes took effect despite killing the process in the middle of writing to MOCKUP_SDS + uriResult = self.getAllURIsFromTable() + self.checkTableUnchanged(uriResult) + + def testMigrationCommandIdempotent(self): + print("Testing migration script idempotency by executing the same migration twice.") + self.loadTables() + testArgs = 
self.getScriptArgumentsWithOverrides(self.SampleOverrides) + + print("Running first execution...") + self.runMigrationTest(testArgs) + + # Set the sample data to reflect the change made to the database + print("Updating sample data to reflect migration result...") + self.SampleData = self.getAllURIsFromTable() + + print("Running second execution...") + self.runMigrationTest(testArgs) + + # This test alters the sample data to test how the data is impacted by repeated migrations + # So the sample data must be replaced on completion + with open(path.join(ASSETS_DIR, URI_FILE), "r") as uriFile: + self.SampleData = uriFile.read().splitlines() + + def testMaximumSizeWhereClause(self): + self.loadTables() + longArgStr = "alpha,bravo,charlie,delta,echo,foxtrot,gopher,hedgehog,igloo,jupiter" + argOverrides = [ + ("--containersrc", longArgStr), + ("--pathsrc", longArgStr), + ("--accountsrc", longArgStr), + ("--adlaccounts", longArgStr), + ("--typesrc", "wasb,adl"), + ("--accountdest", "newacct"), + ("--typedest", "wasb") + ] + + print("Testing script behavior when the maximum number of arguments is passed to the script...") + testArgs = self.getScriptArgumentsWithOverrides(argOverrides) + self.runMigrationTest(testArgs) + # TODO: check that the script exits with code 25 here + +############################################################################################################################################################################### +############################################################################################################################################################################### +# Execution +############################################################################################################################################################################### +############################################################################################################################################################################### + +def main(): + arguments = getArgs() + + 
if INPUT_TEST_STR in arguments.suites: + with MigrationScriptInputTestSuite("InputTests", arguments.scriptPath) as testSuite: + + # Tests for argument syntax + testSuite.testMissingMandatoryArgsCausesExit() + testSuite.testInvalidArgsCausesExit() + testSuite.testInvalidCombinationsCausesExit() + testSuite.testUnsupportedArgsCausesExit() + + # Tests for argument semantics + testSuite.testTooManyArgsCausesExit() + testSuite.testNoDestinationArgsCausesNoAction() + + if EXEC_TEST_STR in arguments.suites: + with MigrationScriptExecutionTestSuite( + "ExecTests", + arguments.scriptPath, + arguments.server, + arguments.database, + arguments.username, + arguments.password, + arguments.driver, + arguments.cleanupOnExit + ) as testSuite: + + # Tests for alternative clouds, metastore tables and query clients + testSuite.testSampleMigrationInAllClouds() + testSuite.testSampleMigrationInAllTables() + testSuite.testSampleMigrationInAllSupportedClients() + + # Tests for special cases + testSuite.testAdlMigrations() + testSuite.testContainerPathPatternMatching() + testSuite.testNonMatchingURIsUnchanged() + + # Tests for script design + testSuite.testAllMigrationAspects() + testSuite.testSQLFailureCausesNoChange() + testSuite.testMigrationCommandIdempotent() + testSuite.testMaximumSizeWhereClause() + + +if __name__ == "__main__": + main() diff --git a/HiveMetastoreMigration/readme.md b/HiveMetastoreMigration/readme.md new file mode 100644 index 0000000..004c451 --- /dev/null +++ b/HiveMetastoreMigration/readme.md @@ -0,0 +1,130 @@ +# Metastore Migration Shell Script +A shell script for bulk-editing Azure Storage URIs inside a Hive metastore + +## Overview + +The metastore migration script is a tool for migrating URIs from one or more sources to a fixed destination. This script eliminates the need to perform manual migrations (such as hand-written `UPDATE SDS SET LOCATION = ...` SQL statements) against the metastore. 
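To make the transformation concrete, the rewrite that gets applied to each matching URI can be pictured as a tiny shell function. This is an illustrative sketch only, **not** part of MigrateMetastore.sh (which performs the equivalent rewrite in bulk via SQL); `rewrite_uri` and its arguments are hypothetical names, and it assumes the `<type>://<container>@<account>.<endpoint>/<path>` URI layout used throughout this document:

```shell
#!/bin/bash
# Illustrative only -- a per-URI sketch of the rewrite the migration script
# performs in bulk. The function name and arguments are hypothetical.
rewrite_uri() {
    local uri="$1" typedest="$2" accountdest="$3"
    # Capture the container (\1) and the endpoint suffix plus path (\2), then
    # substitute the destination scheme and account while re-emitting the rest.
    echo "$uri" | sed -E "s#^[a-z]+://([^@]+)@[^.]+\.(.*)#${typedest}://\1@${accountdest}.\2#"
}

# "Move this WASB URI to WASBS" while keeping account, container and path:
rewrite_uri "wasb://alpha@echo.blob.core.windows.net/warehouse/hivetables/table1" "wasbs" "echo"
# -> wasbs://alpha@echo.blob.core.windows.net/warehouse/hivetables/table1
```

Because the container and the endpoint-plus-path are captured and re-emitted unchanged, only the scheme and account name move; the real script generalizes this idea to whole sets of source types, accounts, containers and paths at once.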
+ +The purpose of this script is to allow bulk-editing of Azure Storage URIs inside Hive metastores. Actions such as (but not limited to) those described below require this script. This script is also a practical prerequisite to running data migrations between storage accounts. + +### Use Cases and Motivation + +Some sample use cases for the script are as follows. These use cases give context as to why this script is important: + +1. Suppose WASB secure transfer has recently been enabled or disabled for a given storage account. URIs that Hive queries resolve will then begin with wasb:// or wasbs:// as appropriate. However, URIs in the Hive metastore will not undergo an automatic scheme update. This update must be made explicitly. In this case, the migration script can help by doing the following: + +"Move my WASB accounts Andy, Bob and Charles to WASBS" + +``` +> ./MigrateMetastore.sh +--metastoreserver myserver +--metastoredatabase mydb +--metastoreuser myuser +--metastorepassword mypw +--typesrc wasb +--accountsrc Andy,Bob,Charles +--containersrc '*' +--pathsrc '*' +--target SDS +--queryclient beeline +--typedest wasbs +``` + + +2. In the not-too-distant future, new types of storage accounts will be available across different clouds. For example, ADLS gen1 is expected for release in the Azure US Govcloud next year. ADLS gen2 is also expected to GA across all clouds. The migration script can also help with tasks similar to the following: + +"I am a customer in the Azure US Govcloud and now that ADLS gen1 is available, I need to make some changes. I have two WASB accounts was1 and was2, each with containers Echo, Charlie and Zebra. 
I want to move all the tables under these accounts and containers to my new ADLS account 'fastnewaccount', without changing container names": + +``` +> ./MigrateMetastore.sh +--metastoreserver myserver +--metastoredatabase mydb +--metastoreuser myuser +--metastorepassword mypw +--typesrc wasb +--accountsrc was1,was2 +--containersrc Echo,Charlie,Zebra +--pathsrc '*' +--target SDS +--queryclient beeline +--environment usgov +--typedest adl +--accountdest fastnewaccount +``` + +"I am a customer in the public cloud and now that ADLS gen2 is available, I need to make some changes. I have two ADL accounts adl1 and adl2, each with containers Echo, Charlie and Zebra. I want to move all the tables under these accounts and containers to my new ADLS account 'fastnewaccount', but I want them to be under the container 'migration'": + +``` +> ./MigrateMetastore.sh +--metastoreserver myserver +--metastoredatabase mydb +--metastoreuser myuser +--metastorepassword mypw +--typesrc adl +--adlaccounts adl1,adl2 +--containersrc Echo,Charlie,Zebra +--pathsrc '*' +--target SDS +--queryclient beeline +--typedest abfs +--accountdest fastnewaccount +--containerdest migration +``` + +3. Another upcoming GA feature in Azure is HDInsight 4.0. Some defaults in HDI4 will be changing with respect to where Hive data is stored. These new defaults will result in changes to the Hive metastore. For example, the old default path for managed (internal) tables is /warehouse/tablespace/, but in HDI4 the default will be /warehouse/managed/hive. Knowing this, the metastore migration script can do the following: + +"Move my tables stored in /warehouse/tablespace/ to /warehouse/managed/hive/ without squashing any subdirectories. 
However, make sure this is only done for containers named test, dev or build": + +``` +> ./MigrateMetastore.sh +--metastoreserver myserver +--metastoredatabase mydb +--metastoreuser myuser +--metastorepassword mypw +--typesrc '*' +--adlaccounts '*' +--accountsrc '*' +--containersrc test,dev,build +--pathsrc warehouse/tablespace +--target SDS +--queryclient beeline +--pathdest warehouse/managed/hive +``` + +## Execution instructions + +The migration script takes a fairly large number of arguments, some of which are optional. Note that it is _not_ necessary to execute this script from an HDInsight cluster: the script can be executed using either `beeline` or `sqlcmd`. The script requires the username, password, database name and server name of the Hive metastore in which URIs will be migrated. If the metastore to be edited is an internal metastore, its information is available via Ambari. To access the password for an internal Hive metastore, run this command while `ssh`'d into the cluster headnode: + +``` +sudo java -cp "/var/lib/ambari-agent/cred/lib/*" org.apache.ambari.server.credentialapi.CredentialUtil get javax.jdo.option.connectionpassword -provider jceks://file/etc/hive/conf/conf.server/hive-site.jceks +``` + +1. Make sure one of the supported query command-line tools is installed. If the script is being run from an HDInsight cluster, use `beeline`. Alternatively, the `sqlcmd` tool can be installed from: https://docs.microsoft.com/en-us/sql/tools/sqlcmd-utility?view=sql-server-2017 + +2. Download and execute ./MigrateMetastore.sh without any arguments and read through the doc string for detailed instructions. + +3. Execute ./MigrateMetastore.sh with arguments in the style shown above (without line breaks). The examples serve as a good starting point. +* **Note**: the script does not perform any action unless the flag `--liverun` is also used. 
If this flag is omitted, the migration result will instead be written to stdout (`--liverun` also writes to stdout, but it writes to your database too!). + +## Testing instructions + +The migration script is also accompanied by a suite of tests. These tests make sure the script behaves as expected, and they also provide further examples. The tests themselves have a few dependencies that must be accounted for: + +1. The tests must run on Python **3.6**. Run `python3 -V` to check the installed Python version. If the version is not 3.6, Python 3.6 can be installed from: https://www.python.org/downloads/release/python-360/. + +2. The tests validate the migration script results by using `pyodbc`. Use `pip` to install pyodbc with `pip3 install pyodbc`. If Python 3.6 is not the default Python installation, you may need to install pip for Python 3.6 with `python3.6 get-pip.py` before installing pyodbc. + +3. The tests require a SQL server and database instance, where a mockup metastore will be created from sample data. An actual Hive metastore should **not** be used as the input for the tests. + +4. 
With all dependencies and inputs ready, execute the tests as follows: +``` +> python3.6 MigrateMetastoreTests.py +./MigrateMetastore.sh +--server myserver +--database mydatabase +--username user +--password mypw +--driver 'ODBC Driver 17 for SQL Server' +--testSuites All +--cleanup +``` \ No newline at end of file diff --git a/HiveMetastoreMigration/test-resources/mockup-mandatory-arguments b/HiveMetastoreMigration/test-resources/mockup-mandatory-arguments new file mode 100644 index 0000000..915ecfb --- /dev/null +++ b/HiveMetastoreMigration/test-resources/mockup-mandatory-arguments @@ -0,0 +1,10 @@ +--metastoreserver Server +--metastoredatabase Database +--metastoreuser Username +--metastorepassword Password +--typesrc abfs +--accountsrc AccountSrc +--containersrc ContainerSrc +--pathsrc RootpathSrc +--target SDS +--queryclient beeline \ No newline at end of file diff --git a/HiveMetastoreMigration/test-resources/mockup-metastore-uris b/HiveMetastoreMigration/test-resources/mockup-metastore-uris new file mode 100644 index 0000000..ddc7d92 --- /dev/null +++ b/HiveMetastoreMigration/test-resources/mockup-metastore-uris @@ -0,0 +1,85 @@ +wasb://alpha@echo.blob.core.chinacloudapi.cn/warehouse/hivetables/table1 +wasb://bravo@foxtrot.blob.core.chinacloudapi.cn/managed/tables/hive/table2 +wasb://charlie@gopher.blob.core.chinacloudapi.cn/warehouse/hivetables/table3 +wasb://delta@echo.blob.core.chinacloudapi.cn/managed/tables/hive/table4 +wasb://alpha@foxtrot.blob.core.chinacloudapi.cn/warehouse/hivetables/table5 +wasbs://bravo@gopher.blob.core.chinacloudapi.cn/managed/tables/hive/table6 +wasbs://charlie@echo.blob.core.chinacloudapi.cn/warehouse/hivetables/table7 +wasbs://delta@foxtrot.blob.core.chinacloudapi.cn/managed/tables/hive/table8 +wasbs://alpha@gopher.blob.core.chinacloudapi.cn/warehouse/hivetables/table9 +wasbs://bravo@echo.blob.core.chinacloudapi.cn/managed/tables/hive/table10 +abfs://charlie@foxtrot.dfs.core.chinacloudapi.cn/warehouse/hivetables/table11 
+abfs://delta@gopher.dfs.core.chinacloudapi.cn/managed/tables/hive/table12 +abfs://alpha@echo.dfs.core.chinacloudapi.cn/warehouse/hivetables/table13 +abfs://bravo@foxtrot.dfs.core.chinacloudapi.cn/managed/tables/hive/table14 +abfs://charlie@gopher.dfs.core.chinacloudapi.cn/warehouse/hivetables/table15 +abfss://delta@echo.dfs.core.chinacloudapi.cn/managed/tables/hive/table16 +abfss://alpha@foxtrot.dfs.core.chinacloudapi.cn/warehouse/hivetables/table17 +abfss://bravo@gopher.dfs.core.chinacloudapi.cn/managed/tables/hive/table18 +abfss://charlie@echo.dfs.core.chinacloudapi.cn/warehouse/hivetables/table19 +abfss://delta@foxtrot.dfs.core.chinacloudapi.cn/managed/tables/hive/table20 +wasb://alpha@gopher.blob.core.usgovcloudapi.net/warehouse/hivetables/table21 +wasb://bravo@echo.blob.core.usgovcloudapi.net/managed/tables/hive/table22 +wasb://charlie@foxtrot.blob.core.usgovcloudapi.net/warehouse/hivetables/table23 +wasb://delta@gopher.blob.core.usgovcloudapi.net/managed/tables/hive/table24 +wasb://alpha@echo.blob.core.usgovcloudapi.net/warehouse/hivetables/table25 +wasbs://bravo@foxtrot.blob.core.usgovcloudapi.net/managed/tables/hive/table26 +wasbs://charlie@gopher.blob.core.usgovcloudapi.net/warehouse/hivetables/table27 +wasbs://delta@echo.blob.core.usgovcloudapi.net/managed/tables/hive/table28 +wasbs://alpha@foxtrot.blob.core.usgovcloudapi.net/warehouse/hivetables/table29 +wasbs://bravo@gopher.blob.core.usgovcloudapi.net/managed/tables/hive/table30 +abfs://charlie@echo.dfs.core.usgovcloudapi.net/warehouse/hivetables/table31 +abfs://delta@foxtrot.dfs.core.usgovcloudapi.net/managed/tables/hive/table32 +abfs://alpha@gopher.dfs.core.usgovcloudapi.net/warehouse/hivetables/table33 +abfs://bravo@echo.dfs.core.usgovcloudapi.net/managed/tables/hive/table34 +abfs://charlie@foxtrot.dfs.core.usgovcloudapi.net/warehouse/hivetables/table35 +abfss://delta@gopher.dfs.core.usgovcloudapi.net/managed/tables/hive/table36 
+abfss://alpha@echo.dfs.core.usgovcloudapi.net/warehouse/hivetables/table37 +abfss://bravo@foxtrot.dfs.core.usgovcloudapi.net/managed/tables/hive/table38 +abfss://charlie@gopher.dfs.core.usgovcloudapi.net/warehouse/hivetables/table39 +abfss://delta@echo.dfs.core.usgovcloudapi.net/managed/tables/hive/table40 +wasb://alpha@foxtrot.blob.core.cloudapi.de/warehouse/hivetables/table41 +wasb://bravo@gopher.blob.core.cloudapi.de/managed/tables/hive/table42 +wasb://charlie@echo.blob.core.cloudapi.de/warehouse/hivetables/table43 +wasb://delta@foxtrot.blob.core.cloudapi.de/managed/tables/hive/table44 +wasb://alpha@gopher.blob.core.cloudapi.de/warehouse/hivetables/table45 +wasbs://bravo@echo.blob.core.cloudapi.de/managed/tables/hive/table46 +wasbs://charlie@foxtrot.blob.core.cloudapi.de/warehouse/hivetables/table47 +wasbs://delta@gopher.blob.core.cloudapi.de/managed/tables/hive/table48 +wasbs://alpha@echo.blob.core.cloudapi.de/warehouse/hivetables/table49 +wasbs://bravo@foxtrot.blob.core.cloudapi.de/managed/tables/hive/table50 +abfs://charlie@gopher.dfs.core.cloudapi.de/warehouse/hivetables/table51 +abfs://delta@echo.dfs.core.cloudapi.de/managed/tables/hive/table52 +abfs://alpha@foxtrot.dfs.core.cloudapi.de/warehouse/hivetables/table53 +abfs://bravo@gopher.dfs.core.cloudapi.de/managed/tables/hive/table54 +abfs://charlie@echo.dfs.core.cloudapi.de/warehouse/hivetables/table55 +abfss://delta@foxtrot.dfs.core.cloudapi.de/managed/tables/hive/table56 +abfss://alpha@gopher.dfs.core.cloudapi.de/warehouse/hivetables/table57 +abfss://bravo@echo.dfs.core.cloudapi.de/managed/tables/hive/table58 +abfss://charlie@foxtrot.dfs.core.cloudapi.de/warehouse/hivetables/table59 +abfss://delta@gopher.dfs.core.cloudapi.de/managed/tables/hive/table60 +wasb://alpha@echo.blob.core.windows.net/warehouse/hivetables/table61 +wasb://bravo@foxtrot.blob.core.windows.net/managed/tables/hive/table62 +wasb://charlie@gopher.blob.core.windows.net/warehouse/hivetables/table63 
+wasb://delta@echo.blob.core.windows.net/managed/tables/hive/table64 +wasb://alpha@foxtrot.blob.core.windows.net/warehouse/hivetables/table65 +wasbs://bravo@gopher.blob.core.windows.net/managed/tables/hive/table66 +wasbs://charlie@echo.blob.core.windows.net/warehouse/hivetables/table67 +wasbs://delta@foxtrot.blob.core.windows.net/managed/tables/hive/table68 +wasbs://alpha@gopher.blob.core.windows.net/warehouse/hivetables/table69 +wasbs://bravo@echo.blob.core.windows.net/managed/tables/hive/table70 +abfs://charlie@foxtrot.dfs.core.windows.net/warehouse/hivetables/table71 +abfs://delta@gopher.dfs.core.windows.net/managed/tables/hive/table72 +abfs://alpha@echo.dfs.core.windows.net/warehouse/hivetables/table73 +abfs://bravo@foxtrot.dfs.core.windows.net/managed/tables/hive/table74 +abfs://charlie@gopher.dfs.core.windows.net/warehouse/hivetables/table75 +abfss://delta@echo.dfs.core.windows.net/managed/tables/hive/table76 +abfss://alpha@foxtrot.dfs.core.windows.net/warehouse/hivetables/table77 +abfss://bravo@gopher.dfs.core.windows.net/managed/tables/hive/table78 +abfss://charlie@echo.dfs.core.windows.net/warehouse/hivetables/table79 +abfss://delta@foxtrot.dfs.core.windows.net/managed/tables/hive/table80 +adl://alpha@gopher.azuredatalakestore.net/warehouse/hivetables/table81 +adl://bravo@echo.azuredatalakestore.net/managed/tables/hive/table82 +adl://charlie@foxtrot.azuredatalakestore.net/warehouse/hivetables/table83 +adl://delta@gopher.azuredatalakestore.net/managed/tables/hive/table84 +adl://alpha@echo.azuredatalakestore.net/warehouse/hivetables/table85 \ No newline at end of file