Commit

Release/511 (#676)

* optimized EMR imports for less chance of import errors

* bump versions

* update release notes

* drop deprecated setup.py

* Endpoint tests (#637)

* example databricks serve notebooks and cluster creation

* updated healthchecks

* databricks serve + john snow labs endpoints module

* `clean_cluster` and `write_db_credentials` parameters for db cluster creation

* databricks serve + john snow labs endpoints docs + typo fixes

* databricks serve + john snow labs endpoints docs + typo fixes

* updated tests

* updated tests

* fix block_until_deployed bug

* fix block_until_deployed bug

* support for GPU and nlu.predict params in endpoints

* bump versions

* Docs update

* update notebooks

* endpoint test job generator

* improved list_db_runtime_versions

* multi cluster testing endpoints

* Support for submitting notebook to databricks and various utils & refactor to databricks utils

* db test refactor into db_info/endpoint/hdfs/submit tests

* get_or_create_test_cluster() and various more db testing utils, pytest.ini and non-verbose parameterization of tests

* add docs for notebook execution
C-K-Loan authored Oct 1, 2023
1 parent 05d8457 commit a4c0f4f
Showing 21 changed files with 954 additions and 450 deletions.
32 changes: 32 additions & 0 deletions docs/en/jsl/databricks_utils.md
@@ -100,6 +100,38 @@ And after a while you can see the results
![databricks_cluster_submit_raw.png](/assets/images/jsl_lib/databricks_utils/submit_raw_str_result.png)


### Run a local Python Notebook in Databricks

Provide the path to a notebook on your local machine; it will be copied to HDFS and executed by the Databricks cluster.
You also need to provide a destination path in your Workspace to which the notebook will be copied and where you have write access.
A common pattern that should work is `/Users/<[email protected]>/test.ipynb`.

```python
local_nb_path = "path/to/my/notebook.ipynb"
remote_dst_path = "/Users/[email protected]/test.ipynb"

# notebook.ipynb will run on Databricks; a URL to monitor the run will be printed.
# `host` and `token` are the URL and access token of your Databricks workspace.
nlp.run_in_databricks(
    local_nb_path,
    databricks_host=host,
    databricks_token=token,
    run_name="Notebook Test",
    dst_path=remote_dst_path,
)
```

This could be your input notebook:

![databricks_cluster_submit_notebook.png](/assets/images/jsl_lib/databricks_utils/submit_notebook.png)

A URL where you can monitor the run will be printed; it will look like this:

![databricks_cluster_submit_notebook_result.png](/assets/images/jsl_lib/databricks_utils/submit_notebook_result.png)


### Run a Python Function in Databricks

Define a function; it will be written to a local file, copied to HDFS, and executed by the Databricks cluster, as sketched below.
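
A minimal sketch of the call could look like this (assuming `host` and `token` hold your workspace URL and access token, as in the notebook example above; the function body is illustrative):

```python
def my_function():
    # This body is written to a file, copied to the cluster, and executed there
    print("Hello from Databricks!")

# my_function will run on the Databricks cluster; a URL to monitor the run is printed
nlp.run_in_databricks(
    my_function,
    databricks_host=host,
    databricks_token=token,
    run_name="Function Test",
)
```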
62 changes: 62 additions & 0 deletions docs/en/jsl/jsl_release_notes.md
@@ -15,6 +15,27 @@ sidebar:

See [Github Releases](https://github.com/JohnSnowLabs/johnsnowlabs/releases) for detailed information on Release History and Features.



## 5.1.1
Release date: 01-10-2023

The John Snow Labs 5.1.1 Library released with the following pre-installed and recommended dependencies


| Library | Version |
|---------------------------------------------------------------------------------|---------|
| [Visual NLP](https://nlp.johnsnowlabs.com/docs/en/spark_ocr_versions/ocr_release_notes) | `5.0.1` |
| [Enterprise NLP](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators) | `5.1.1` |
| [Finance NLP](https://nlp.johnsnowlabs.com/docs/en/financial_release_notes) | `1.X.X` |
| [Legal NLP](https://nlp.johnsnowlabs.com/docs/en/legal_release_notes) | `1.X.X` |
| [NLU](https://github.com/JohnSnowLabs/nlu/releases) | `5.0.1` |
| [Spark-NLP-Display](https://sparknlp.org/docs/en/display) | `4.4` |
| [Spark-NLP](https://github.com/JohnSnowLabs/spark-nlp/releases/) | `5.1.1` |
| [Pyspark](https://spark.apache.org/docs/latest/api/python/) | `3.1.2` |
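
For reference, a minimal sketch of pulling in this release and its recommended dependency stack (assumes a valid license for the licensed libraries):

```python
# pip install johnsnowlabs==5.1.1
from johnsnowlabs import nlp

nlp.install()  # installs the recommended dependency stack listed above
spark = nlp.start()  # starts a Spark session with the installed libraries
```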



## 5.1.0
Release date: 25-09-2023

@@ -91,6 +112,47 @@ The John Snow Labs 5.0.7 Library released with the following pre-installed and recommended dependencies



## 5.0.8
Release date: 11-09-2023

The John Snow Labs 5.0.8 Library released with the following pre-installed and recommended dependencies


| Library | Version |
|---------------------------------------------------------------------------------|---------|
| [Visual NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/ocr_release_notes) | `5.0.0` |
| [Enterprise NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/licensed_annotators) | `5.0.2` |
| [Finance NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/financial_release_notes) | `1.X.X` |
| [Legal NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/legal_release_notes) | `1.X.X` |
| [NLU](https://github.com/JohnSnowLabs/nlu/releases) | `5.0.1` |
| [Spark-NLP-Display](https://nlp.johnsnowlabs.com/docs/en/jsl/display) | `4.4` |
| [Spark-NLP](https://github.com/JohnSnowLabs/spark-nlp/releases/) | `5.0.2` |
| [Pyspark](https://spark.apache.org/docs/latest/api/python/) | `3.1.2` |





## 5.0.7
Release date: 03-09-2023

The John Snow Labs 5.0.7 Library released with the following pre-installed and recommended dependencies


| Library | Version |
|---------------------------------------------------------------------------------|---------|
| [Visual NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/ocr_release_notes) | `5.0.0` |
| [Enterprise NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/licensed_annotators) | `5.0.2` |
| [Finance NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/financial_release_notes) | `1.X.X` |
| [Legal NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/legal_release_notes) | `1.X.X` |
| [NLU](https://github.com/JohnSnowLabs/nlu/releases) | `5.0.0` |
| [Spark-NLP-Display](https://nlp.johnsnowlabs.com/docs/en/jsl/display) | `4.4` |
| [Spark-NLP](https://github.com/JohnSnowLabs/spark-nlp/releases/) | `5.0.2` |
| [Pyspark](https://spark.apache.org/docs/latest/api/python/) | `3.1.2` |




## 5.0.6
Release date: 03-09-2023

119 changes: 108 additions & 11 deletions johnsnowlabs/auto_install/databricks/install_utils.py
@@ -154,7 +154,7 @@ def list_db_runtime_versions(db: DatabricksAPI):
     # pprint(versions)
     for version in versions["versions"]:
         print(version["key"])
-        print(version["name"])
+        name = version["name"]
         # version_regex = r'[0-9].[0-9].[0-9]'
 
         spark_version = re.findall(r"Apache Spark [0-9].[0-9]", version["name"])
@@ -166,14 +166,15 @@
         has_gpu = len(re.findall("GPU", version["name"])) > 0
         if spark_version:
             spark_version = spark_version + ".x"
-            print(LibVersion(spark_version).as_str(), has_gpu, scala_version)
-
-
-def list_clusters(db: DatabricksAPI):
-    clusters = db.cluster.list_clusters(headers=None)
-    pprint(clusters)
-    print(clusters)
-    return clusters
+            version = LibVersion(spark_version).as_str()
+            print(
+                f"name={name}\n"
+                f"version={version}\n"
+                f"has_gpu={has_gpu}\n"
+                f"scala_version={scala_version}\n"
+                f"spark_version={spark_version}\n"
+                f"{'=' * 25}"
+            )
 
 
 def list_cluster_lib_status(db: DatabricksAPI, cluster_id: str):
@@ -388,13 +389,23 @@ def copy_lib_to_dbfs_cluster(
     return copy_from_local_to_hdfs(db, local_path=local_path, dbfs_path=dbfs_path)
 
 
-def wait_till_cluster_running(db: DatabricksAPI, cluster_id: str):
+def wait_till_cluster_running(db: DatabricksAPI, cluster_id: str, timeout=900):
     # https://docs.databricks.com/dev-tools/api/latest/clusters.html#clusterclusterstate
     import time
 
-    while 1:
+    start_time = time.time()
+
+    while True:
+        elapsed_time = time.time() - start_time
+        if elapsed_time > timeout:
+            print(
+                "Timeout reached while waiting for the cluster to be running. Check cluster UI."
+            )
+            return False
+
         time.sleep(5)
         status = DatabricksClusterStates(db.cluster.get_cluster(cluster_id)["state"])
+
         if status == DatabricksClusterStates.RUNNING:
             return True
         elif status in [
@@ -414,3 +425,89 @@ def wait_till_cluster_running(db: DatabricksAPI, cluster_id: str):
 
 def restart_cluster(db: DatabricksAPI, cluster_id: str):
     db.cluster.restart_cluster(cluster_id=cluster_id)
+
+
+####
+def list_clusters(db: DatabricksAPI):
+    """Lists all clusters."""
+    clusters = db.cluster.list_clusters()
+    return clusters
+
+
+def cluster_exist_with_name_and_runtime(
+    db: DatabricksAPI, name: str, runtime: str
+) -> bool:
+    """Checks if a cluster with a specific name and runtime exists."""
+    clusters = list_clusters(db)
+    for cluster in clusters.get("clusters", []):
+        if cluster["cluster_name"] == name and runtime in cluster["spark_version"]:
+            return True
+    return False
+
+
+def does_cluster_exist_with_id(db: DatabricksAPI, cluster_id: str) -> bool:
+    """Checks if a cluster with a specific ID exists."""
+    try:
+        cluster_info = db.cluster.get_cluster(cluster_id=cluster_id)
+        if cluster_info and "cluster_id" in cluster_info:
+            return True
+    except Exception as e:
+        # Handle specific exceptions based on the Databricks API error responses
+        pass
+    return False
+
+
+def get_cluster_id(db: DatabricksAPI, cluster_name: str, runtime: str) -> str:
+    """
+    Retrieves the cluster ID based on the cluster name and runtime.
+    If there are multiple candidates:
+    - Returns any that's in the 'RUNNING' state if there is one.
+    - If none are in the 'RUNNING' state, returns any.
+    """
+    clusters = list_clusters(db)
+    running_clusters = []
+    other_clusters = []
+
+    for cluster in clusters.get("clusters", []):
+        if (
+            cluster["cluster_name"] == cluster_name
+            and runtime in cluster["spark_version"]
+        ):
+            if cluster["state"] == "RUNNING":
+                running_clusters.append(cluster["cluster_id"])
+            else:
+                other_clusters.append(cluster["cluster_id"])
+
+    if running_clusters:
+        return running_clusters[0]
+    elif other_clusters:
+        return other_clusters[0]
+    else:
+        raise Exception(
+            f"No cluster found with name {cluster_name} and runtime {runtime}"
+        )
+
+
+def _get_cluster_id(db: DatabricksAPI, cluster_name: str) -> str:
+    """
+    Retrieves the cluster ID based on the cluster name.
+    """
+    clusters = list_clusters(db)
+    running_clusters = []
+    other_clusters = []
+
+    for cluster in clusters.get("clusters", []):
+        if cluster["cluster_name"] == cluster_name:
+            # if cluster["state"] != "TERMINATED":
+            other_clusters.append(cluster["cluster_id"])
+
+    if running_clusters:
+        return running_clusters[0]
+    elif other_clusters:
+        return other_clusters[0]
+    else:
+        return None  # Return None if no cluster found with the given name
+
+
+#############
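
A minimal usage sketch for the new cluster helpers (workspace URL, token, and cluster name below are placeholder values):

```python
from databricks_api import DatabricksAPI

# Placeholder credentials for a hypothetical workspace
db = DatabricksAPI(host="https://my-workspace.cloud.databricks.com", token="dapi...")

cluster_id = _get_cluster_id(db, "my-test-cluster")
if cluster_id is None:
    raise RuntimeError("No cluster named 'my-test-cluster' found")

# Wait up to 10 minutes (instead of the default 900 seconds) for the RUNNING state
if wait_till_cluster_running(db, cluster_id, timeout=600):
    print(f"Cluster {cluster_id} is running")
```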