Commit

Release/511 (#676)

* optimized EMR imports for less chance of import errors

* bump versions

* update release notes

* drop deprecated setup.py

* Endpoint tests (#637)

* example databricks serve notebooks and cluster creation

* updated healthchecks

* databricks serve + john snow labs endpoints module

* `clean_cluster` and `write_db_credentials` parameters for db cluster creation

* databricks serve + john snow labs endpoints docs + typo fixes

* databricks serve + john snow labs endpoints docs + typo fixes

* updated tests

* updated tests

* fix block_until_deployed bug

* fix block_until_deployed bug

* support for GPU and nlu.predict params in endpoints

* bump versions

* Docs update

* update notebooks

* endpoint test job generator

* improved list_db_runtime_versions

* multi cluster testing endpoints

* Support for submitting notebook to databricks and various utils & refactor to databricks utils

* db test refactor into db_info/endpoint/hdfs/submit tests

* get_or_create_test_cluster() and various more db testing utils, pytest.ini and non-verbose parameterization of tests

* add docs for notebook execution
C-K-Loan authored Oct 1, 2023
1 parent 05d8457 commit a4c0f4f
Showing 21 changed files with 954 additions and 450 deletions.
32 changes: 32 additions & 0 deletions docs/en/jsl/databricks_utils.md
@@ -100,6 +100,38 @@ And after a while you can see the results
![databricks_cluster_submit_raw.png](/assets/images/jsl_lib/databricks_utils/submit_raw_str_result.png)


### Run a local Python Notebook in Databricks

Provide the path to a notebook on your local machine; it will be copied to HDFS and executed by the Databricks cluster.
You also need to provide a destination path in your Workspace to which the notebook will be copied and where you have write access.
A common pattern that should work is `/Users/<[email protected]>/test.ipynb`.

```python
local_nb_path = "path/to/my/notebook.ipynb"
remote_dst_path = "/Users/[email protected]/test.ipynb"

# notebook.ipynb will run on Databricks; a URL to monitor the run will be printed.
# `host` and `token` are the URL and access token of your Databricks workspace.
nlp.run_in_databricks(
    local_nb_path,
    databricks_host=host,
    databricks_token=token,
    run_name="Notebook Test",
    dst_path=remote_dst_path,
)
```

This could be your input notebook:

![databricks_cluster_submit_notebook.png](/assets/images/jsl_lib/databricks_utils/submit_notebook.png)

A URL where you can monitor the run will be printed; it will look like this:

![databricks_cluster_submit_notebook_result.png](/assets/images/jsl_lib/databricks_utils/submit_notebook_result.png)


### Run a Python Function in Databricks

Define a function; it will be written to a local file, copied to HDFS, and executed by the Databricks cluster, as sketched below.
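
A minimal sketch of the call could look like this (assuming `host` and `token` hold your workspace URL and access token, as in the notebook example above; the function body is illustrative):

```python
def my_function():
    # This body is written to a file, copied to the cluster, and executed there
    print("Hello from Databricks!")

# my_function will run on the Databricks cluster; a URL to monitor the run is printed
nlp.run_in_databricks(
    my_function,
    databricks_host=host,
    databricks_token=token,
    run_name="Function Test",
)
```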
62 changes: 62 additions & 0 deletions docs/en/jsl/jsl_release_notes.md
@@ -15,6 +15,27 @@ sidebar:

See [Github Releases](https://github.com/JohnSnowLabs/johnsnowlabs/releases) for detailed information on Release History and Features.



## 5.1.1
Release date: 01-10-2023

The John Snow Labs 5.1.1 Library released with the following pre-installed and recommended dependencies


| Library | Version |
|---------------------------------------------------------------------------------|---------|
| [Visual NLP](https://nlp.johnsnowlabs.com/docs/en/spark_ocr_versions/ocr_release_notes) | `5.0.1` |
| [Enterprise NLP](https://nlp.johnsnowlabs.com/docs/en/licensed_annotators) | `5.1.1` |
| [Finance NLP](https://nlp.johnsnowlabs.com/docs/en/financial_release_notes) | `1.X.X` |
| [Legal NLP](https://nlp.johnsnowlabs.com/docs/en/legal_release_notes) | `1.X.X` |
| [NLU](https://github.com/JohnSnowLabs/nlu/releases) | `5.0.1` |
| [Spark-NLP-Display](https://sparknlp.org/docs/en/display) | `4.4` |
| [Spark-NLP](https://github.com/JohnSnowLabs/spark-nlp/releases/) | `5.1.1` |
| [Pyspark](https://spark.apache.org/docs/latest/api/python/) | `3.1.2` |
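
For reference, a minimal sketch of pulling in this release and its recommended dependency stack (assumes a valid license for the licensed libraries):

```python
# pip install johnsnowlabs==5.1.1
from johnsnowlabs import nlp

nlp.install()  # installs the recommended dependency stack listed above
spark = nlp.start()  # starts a Spark session with the installed libraries
```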



## 5.1.0
Release date: 25-09-2023

@@ -91,6 +112,47 @@ The John Snow Labs 5.0.7 Library released with the following pre-installed and recommended dependencies



## 5.0.8
Release date: 11-09-2023

The John Snow Labs 5.0.8 Library released with the following pre-installed and recommended dependencies


| Library | Version |
|---------------------------------------------------------------------------------|---------|
| [Visual NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/ocr_release_notes) | `5.0.0` |
| [Enterprise NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/licensed_annotators) | `5.0.2` |
| [Finance NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/financial_release_notes) | `1.X.X` |
| [Legal NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/legal_release_notes) | `1.X.X` |
| [NLU](https://github.com/JohnSnowLabs/nlu/releases) | `5.0.1` |
| [Spark-NLP-Display](https://nlp.johnsnowlabs.com/docs/en/jsl/display) | `4.4` |
| [Spark-NLP](https://github.com/JohnSnowLabs/spark-nlp/releases/) | `5.0.2` |
| [Pyspark](https://spark.apache.org/docs/latest/api/python/) | `3.1.2` |





## 5.0.7
Release date: 03-09-2023

The John Snow Labs 5.0.7 Library released with the following pre-installed and recommended dependencies


| Library | Version |
|---------------------------------------------------------------------------------|---------|
| [Visual NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/ocr_release_notes) | `5.0.0` |
| [Enterprise NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/licensed_annotators) | `5.0.2` |
| [Finance NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/financial_release_notes) | `1.X.X` |
| [Legal NLP](https://nlp.johnsnowlabs.com/docs/en/jsl/legal_release_notes) | `1.X.X` |
| [NLU](https://github.com/JohnSnowLabs/nlu/releases) | `5.0.0` |
| [Spark-NLP-Display](https://nlp.johnsnowlabs.com/docs/en/jsl/display) | `4.4` |
| [Spark-NLP](https://github.com/JohnSnowLabs/spark-nlp/releases/) | `5.0.2` |
| [Pyspark](https://spark.apache.org/docs/latest/api/python/) | `3.1.2` |




## 5.0.6
Release date: 03-09-2023

119 changes: 108 additions & 11 deletions johnsnowlabs/auto_install/databricks/install_utils.py
@@ -154,7 +154,7 @@ def list_db_runtime_versions(db: DatabricksAPI):
     # pprint(versions)
     for version in versions["versions"]:
         print(version["key"])
-        print(version["name"])
+        name = version["name"]
         # version_regex = r'[0-9].[0-9].[0-9]'
 
         spark_version = re.findall(r"Apache Spark [0-9].[0-9]", version["name"])
@@ -166,14 +166,15 @@
         has_gpu = len(re.findall("GPU", version["name"])) > 0
         if spark_version:
             spark_version = spark_version + ".x"
-            print(LibVersion(spark_version).as_str(), has_gpu, scala_version)
-
-
-def list_clusters(db: DatabricksAPI):
-    clusters = db.cluster.list_clusters(headers=None)
-    pprint(clusters)
-    print(clusters)
-    return clusters
+            version = LibVersion(spark_version).as_str()
+            print(
+                f"name={name}\n"
+                f"version={version}\n"
+                f"has_gpu={has_gpu}\n"
+                f"scala_version={scala_version}\n"
+                f"spark_version={spark_version}\n"
+                f"{'=' * 25}"
+            )
 
 
 def list_cluster_lib_status(db: DatabricksAPI, cluster_id: str):
@@ -388,13 +389,23 @@ def copy_lib_to_dbfs_cluster(
     return copy_from_local_to_hdfs(db, local_path=local_path, dbfs_path=dbfs_path)
 
 
-def wait_till_cluster_running(db: DatabricksAPI, cluster_id: str):
+def wait_till_cluster_running(db: DatabricksAPI, cluster_id: str, timeout=900):
     # https://docs.databricks.com/dev-tools/api/latest/clusters.html#clusterclusterstate
     import time
 
-    while 1:
+    start_time = time.time()
+
+    while True:
+        elapsed_time = time.time() - start_time
+        if elapsed_time > timeout:
+            print(
+                "Timeout reached while waiting for the cluster to be running. Check cluster UI."
+            )
+            return False
+
         time.sleep(5)
         status = DatabricksClusterStates(db.cluster.get_cluster(cluster_id)["state"])
+
         if status == DatabricksClusterStates.RUNNING:
             return True
         elif status in [
@@ -414,3 +425,89 @@ def wait_till_cluster_running(db: DatabricksAPI, cluster_id: str):
 
 def restart_cluster(db: DatabricksAPI, cluster_id: str):
     db.cluster.restart_cluster(cluster_id=cluster_id)
+
+
+####
+def list_clusters(db: DatabricksAPI):
+    """Lists all clusters."""
+    clusters = db.cluster.list_clusters()
+    return clusters
+
+
+def cluster_exist_with_name_and_runtime(
+    db: DatabricksAPI, name: str, runtime: str
+) -> bool:
+    """Checks if a cluster with a specific name and runtime exists."""
+    clusters = list_clusters(db)
+    for cluster in clusters.get("clusters", []):
+        if cluster["cluster_name"] == name and runtime in cluster["spark_version"]:
+            return True
+    return False
+
+
+def does_cluster_exist_with_id(db: DatabricksAPI, cluster_id: str) -> bool:
+    """Checks if a cluster with a specific ID exists."""
+    try:
+        cluster_info = db.cluster.get_cluster(cluster_id=cluster_id)
+        if cluster_info and "cluster_id" in cluster_info:
+            return True
+    except Exception as e:
+        # Handle specific exceptions based on the Databricks API error responses
+        pass
+    return False
+
+
+def get_cluster_id(db: DatabricksAPI, cluster_name: str, runtime: str) -> str:
+    """
+    Retrieves the cluster ID based on the cluster name and runtime.
+    If there are multiple candidates:
+    - Returns any that's in the 'RUNNING' state if there is one.
+    - If none are in the 'RUNNING' state, returns any.
+    """
+    clusters = list_clusters(db)
+    running_clusters = []
+    other_clusters = []
+
+    for cluster in clusters.get("clusters", []):
+        if (
+            cluster["cluster_name"] == cluster_name
+            and runtime in cluster["spark_version"]
+        ):
+            if cluster["state"] == "RUNNING":
+                running_clusters.append(cluster["cluster_id"])
+            else:
+                other_clusters.append(cluster["cluster_id"])
+
+    if running_clusters:
+        return running_clusters[0]
+    elif other_clusters:
+        return other_clusters[0]
+    else:
+        raise Exception(
+            f"No cluster found with name {cluster_name} and runtime {runtime}"
+        )
+
+
+def _get_cluster_id(db: DatabricksAPI, cluster_name: str) -> str:
+    """
+    Retrieves the cluster ID based on the cluster name.
+    """
+    clusters = list_clusters(db)
+    running_clusters = []
+    other_clusters = []
+
+    for cluster in clusters.get("clusters", []):
+        if cluster["cluster_name"] == cluster_name:
+            # if cluster["state"] != "TERMINATED":
+            other_clusters.append(cluster["cluster_id"])
+
+    if running_clusters:
+        return running_clusters[0]
+    elif other_clusters:
+        return other_clusters[0]
+    else:
+        return None  # Return None if no cluster found with the given name
+
+
+#############
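
A minimal usage sketch for the new cluster helpers (workspace URL, token, and cluster name below are placeholder values):

```python
from databricks_api import DatabricksAPI

# Placeholder credentials for a hypothetical workspace
db = DatabricksAPI(host="https://my-workspace.cloud.databricks.com", token="dapi...")

cluster_id = _get_cluster_id(db, "my-test-cluster")
if cluster_id is None:
    raise RuntimeError("No cluster named 'my-test-cluster' found")

# Wait up to 10 minutes (instead of the default 900 seconds) for the RUNNING state
if wait_till_cluster_running(db, cluster_id, timeout=600):
    print(f"Cluster {cluster_id} is running")
```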