From 17c9e37f18119326bddd3973057baa415b6ad499 Mon Sep 17 00:00:00 2001 From: Daniel Sollien <62246179+daniel-sol@users.noreply.github.com> Date: Fri, 14 Jul 2023 11:31:03 +0200 Subject: [PATCH] Add better explanations to tables.ipynb (#195) --- examples/tables.ipynb | 401 +++++++++++++++++++++++++++++++++--------- 1 file changed, 313 insertions(+), 88 deletions(-) diff --git a/examples/tables.ipynb b/examples/tables.ipynb index 2bb9023e..5db0c801 100644 --- a/examples/tables.ipynb +++ b/examples/tables.ipynb @@ -8,6 +8,7 @@ "source": [ "import time\n", "import pandas as pd\n", + "import pyarrow as pa\n", "from fmu.sumo.explorer import Explorer, AggregatedTable\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", @@ -46,7 +47,8 @@ "outputs": [], "source": [ - "# Get case by name (name is not guaranteed to be unique)\n", - "case = sumo.cases.filter(name=\"drogon_ahm-2023-02-22\")[0]\n" + "# Get case by uuid (a uuid identifies exactly one case)\n", + "case = sumo.cases.filter(uuid=\"5e6bd69f-eaa2-49b7-b323-62a84d533051\")[0]\n", + "case.name\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finding info about tables connected to case\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Sumo a large share of the data is stored as tables. Any data that does not have a spatial reference will have to be in this format.
\n", + "This means that this is the solution for datatypes such as inplace volumes, and summary data from reservoir simulators.
\n", + "Even data that has spacial reference might be stored as a table. So to be general one can say that as long as the data is in a table like format,
like in an excel or oocalc spreadsheet/pandas dataframe/arrow table or something similar you can upload it to sumo as a table.
\n", + "\n", + "When it comes to metadata there is a very logical triplet for surfaces/polygons/points, these are name/tagname/content
\n", + "This is very influenced by how these are stored in rms, where they refer to name of horizon/representation/and content type.
\n", + "For example for a surface the name could be BCU (Base Cretaceous unconformity), the representation could be time or depth, and the content could be a
2D grid or point sets. \n", + "\n", + "For tables this is a bit more unclear, so one will for example find tables that have an empty tag name.
\n", + "Whatever there is of a convention here for now can be described as:
\n", + "\n", + "**inplace volumes coming out of rms**
\n", + "name: name of grid in rms
\n", + "tagname: vol
\n", + "content: volumes
\n", + "\n", + "**tables from eclipse** (results extracted with package ecl2df)
\n", + "name: name of datafile (but no -)
\n", + "tagname: datatype as according to ecl2df
\n", + "content: summary data will get timeseries, but for now the rest will get property
\n", + "There is a suggestion (issue in the fmu-dataio repo) that basically the content will be same as the tagname, meaning rft will be content rft
\n", + "pvt will be pvt etc. But here there are inconsistencies, e.g. relperm data is called satfunc in ecl2df, and it seems more logical to use relperm,
\n", + "or relativepermeability for this type. Here it would be good with input from the domain experts.\n", + " for name the convention adopted is that for data coming out of
\n", + "\n", + "**Any other table**
\n", + "Here there is really anarchy, so it is up to the end user to define this themselves when exporting with fmu-dataio.\n" + ] + }, { "cell_type": "code", "execution_count": null, @@ -67,7 +102,60 @@ "source": [ "tables = case.tables\n", "print(f\"Table names: {tables.names}\")\n", - "print(f\"Table tags: {tables.tagnames}\")" + "print(f\"Table tags: {tables.tagnames}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Fetching one table" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Most tables are stored during the running of an fmu run. So you will in that case have one version per realization (i.e per file path realization-n/iter-m)
\n", + "This means that it will be a slow process to get to results from one ensemble, meaning one iteration.
\n", + "Below is an example of how to fetch one table. General syntax:
\n", + "``table_collection = case.tables.filter(realization=, iteration=,``
\n", + "                     ``name=, tagname=)``
\n", + "Any of the input arguements are optional, so you could end up with more than one resulting table, and in theory even if you have
\n", + "used all arguements you could end up with several (long story, please ask). But you can check how many tables you have with using
\n", + "``len(table_collection)``
\n", + "To get to one table you can then access them using indexing as you would with a list in python:
\n", + "``table = table_collection[0]``
\n", + "If you want to access the actual object, you can access this with the methods to_pandas or to_arrow, and you also have access to the
\n", + "metadata via indexing as you would do a dictionary. The entire metadata can be access with using the attribute _metadata\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Filter using the key\n", + "one_table = tables.filter(realization=0, iteration=\"iter-0\", name=\"DROGON\", tagname=\"compdat\")[0]\n", + "# Give back the name and tag\n", + "print(f\"Found table {one_table.name}-{one_table.tagname}\")\n", + "# fetching the actual table as a pandas dataframe\n", + "print(one_table.name)\n", + "print(one_table.to_pandas.head())\n", + "# Access to the metadata\n", + "# If you know the metadata fields you can access the data directly on the object with square brackets\n", + "print(one_table[\"data\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since it is slow to aggregate the tables yourself, and that the tables then would become very large Sumo comes with a service that aggregates the tables, and then splits them up by column. We refer to these objects as aggregated tables even though they are both aggregated and split up. These come in different types, but the most general is the collection, where you get all realizations stacked on top of each other,
but you also have access to statistical aggregations such as mean, min,max, std, p10, and p90
\n", + "\n", + "Below is described how one uses these." ] }, { @@ -86,7 +174,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "##### The filtering way" + "### The filtering way\n", + "\n", + "General syntax:
\n", + "``selection = case.tables.filter(name=, tagname=, column=, aggregation=)``\n", + "\n", + "All these are optional as explained above, but you have to have the aggregation argument to get to an aggregated object.
\n", + "And if you leave out some of them you will end up with a collection of them, not just one" ] }, { @@ -94,7 +188,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "##### Getting one aggregated table" + "##### Filtering to get to all aggregated tables with name of data file, for the drogon case it is called DROGON" ] }, { @@ -103,8 +197,8 @@ "metadata": {}, "outputs": [], "source": [ - "table = tables.filter(name=\"summary\", tagname=\"eclipse\", iteration=\"iter-0\", aggregation=\"collection\", column=\"FOPT\")[0]\n", - "table.to_pandas.head()" + "sim_tables = tables.filter(name=\"DROGON\", iteration=\"iter-0\", aggregation=\"collection\")\n", + "sim_tables.tagnames" ] }, { @@ -112,7 +206,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "##### Access to the global variables the filtering way" + "## RFT" ] }, { @@ -121,17 +215,15 @@ "metadata": {}, "outputs": [], "source": [ - "# This functionality has been deactivated for now, will come back in next komodo release\n", - "# pd.DataFrame(table[\"fmu\"][\"iteration\"][\"parameters\"][\"GLOBVAR\"])\n", - "\n" + "rft_tables = sim_tables.filter(tagname=\"rft\")\n" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "### For even more user friendly access to summary data\n" + "The rft table object now contains all the aggregated tables relating to the rft tag.
\n", + "The different object names you can access via the columns attribute, sort of similar to a pandas dataframe" ] }, { @@ -140,18 +232,15 @@ "metadata": {}, "outputs": [], "source": [ - "# Get case surfaces\n", - "summary = AggregatedTable(case, \"summary\", \"eclipse\", \"iter-0\")\n", - "summary.parameters\n", - "\n" + "rft_tables.columns" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "##### When you have read the parameters once, it will be faster, kept in memory of instance" + "To get to one spesific table you use the filtering option column=,
and fetch the element just like in a list
\n", + "Which gives access to one object, but this can be accessed both as pandas dataframe and pyarrow table" ] }, { @@ -160,15 +249,18 @@ "metadata": {}, "outputs": [], "source": [ - "summary.parameters" + "\n", + "pressure = rft_tables.filter(column=\"PRESSURE\")[0]\n", + "frame = pressure.to_pandas\n", + "print(f\"The following columns are in the pressure object {frame.columns.to_list()}\")\n", + "\n" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "##### Quite a lot of data in the global variables" + "#### After this it is easy to make a plot" ] }, { @@ -178,19 +270,38 @@ "outputs": [], "source": [ "\n", - "total_len= 0\n", - "for group_name in summary.parameters:\n", - " length = len(summary.parameters[group_name])\n", - " total_len += length\n", - " print(f\"{group_name} : {length}\")\n", - " if length != 100:\n", - " for var_name in summary.parameters[group_name]:\n", - " sub_length = len(summary.parameters[group_name][var_name])\n", - " print(f\" {var_name}: {sub_length}\")\n", - " total_len += sub_length\n", + "names = frame.WELL.unique()\n", + "dates = frame.DATE.unique()\n", + "fig, plots = plt.subplots(len(dates), len(names))\n", + "\n", + "\n", + "for i, date in enumerate(dates):\n", + " for j, well in enumerate(names):\n", + " data = frame.loc[(frame.DATE == date) & (frame.WELL == well)].sort_values(by=\"DEPTH\")\n", + " ax = plots[i, j]\n", + " if data.empty:\n", + " #get current axes\n", + " ax = plots[i, j]\n", + "\n", + " #hide x-axis\n", + " ax.get_xaxis().set_visible(False)\n", + "\n", + " #hide y-axis \n", + " ax.get_yaxis().set_visible(False)\n", + " ax.axis(\"off\")\n", + " else:\n", + " data[[\"DEPTH\", \"PRESSURE\"]].plot(ax=ax, x=\"PRESSURE\", y=\"DEPTH\")\n", + " ax.get_legend().remove()\n", + " if i == 0:\n", + " ax.set_title(well)\n", + " \n", + " ax.invert_yaxis()\n", + " \n", + " \n", + " \n", + "plt.show()\n", " \n", - "print(f\"{total_len} in total\")\n", - " " + " \n" ] }, { @@ -198,9 +309,28 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "##### Access to global variables\n", + "### For even more user friendly access to summary data\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "##### Use the class Aggregated Table \n", + "\n", + "``COLLECTION = AggregatedTable(case, , )``\n", + "\n", + "Here ```` can be of collection, index, mean, min, max, p10 or p90\n", "\n", - "Calculate CV (coefficient of variation) for all global variables to see which ones are varied the most" + "This class gives you the aggregated tables that share name and tagname for one iteration as one object,
\n", + "so you don't need to know that what you are dealing with is a collection of objects\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Grid" ] }, { @@ -209,21 +339,16 @@ "metadata": {}, "outputs": [], "source": [ - "\n", - "globals = pd.DataFrame(summary.parameters[\"GLOBVAR\"])\n", - "std = globals.std()\n", - "mean = globals.mean()\n", - "selection = (mean > 0) & (std > 0)\n", - "cv = 100 * std.loc[selection] / mean.loc[selection]\n", - "cv.sort_values(ascending=False).round(2) " + "GRID = AggregatedTable(case, \"DROGON\", \"grid\", \"iter-0\")\n", + "GRID[\"PORO\"].to_pandas.plot(kind=\"hist\", y=\"PORO\")\n", + "GRID[\"PERMX\"].to_pandas.head()" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "### Add column with global var" + "### Equil" ] }, { @@ -232,17 +357,17 @@ "metadata": {}, "outputs": [], "source": [ - "FOPT = summary[\"FOPT\"].to_pandas\n", - "FOPT[\"RELPERM_INT_WO\"] = FOPT[\"REAL\"].replace(globals[\"RELPERM_INT_WO\"])\n", - "FOPT.head()" + "EQUIL = AggregatedTable(case, \"DROGON\", \"equil\", \"iter-0\")\n", + "CONTACT_TYPE = \"OWC\"\n", + "sns.boxplot(pd.pivot_table(EQUIL[CONTACT_TYPE].to_pandas, index=\"REAL\", columns=\"EQLNUM\", values=CONTACT_TYPE).values)\n", + "plt.show()" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "### Plot" + "### RELPERM" ] }, { @@ -251,17 +376,18 @@ "metadata": {}, "outputs": [], "source": [ - "sns.lineplot(data=FOPT, x=\"DATE\", y=\"FOPT\", size=\"REAL\", hue=\"RELPERM_INT_WO\", legend=False)\n", - "plt.xticks(rotation=45)\n", - "plt.show()" + "RELPERM = AggregatedTable(case, \"DROGON\", \"satfunc\", \"iter-0\")\n", + "\n", + "\n", + "KRW = pd.concat((RELPERM[\"KRW\"].to_pandas,RELPERM[\"SW\"].to_pandas ), axis=1).T.drop_duplicates().T\n", + "print(KRW.head())\n" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "### If you prefer arrow to pandas" + "#### A plot" ] }, { @@ -270,71 +396,136 @@ "metadata": {}, "outputs": [], "source": [ - "summary[\"FOPT\"].to_arrow.schema" + "ax = sns.lineplot(KRW.loc[(KRW.KEYWORD == \"SWOF\")], x=\"SW\", y=\"KRW\", hue=\"SATNUM\", style=\"REAL\")\n", + "ax.legend(loc=\"right\", ncols=6, bbox_to_anchor=(2.1, 0.5))\n", + "plt.show()" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "### Inplace volumes" + "### Summary" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "scrolled": true - }, + "metadata": {}, "outputs": [], "source": [ - "# Get case surfaces\n", - "inplace = AggregatedTable(case, \"geogrid\", \"vol\", \"iter-0\")\n", "\n", - "inplace[\"STOIIP_OIL\"].to_pandas.groupby([\"ZONE\", \"REAL\"])[\"STOIIP_OIL\"].agg(\"sum\")[\"Therys\"].plot(kind=\"hist\")" + "summary = AggregatedTable(case, \"DROGON\", \"summary\", \"iter-0\")\n", + "VECTOR_NAME = \"FOIP\"\n", + "ax = pd.pivot_table(summary[VECTOR_NAME].to_pandas, index=\"DATE\", columns=\"REAL\", values=VECTOR_NAME).dropna(axis=0).plot()\n", + "ax.get_legend().remove()\n", + "ax.set_label(VECTOR_NAME)\n", + "plt.show()\n", + "\n", + "\n" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "## Access speed\n", - "**NB only works in proper notebook, not via vscode**" + "### Compdat" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "COMPDAT = AggregatedTable(case, \"DROGON\", \"compdat\", iteration=\"iter-0\")\n", + "COMPDAT[\"KH\"].to_pandas" ] }, { - 
"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "### Summary speedtest \n" + "### Wellcompletions" ] }, { "cell_type": "code", "execution_count": null, - "metadata": { - "scrolled": true - }, + "metadata": {}, "outputs": [], "source": [ - "start = time.perf_counter()\n", - "count = 0\n", - "for col_name in summary.columns[:20]:\n", - " vector = summary[col_name]\n", - " print(vector.to_pandas.head(1))\n", - " count += 1\n", - "print(f\"{count} cols in total time: {time.perf_counter() - start: .1f} s\")" + "COMPLETIONS = AggregatedTable(case, \"DROGON\", \"wellcompletiondata\", \"iter-0\")\n", + "KH = COMPLETIONS[\"KH\"].to_pandas\n", + "KH.head()\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "KH[\"ZONE_NR\"] = KH[\"ZONE\"].replace({value: key for key, value in dict(enumerate(KH[\"ZONE\"].unique().tolist())).items()})\n", + "MEAN_STD = pd.pivot_table(KH, index=[\"ZONE_NR\", \"ZONE\"], columns=\"WELL\", values=\"KH\", aggfunc=[\"mean\", \"std\"])\n", + "# KH.head()\n", + "MEAN_STD[(\"mean\", )][\"A1\"]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### A plot" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sns.scatterplot(KH, x=\"WELL\", y=\"ZONE\", hue=\"KH\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### FIPREPORTS" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "REPORTS = AggregatedTable(case, \"DROGON\", \"fipreports\", \"iter-0\")\n", + "print(REPORTS.columns)\n", + "REPORT_NAME = \"STOIIP_OIL\"\n", + "STOIIP = REPORTS[REPORT_NAME].to_pandas.dropna(subset=REPORT_NAME, axis=0)\n", + "STOIIP.head()\n" ] }, { - "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ - "### Inplace speedtest" + "### Faults\n", + "Seems to be something wrong with it, will need to have a look\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "FAULTS = AggregatedTable(case, \"DROGON\", \"faults\", \"iter-0\")\n", + "print(FAULTS.columns)\n", + "COMPLETE = pd.concat((FAULTS[\"I\"].to_pandas,FAULTS[\"J\"].to_pandas, FAULTS[\"K\"].to_pandas))\n", + "COMPLETE.head()" ] }, { @@ -342,20 +533,54 @@ "execution_count": null, "metadata": {}, "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The table index \n", + "\n", + "Throughout the examples of the aggregated tables you have probably noticed that the aggregated tables comes with additonal columns apart from the
\n", + "specific one that you have asked for. E.g. fetching an aggregated table for FOPT from summary will include the DATE column, and if you ask from PORO
from grid you will get these additional elements: [\"GLOBAL_INDEX\", \"I\", \"J\", \"K\"]. This enables you to download only one object and still have everything you need for plotting or analysis straight out of the box. These definitions could be revised though, and here we want input from the users.\n", " E.g. for now satfunc (which is relperm) does not include SW; should that be included? These are the definitions as of now,
\n", + " please give feedback on this!!\n", + "\n", + "DEFINITIONS = {
\n", + "  \"inplace_volumes\": [\"ZONE\", \"REGION\", \"FACIES\", \"LICENCE\"],
\n", + "  \"wellpicks\": [\"WELL\", \"HORIZON\"],
\n", + "  \"summary\": [\"DATE\"],
\n", + "  \"equil\": [\"EQLNUM\"],
\n", + "  \"compdat\": [\"WELL\", \"DATE\", \"DIR\"],
\n", + "  \"faults\": [\"NAME\", \"FACE\"],
\n", + "  \"fipreports\": [\"DATE\", \"FIPNAME\", \"REGION\"],
\n", + "  \"grid\": [\"GLOBAL_INDEX\", \"I\", \"J\", \"K\"],
\n", + "  \"pillars\": [\"PILLAR\"],
\n", + "  \"pvt\": [\"PVTNUM\", \"KEYWORD\"],
\n", + "  \"rft\": [\"WELL\", \"DATE\", \"DEPTH\"],
\n", + "  \"satfunc\": [\"SATNUM\", \"KEYWORD\"],
\n", + "  \"wellcompletiondata\": [\"WELL\", \"DATE\", \"ZONE\"],
\n", + "}" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, "source": [ - "start = time.perf_counter()\n", - "count = 0\n", - "for col_name in inplace.columns[:20]:\n", - " col = inplace[col_name]\n", - " print(col.to_pandas.head(1))\n", - " count += 1\n", - "print(f\"{count} cols in total time: {time.perf_counter() - start: .1f} s\")" + "##### Access to global variables\n", + "\n", + "This is now under reconstruction, so for now you will not have access to these, this is because the global variables where stored as metadata,
\n", + "and working with big datasets as Snorre showed that this was not a good solution. So we are rewriting the code so that they will be stored
\n", + "as metadata.
\n", + "\n", + "Will be fixed in august 2023" ] } ], "metadata": { "kernelspec": { - "display_name": "venv", + "display_name": "3.8.10", "language": "python", "name": "python3" },