🚧 📝 resolve todo in section: Align Selected Data Sets: Align Columns …

…-> Field
devvyn · Aug 28, 2018 · f039321
1 parent be2f8bf
commit f039321
Showing 1 changed file with 45 additions and 28 deletions.
diff --git a/notebook/projects/2016-sweep-vs-tiller/2016-sweep-vs-tiller.ipynb b/notebook/projects/2016-sweep-vs-tiller/2016-sweep-vs-tiller.ipynb
@@ -8473,9 +8473,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "It's clear to my eyes that in most cases, ***site*** and ***crop*** have been concatenated together to produce the value in ***field*** or ***field_name***.\n",
+    "It's clear to my eyes that in some cases, ***site*** and ***crop*** have been concatenated together to produce the value in ***field***.\n",
     "\n",
-    "- ***site*** + ***crop*** = ***field***(***_name***)\n",
+    "- ***site*** + ***crop*** = ***field***\n",
     "\n",
     "Primary data:\n",
     "\n",
@@ -8485,15 +8485,11 @@
     "Mixed data:\n",
     "\n",
     "- ***field***\n",
-    "- ***field_name***\n",
     "\n",
-    "Therefore, in those cases, I can safely disregard those derived columns and rely on the the more normalized forms for indexing. That is to say I think it's most beneficial to use ***crop*** and ***site*** whenever possible."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
+    "Therefore, in samples corresponding to purely derivative ***field*** values, I can safely disregard that derived column and rely on the more normalized ***crop*** and ***site*** for indexing. That is to say, I think it's most beneficial to use ***crop*** and ***site*** rather than ***field*** whenever possible.\n",
+    "\n",
+    "The only complication is the numeric suffix present in some values of ***field***, which delineates areas of the site at which the samples were observed. The numeric values are unique per date-crop-place combination but not unique per site-crop combination, so I'll have to treat the suffix as a separate column that represents an additional dimension to our independent variables.\n",
+    "\n",
     "Extracting the unique pairs of dates and the corresponding numerals from the ***field*** value for that date:"
    ]
   },
@@ -8952,14 +8948,14 @@
     "    )\n",
     "    .drop_duplicates(\n",
     "        subset=[\n",
-    "            ('field', 'number'),\n",
     "            ('date', 'date'),\n",
+    "            ('field', 'number'),\n",
     "        ],\n",
     "    )\n",
     "    .dropna(\n",
     "        subset=[\n",
-    "            ('field', 'number'),\n",
     "            ('date', 'date'),\n",
+    "            ('field', 'number'),  # @todo: needed?\n",
     "        ],\n",
     "    )\n",
     ")\n"
@@ -8969,7 +8965,29 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Based on this look at the information, I'd like to throw away the remaining text portion when I apply this to my data set."
+    "Based on this look at the information, I'd like to throw away the remaining text portion after extracting the numbers when I apply this to my data set."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When it comes to actually using the number portion of the ***field*** column rather than merely displaying it here in the notebook, it's simpler just to assign a new `Series` to the `DataFrame` than it is to non-destructively express the intended outcome."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 65,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for sheet in (hc, s2):\n",
+    "    sheet.field = (\n",
+    "        sheet.field\n",
+    "        .str.extract(pat=r'(?P<text>\\D*)(?P<number>\\d*)')\n",
+    "        .loc[:, 'number']\n",
+    "        .apply(pandas.to_numeric, downcast='integer')\n",
+    "    )"
    ]
   },
   {
@@ -8978,7 +8996,6 @@
    "source": [
     "###### @todo:\n",
     "\n",
-    "- [ ] apply extraction to data\n",
     "- [ ] remove AB, MB beforehand"
    ]
   },
@@ -8998,7 +9015,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 65,
+   "execution_count": 66,
    "metadata": {},
    "outputs": [
     {
@@ -9014,7 +9031,7 @@
        "Name: number of samples, dtype: int64"
       ]
      },
-     "execution_count": 65,
+     "execution_count": 66,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -9032,7 +9049,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 66,
+   "execution_count": 67,
    "metadata": {},
    "outputs": [
     {
@@ -9141,7 +9158,7 @@
        "176    Combined                2.0"
       ]
      },
-     "execution_count": 66,
+     "execution_count": 67,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -9189,7 +9206,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 67,
+   "execution_count": 68,
    "metadata": {},
    "outputs": [
     {
@@ -9364,7 +9381,7 @@
        "pea      pea aphids                                    ✖️      ✅"
       ]
      },
-     "execution_count": 67,
+     "execution_count": 68,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -9421,7 +9438,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 68,
+   "execution_count": 69,
    "metadata": {},
    "outputs": [
     {
@@ -9430,7 +9447,7 @@
        "['ega', 'greenbug']"
       ]
      },
-     "execution_count": 68,
+     "execution_count": 69,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -9466,7 +9483,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 69,
+   "execution_count": 70,
    "metadata": {},
    "outputs": [
     {
@@ -9609,7 +9626,7 @@
        "         greenbug_apt                                   ✅     ✖️"
       ]
      },
-     "execution_count": 69,
+     "execution_count": 70,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -9657,7 +9674,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 70,
+   "execution_count": 71,
    "metadata": {},
    "outputs": [
     {
@@ -9787,7 +9804,7 @@
        "[216 rows x 4 columns]"
       ]
      },
-     "execution_count": 70,
+     "execution_count": 71,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -9820,7 +9837,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 71,
+   "execution_count": 72,
    "metadata": {},
    "outputs": [
     {
@@ -9829,7 +9846,7 @@
        "0.18981481481481483"
       ]
      },
-     "execution_count": 71,
+     "execution_count": 72,
      "metadata": {},
      "output_type": "execute_result"
     }
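
For reference, here is a minimal standalone sketch of the extraction step added in the new code cell. It is not the notebook's code: the sample `field` values and the `field_number` column name are made up for illustration, since the real contents of `hc` and `s2` are not visible in this diff. It uses the same regular expression, but keeps the original column instead of overwriting it and converts with `pandas.to_numeric` on the whole `Series`.

```python
import pandas as pd

# Hypothetical sample data; the real `hc` and `s2` sheets are not shown in the diff.
sheet = pd.DataFrame(
    {"field": ["Alvena wheat 1", "Alvena wheat 2", "Melfort pea", None]}
)

# Same pattern as the notebook cell: leading non-digits, then trailing digits.
# Named groups become the column names of the extracted DataFrame.
parts = sheet["field"].str.extract(r"(?P<text>\D*)(?P<number>\d*)")

# Values without a numeric suffix yield an empty string (or NaN for missing input);
# errors="coerce" turns anything non-numeric into NaN rather than raising.
numbers = pd.to_numeric(parts["number"], errors="coerce")

# Non-destructive variant: add a new column rather than replacing `field`.
sheet = sheet.assign(field_number=numbers)
print(sheet)
```

Because some rows have no suffix, the resulting column holds missing values and therefore stays floating point; that may be relevant to the `# @todo: needed?` note on the `dropna` subset above.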