🚧 📝 resolve todo in section: Align Selected Data Sets: Align Columns -> Field
devvyn committed Aug 28, 2018
1 parent be2f8bf commit f039321
Showing 1 changed file with 45 additions and 28 deletions.
73 changes: 45 additions & 28 deletions notebook/projects/2016-sweep-vs-tiller/2016-sweep-vs-tiller.ipynb
@@ -8473,9 +8473,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"It's clear to my eyes that in most cases, ***site*** and ***crop*** have been concatenated together to produce the value in ***field*** or ***field_name***.\n",
"It's clear to my eyes that in some cases, ***site*** and ***crop*** have been concatenated together to produce the value in ***field***.\n",
"\n",
"- ***site*** + ***crop*** = ***field***(***_name***)\n",
"- ***site*** + ***crop*** = ***field***\n",
"\n",
"Primary data:\n",
"\n",
@@ -8485,15 +8485,11 @@
"Mixed data:\n",
"\n",
"- ***field***\n",
"- ***field_name***\n",
"\n",
"Therefore, in those cases, I can safely disregard those derived columns and rely on the more normalized forms for indexing. That is to say I think it's most beneficial to use ***crop*** and ***site*** whenever possible."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Therefore, in samples corresponding to purely derivative ***field*** values, I can safely disregard those derived columns and rely on the more normalized ***crop*** and ***site*** for indexing. That is to say I think it's most beneficial to use ***crop*** and ***site*** rather than ***field*** whenever possible.\n",
"\n",
"The only complication is the numeric suffix present in some values of ***field***, which delineates areas of the site at which the samples were observed. The numeric values are unique per date-crop-place combination but not unique per site-crop combination, so I'll have to treat it as a separate column that represents an additional dimension to our independent variables.\n",
"\n",
"Extracting the unique pairs of dates and the corresponding numerals from the ***field*** value for that date:"
]
},
@@ -8952,14 +8948,14 @@
" )\n",
" .drop_duplicates(\n",
" subset=[\n",
" ('field', 'number'),\n",
" ('date', 'date'),\n",
" ('field', 'number'),\n",
" ],\n",
" )\n",
" .dropna(\n",
" subset=[\n",
" ('field', 'number'),\n",
" ('date', 'date'),\n",
" ('field', 'number'), # @todo: needed?\n",
" ],\n",
" )\n",
")\n"
@@ -8969,7 +8965,29 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on this look at the information, I'd like to throw away the remaining text portion when I apply this to my data set."
"Based on this look at the information, I'd like to throw away the remaining text portion after extracting the numbers when I apply this to my data set."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When it comes to actually using the number portion of the ***field*** column rather than merely displaying it here in the notebook, it's simpler just to assign a new `Series` to the `DataFrame` than it is to non-destructively express the intended outcome."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"for sheet in (hc, s2):\n",
" sheet.field = (\n",
" sheet.field\n",
" .str.extract(pat=r'(?P<text>\\D*)(?P<number>\\d*)')\n",
" .loc[:, 'number']\n",
" .apply(pandas.to_numeric, downcast='integer')\n",
" )"
]
},
{
@@ -8978,7 +8996,6 @@
"source": [
"###### @todo:\n",
"\n",
"- [ ] apply extraction to data\n",
"- [ ] remove AB, MB beforehand"
]
},
@@ -8998,7 +9015,7 @@
},
{
"cell_type": "code",
"execution_count": 65,
"execution_count": 66,
"metadata": {},
"outputs": [
{
@@ -9014,7 +9031,7 @@
"Name: number of samples, dtype: int64"
]
},
"execution_count": 65,
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
@@ -9032,7 +9049,7 @@
},
{
"cell_type": "code",
"execution_count": 66,
"execution_count": 67,
"metadata": {},
"outputs": [
{
@@ -9141,7 +9158,7 @@
"176 Combined 2.0"
]
},
"execution_count": 66,
"execution_count": 67,
"metadata": {},
"output_type": "execute_result"
}
@@ -9189,7 +9206,7 @@
},
{
"cell_type": "code",
"execution_count": 67,
"execution_count": 68,
"metadata": {},
"outputs": [
{
@@ -9364,7 +9381,7 @@
"pea pea aphids ✖️ ✅"
]
},
"execution_count": 67,
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
@@ -9421,7 +9438,7 @@
},
{
"cell_type": "code",
"execution_count": 68,
"execution_count": 69,
"metadata": {},
"outputs": [
{
@@ -9430,7 +9447,7 @@
"['ega', 'greenbug']"
]
},
"execution_count": 68,
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
@@ -9466,7 +9483,7 @@
},
{
"cell_type": "code",
"execution_count": 69,
"execution_count": 70,
"metadata": {},
"outputs": [
{
@@ -9609,7 +9626,7 @@
" greenbug_apt ✅ ✖️"
]
},
"execution_count": 69,
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
@@ -9657,7 +9674,7 @@
},
{
"cell_type": "code",
"execution_count": 70,
"execution_count": 71,
"metadata": {},
"outputs": [
{
@@ -9787,7 +9804,7 @@
"[216 rows x 4 columns]"
]
},
"execution_count": 70,
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
@@ -9820,7 +9837,7 @@
},
{
"cell_type": "code",
"execution_count": 71,
"execution_count": 72,
"metadata": {},
"outputs": [
{
@@ -9829,7 +9846,7 @@
"0.18981481481481483"
]
},
"execution_count": 71,
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
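
Note on the extraction cell added in this commit: the following is a minimal sketch, not the notebook's actual code, of how the pattern in that cell behaves on values with and without a numeric suffix. The sample `field` values are hypothetical, since the real column contents are not visible in this diff. It may help with the `# @todo: needed?` question on `dropna`: because `\d*` can match zero digits, a value with no suffix yields an empty string rather than `NaN`, so `dropna` would not drop it at that stage. The second half shows one possible variant (an assumption, not the author's method) that requires at least one digit, so a missing suffix becomes `NaN` directly.

```python
import pandas as pd

# Hypothetical sample values; the real `field` contents are not shown in this diff.
field = pd.Series(
    ["Melfort wheat1", "Melfort wheat2", "Outlook pea", None],
    name="field",
)

# Same pattern as the commit: every non-null string matches, because both
# groups are allowed to be empty.
parts = field.str.extract(r"(?P<text>\D*)(?P<number>\d*)")

# A value without a numeric suffix produces "" (not NaN) in the number group.
no_suffix_is_empty = parts.loc[2, "number"] == ""

# Variant (assumption): require at least one digit, so values without a
# suffix fail to match and come back as NaN, which to_numeric keeps as NaN.
number = pd.to_numeric(
    field.str.extract(r"(?P<number>\d+)", expand=False)
)
```

With the variant, `number` is a float column with missing values where no suffix exists, so a later `dropna` on that column has a clear effect.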
