Added some explanations

jessegeerts · Apr 20, 2018 · ec199ef
1 parent 7400f80
commit ec199ef

Showing 2 changed files with 186 additions and 47 deletions.
diff --git a/.ipynb_checkpoints/python_stats_intro-checkpoint.ipynb b/.ipynb_checkpoints/python_stats_intro-checkpoint.ipynb
@@ -24,7 +24,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Pandas is a Python package for easy to use data structures and analysis tools. "
+    "Pandas is a Python package that provides easy-to-use data structures and analysis tools. Its main data structure is the DataFrame, which is very similar to R's data.frame and ideal for data exploration."
    ]
   },
   {
@@ -33,7 +33,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# IQ and brain size data\n",
+    "# Load in a dataset that measured participants' IQ and brain size, among some other characteristics\n",
     "data = pd.read_csv('data/brain_size.csv', sep=';', na_values='.')"
    ]
   },
@@ -43,6 +43,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "# The head() method shows the first few rows of your DataFrame (five by default).\n",
     "data.head()"
    ]
   },
@@ -75,13 +76,6 @@
     "females.head()"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -102,6 +96,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "# The groupby method splits the data by a categorical variable so that each group can be summarised separately.\n",
+    "# For example, the mean of every numeric column for each gender:\n",
     "data.groupby('Gender').mean()"
    ]
   },
@@ -253,6 +249,13 @@
     "stats.ttest_rel(data['FSIQ'], data['PIQ'])"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The Wilcoxon signed-rank test is a close sibling of the dependent samples t-test. The dependent samples t-test analyzes whether the average difference between two repeated measures is zero, so it requires metric (interval or ratio) and normally distributed data; the Wilcoxon signed-rank test instead works on the ranks of the paired differences. It is therefore a common alternative to the dependent samples t-test when that test's assumptions are not met."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -280,13 +283,6 @@
     "plt.show()"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The Wilcoxon sign test signed rank test is a close sibling of the dependent samples t-test.  Because the dependent samples t-tests analyzes if the average difference of two repeated measures is zero; it requires metric (interval or ratio) and normally distributed data; the Wilcoxon sign test uses ranked or ordinal data.  Thus it is a common alternative to the dependent samples t-test when its assumptions are not met."
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -419,19 +415,14 @@
     "iris_data.head()"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
+    "# plotting.scatter_matrix can draw each category in your data in a different colour. To do so, wrap the\n",
+    "# category column in a pandas.Categorical and pass its integer codes through the 'c' keyword argument.\n",
     "categories = pd.Categorical(iris_data['name'])\n",
     "categories"
    ]
@@ -442,6 +433,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Each flower type now gets its own colour in the scatter matrix\n",
     "plotting.scatter_matrix(iris_data[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']], c=categories.labels)\n",
     "plt.show()"
    ]
@@ -452,6 +444,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Statsmodels lets you define a multiple regression model with an R-style formula, like this:\n",
     "model = ols('sepal_width ~ name + petal_length + sepal_length', iris_data).fit()"
    ]
   },
@@ -488,6 +481,8 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "# To test for an interaction, join two terms with '*' in the model formula.\n",
+    "# 'a * b' expands to 'a + b + a:b', so both main effects and their interaction are estimated.\n",
     "model = ols('sepal_width ~ name + petal_length * petal_width', iris_data).fit()"
    ]
   },
@@ -1173,13 +1168,57 @@
    "outputs": [],
    "source": []
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Converting variables from R to Python"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
    "metadata": {},
    "outputs": [],
    "source": [
-    "# "
+    "# Author: Charly\n",
+    "\n",
+    "from rpy2.robjects.vectors import Matrix, Array, DataFrame, FloatVector, IntVector, StrVector, ListVector\n",
+    "from rpy2.rinterface import Sexp  # base class of rpy2 R objects; assumed stand-in for the undefined is_r_type\n",
+    "import numpy as np\n",
+    "from pandas import DataFrame as PdDF\n",
+    "from collections import OrderedDict\n",
+    "known_r_types = Matrix, Array, DataFrame, FloatVector, IntVector, StrVector, ListVector\n",
+    "\n",
+    "python_to_r_types = {\n",
+    "   'list': (StrVector, ),\n",
+    "   'dict': (ListVector, ),\n",
+    "   'np_array': (FloatVector, IntVector, Array, Matrix),\n",
+    "   'pandas_df': (DataFrame, )\n",
+    "}\n",
+    "def recursive_r_to_py(data):\n",
+    "   \"\"\"\n",
+    "   Recursively convert rpy2 objects to native Python objects\n",
+    "   \"\"\"\n",
+    "   dtype = type(data)\n",
+    "   if dtype in python_to_r_types['dict']:\n",
+    "       return OrderedDict(zip(data.names, [recursive_r_to_py(d) for d in data]))\n",
+    "   elif dtype in python_to_r_types['list']:\n",
+    "       return [recursive_r_to_py(d) for d in data]\n",
+    "   elif dtype in python_to_r_types['np_array']:\n",
+    "       array = np.array(data)\n",
+    "       if array.size == 1:\n",
+    "           return array[0]\n",
+    "       else:\n",
+    "           return array\n",
+    "   elif dtype in python_to_r_types['pandas_df']:\n",
+    "       return PdDF(data)\n",
+    "   else:\n",
+    "       if isinstance(data, Sexp):  # An unknown R class\n",
+    "           raise NotImplementedError('Could not proceed, type {} is not defined. '\n",
+    "                                     'Recognised types are: {}'.format(dtype, known_r_types))\n",
+    "       else:\n",
+    "           return data  # We reached the end of recursion"
    ]
   },
   {
@@ -1189,6 +1228,55 @@
    "outputs": [],
    "source": []
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
   {
    "cell_type": "code",
    "execution_count": null,