From 32ce0589801e52082382325a8646807e416c3535 Mon Sep 17 00:00:00 2001 From: Quarto GHA Workflow Runner Date: Wed, 22 May 2024 01:03:55 +0000 Subject: [PATCH] Built site for gh-pages --- .nojekyll | 2 +- modules/week09/hw-09.html | 2 +- modules/week09/index-09.html | 2 +- modules/week09/whale-sdcexercise.html | 768 ----------------------- search.json | 835 +++++++++++++------------- sitemap.xml | 148 +++-- 6 files changed, 489 insertions(+), 1268 deletions(-) delete mode 100644 modules/week09/whale-sdcexercise.html diff --git a/.nojekyll b/.nojekyll index e625da0..9bdede0 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -7ee157fb \ No newline at end of file +3b416e22 \ No newline at end of file diff --git a/modules/week09/hw-09.html b/modules/week09/hw-09.html index 919b7b5..4ba4afb 100644 --- a/modules/week09/hw-09.html +++ b/modules/week09/hw-09.html @@ -274,7 +274,7 @@

Week 9 - Whale entanglement sdcMicro exercise

-

Your team has successfully obtained a dataset1 that encompasses whale entanglement data associated with specific fisheries along the West Coast. This dataset, named whale-sdc.csv, and an accompanying file called whale-exercise.Rmd.

+

Your team has successfully obtained a dataset1 that encompasses whale entanglement data associated with specific fisheries along the West Coast. The dataset is named whale-sdc.csv and comes with an accompanying file called whale-exercise.Rmd.

In groups of two or three, your task is to thoroughly examine the dataset and complete the provided R Markdown file. This entails implementing the necessary code and addressing the given questions. To ensure proper identification, please include the names of all participating members in the YAML header before submitting the modified R Markdown file.

diff --git a/modules/week09/index-09.html b/modules/week09/index-09.html index 6353933..f10832c 100644 --- a/modules/week09/index-09.html +++ b/modules/week09/index-09.html @@ -303,7 +303,7 @@

Slides and other materials -

Demo South Park: - Dataset - RMD document

+

Demo South Park: - [Dataset] - [RMD document]

Resources

diff --git a/modules/week09/whale-sdcexercise.html b/modules/week09/whale-sdcexercise.html deleted file mode 100644 index 8a7bcfb..0000000 --- a/modules/week09/whale-sdcexercise.html +++ /dev/null @@ -1,768 +0,0 @@ - - - - - - - - - - - -eds213 - sdcmicro-exercise - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-
- -
- -
- - - - -
- -
-
-

sdcmicro-exercise

-
- - - -
- -
-
Author
-
-

YOUR NAMES HERE

-
-
- -
-
Published
-
-

May 24, 2023

-
-
- - -
- - - -
- - -
-

Whale Entanglement sdcMicro Exercise

-

Your team acquired a dataset* whale-sdc.csv from researchers working with whale entanglement data on the West Coast. The dataset contains both direct and indirect identifiers. Your task is to assess the risk of re-identification of the fisheries associated with the cases before considering public release. Then, you should test one technique, apply k-anonymization to help lower the disclosure risk, and compute the resulting information loss.

-

Please complete this exercise in pairs or groups of three. Each group should download the dataset and complete the Rmd file, writing the necessary code and answering the questions. Remember to include your names in the YAML.

-

*This dataset was purposefully adapted exclusively for instructional use.

-
-

Setup

-
-
-

Package & Data
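A minimal setup sketch (assuming whale-sdc.csv sits in the working directory; sdcMicro is the package used throughout this exercise):

library(sdcMicro)

# Read the data; stringsAsFactors = TRUE makes categorical columns factors up front
whale <- read.csv("whale-sdc.csv", stringsAsFactors = TRUE)
dim(whale)
head(whale)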

-
-
-

Inspect the Dataset

-
-
-

Q1. How many direct identifiers are present in this dataset? What are they?

-

A:

-
-
-

Q2. What attributes would you consider quasi-identifiers? Why?

-

A:

-
-
-

Q3. What types of variables are they? Define them. (numeric, integer, factor, or string)

-

Make sure to have them set correctly.
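For instance, a sketch of coercing columns to the intended types (the column names here are hypothetical placeholders, not necessarily the real ones in whale-sdc.csv):

# Categorical codes should be factors, counts should be integers
whale$fishery <- as.factor(whale$fishery)
whale$year <- as.integer(whale$year)
str(whale)  # confirm every column now has the intended type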

-
-
-

4. Considering your answers to questions 1, 2, and 3, create an SDC problem.
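A sketch of setting up the SDC problem in sdcMicro, assuming the quasi-identifiers you chose in Q2 (the keyVars below are placeholders):

sdc <- createSdcObj(
  dat = whale,
  keyVars = c("fishery", "county", "year")  # substitute your Q2 answer
)
sdc  # printing the object reports frequency and risk summaries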

-
-
-

Q4.1 What is the risk of re-identification for this dataset?

-
-
-

Q4.2 To what extent does this dataset violate k-anonymity?
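A dataset violates k-anonymity when some combination of quasi-identifier values is shared by fewer than k records. A base-R way to count the offending records (again with placeholder column names):

# Size of each equivalence class, attached back to every row
key <- interaction(whale$fishery, whale$county, whale$year, drop = TRUE)
class_size <- ave(rep(1, nrow(whale)), key, FUN = sum)
sum(class_size < 3)  # records that violate 3-anonymity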

-
-
-

5. Consider techniques that could reduce the risk of re-identification.

-
-
-

Q5.1 Apply one non-perturbative method to a variable of your choice. How effective was it in lowering the disclosure risk?
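Global recoding (generalizing detailed values into broader bands) is one non-perturbative option. A sketch with a hypothetical numeric column:

# Collapse a detailed measurement into coarse bands; no values are distorted,
# only made less specific
whale$depth_band <- cut(whale$depth_m,
                        breaks = c(0, 50, 100, Inf),
                        labels = c("shallow", "mid", "deep"))
# sdcMicro's globalRecode() applies the same idea directly to an sdc object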

-
-
-

Q5.2 Apply k-anonymization with k = 3 to this dataset.
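In sdcMicro, local suppression is the standard way to reach a target k (a sketch, reusing the sdc object from question 4):

sdc <- localSuppression(sdc, k = 3)
sdc  # the reprinted summary shows the achieved k and how many values were suppressed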

-
-
-

Q6. Compute the information loss for the de-identified version of the dataset.
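One simple information-loss measure is the share of key-variable cells that suppression turned into NA (a sketch; sdcMicro also provides built-in utility measures):

vars <- c("fishery", "county", "year")  # placeholder quasi-identifiers
before <- whale[, vars]
after <- extractManipData(sdc)[, vars]
mean(is.na(after) & !is.na(before))  # fraction of cells lost to suppression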

- - -
-
- -
-
-


-

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License

-

UCSB logo

-
- -
- - - - - \ No newline at end of file diff --git a/search.json b/search.json index 3e8a437..7fa65c8 100644 --- a/search.json +++ b/search.json @@ -179,7 +179,7 @@ "href": "modules/week09/index-09.html#slides-and-other-materials", "title": "Week 9 - Sensitive data", "section": "Slides and other materials", - "text": "Slides and other materials\n\nslides-09.pptx\n\nDemo South Park: - Dataset - RMD document" + "text": "Slides and other materials\n\nslides-09.pptx\n\nDemo South Park: - [Dataset] - [RMD document]" }, { "objectID": "modules/week09/index-09.html#resources", @@ -203,347 +203,235 @@ "text": "In-class exercise (Day 2)\nInstructions for the Whale Entanglement sdcMicro Exercise" }, { - "objectID": "modules/week09/hw-09.html", - "href": "modules/week09/hw-09.html", - "title": "Week 9 - Whale entanglement sdcMicro exercise", - "section": "", - "text": "Your team has successfully obtained a dataset1 that encompasses whale entanglement data associated with specific fisheries along the West Coast. This dataset, named whale-sdc.csv, and an accompanying file called whale-exercise.Rmd.\nIn groups of two or three, your task is to thoroughly examine the dataset and complete the provided R Markdown file. This entails implementing the necessary code and addressing the given questions. To ensure proper identification, please include the names of all participating members in the YAML header before submitting the modified R Markdown file." - }, - { - "objectID": "modules/week09/hw-09.html#footnotes", - "href": "modules/week09/hw-09.html#footnotes", - "title": "Week 9 - Whale entanglement sdcMicro exercise", - "section": "Footnotes", - "text": "Footnotes\n\n\nThis dataset was purposefully adapted exclusively for instruction use.↩︎" - }, - { - "objectID": "modules/hw_bonus.html", - "href": "modules/hw_bonus.html", - "title": "Bonus Homework", - "section": "", - "text": "One might wonder if egg volumes are larger during warmer months. Indeed, in Egg size variation within passerine clutches: effects of ambient temperature and laying sequence 1, the authors report that:\n\nSlight but statistically significant positive correlations were detected between daily temperatures (mostly mean and minimum) and egg size. The first eggs of the clutch were often affected by the temperatures occurring about a week before they were laid. These temperatures probably influence the development of the insects from eggs and pupae providing protein for the egg-forming female. The last eggs of the clutch tended to be affected by the temperatures prevailing one to three days before laying, i.e.occurring in the most intensive period of egg formation.\n\nThere are multiple factors at play here, including clutch size and laying order, and we don’t have much data to work with using our class database, but still, we can investigate if there is any change in average egg volume between the months of June and July, hypothesizing that July is warmer than June.\nPlease submit your SQL code and your Python or R notebook." - }, - { - "objectID": "modules/hw_bonus.html#relationship-between-egg-volume-and-time-of-the-year", - "href": "modules/hw_bonus.html#relationship-between-egg-volume-and-time-of-the-year", - "title": "Bonus Homework", - "section": "", - "text": "One might wonder if egg volumes are larger during warmer months. 
Indeed, in Egg size variation within passerine clutches: effects of ambient temperature and laying sequence 1, the authors report that:\n\nSlight but statistically significant positive correlations were detected between daily temperatures (mostly mean and minimum) and egg size. The first eggs of the clutch were often affected by the temperatures occurring about a week before they were laid. These temperatures probably influence the development of the insects from eggs and pupae providing protein for the egg-forming female. The last eggs of the clutch tended to be affected by the temperatures prevailing one to three days before laying, i.e.occurring in the most intensive period of egg formation.\n\nThere are multiple factors at play here, including clutch size and laying order, and we don’t have much data to work with using our class database, but still, we can investigate if there is any change in average egg volume between the months of June and July, hypothesizing that July is warmer than June.\nPlease submit your SQL code and your Python or R notebook." - }, - { - "objectID": "modules/hw_bonus.html#step-1", - "href": "modules/hw_bonus.html#step-1", - "title": "Bonus Homework", - "section": "Step 1", - "text": "Step 1\nCreate a query to compute and group average egg volume by species and month. As before, use for volume the formula\n\\[{\\pi \\over 6} W^2 L\\]\nwhere \\(W\\) is egg width and \\(L\\) is egg length, and use 3.14 for \\(\\pi\\). Call this table T." - }, - { - "objectID": "modules/hw_bonus.html#step-2", - "href": "modules/hw_bonus.html#step-2", - "title": "Bonus Homework", - "section": "Step 2", - "text": "Step 2\nLooking at table T, you’ll notice that we have egg data for months 6 and 7 for most species, but there is one species for which there is only data for month 6. We want to exclude all such species since there will be nothing to plot for them. How to do that? Here’s a hint. First, create a query that identifies the set of species having 2 rows in T. Then, select the rows from T where the species is in the aforementioned set.\nJoin this reduced table with the Species table to grab scientific names, and write out to a CSV file." - }, - { - "objectID": "modules/hw_bonus.html#step-3", - "href": "modules/hw_bonus.html#step-3", - "title": "Bonus Homework", - "section": "Step 3", - "text": "Step 3\nUse R or Python to plot average egg volume as a function of month, by species. An example is shown below." - }, - { - "objectID": "modules/hw_bonus.html#footnotes", - "href": "modules/hw_bonus.html#footnotes", - "title": "Bonus Homework", - "section": "Footnotes", - "text": "Footnotes\n\n\nMikko Ojanen, Markku Orell, and Risto A. Väisänen (1981). Egg size variation within passerine clutches: effects of ambient temperature and laying sequence. Ornis Fennica 58:93-108. https://ornisfennica.journal.fi/article/view/133071↩︎" - }, - { - "objectID": "modules/week01/hw-01-2.html", - "href": "modules/week01/hw-01-2.html", - "title": "Week 1 - Data modeling", - "section": "", - "text": "Please use Canvas to return the assignments: https://ucsb.instructure.com/courses/19301/assignments/224311\nCreate a table definition for the Snow_survey table that is maximally expressive, that is, that captures as much of the semantics and characteristics of the data using SQL’s data definition language as is possible.\nIn the class data GitHub repository, week 1 directory you will find the table described in the metadata (consult 01_ASDN_Readme.txt) and the data can be found in ASDN_Snow_survey.csv. 
You will want to look at the values that occur in the data using a tool like R, Python, or OpenRefine.\nPlease consider:\nYou may (or may not) want to take advantage of the Species, Site, Color_band_code, and Personnel supporting tables. These are also documented in the metadata, and SQL table definitions for them have already been created and are included below.\nPlease express your table definition in SQL, but don’t worry about getting the SQL syntax exactly correct. This assignment is just a thought exercise. If you do want to try to write correct SQL, though, your may find it helpful to consult the DuckDB CREATE TABLE documentation.\nFinally, please provide some explanation for why you made the choices you did, and any questions or uncertainties you have. Don’t write an essay! Bullet points are sufficient. But do please explain your thought process." - }, - { - "objectID": "modules/week01/hw-01-2.html#appendix", - "href": "modules/week01/hw-01-2.html#appendix", - "title": "Week 1 - Data modeling", - "section": "Appendix", - "text": "Appendix\nCREATE TABLE Species (\n Code TEXT PRIMARY KEY,\n Common_name TEXT UNIQUE NOT NULL,\n Scientific_name TEXT,\n Relevance TEXT\n);\n\nCREATE TABLE Site (\n Code TEXT PRIMARY KEY,\n Site_name TEXT UNIQUE NOT NULL,\n Location TEXT NOT NULL,\n Latitude REAL NOT NULL CHECK (Latitude BETWEEN -90 AND 90),\n Longitude REAL NOT NULL CHECK (Longitude BETWEEN -180 AND 180),\n \"Total_Study_Plot_Area_(ha)\" REAL NOT NULL\n CHECK (\"Total_Study_Plot_Area_(ha)\" > 0),\n UNIQUE (Latitude, Longitude)\n);\n\nCREATE TABLE Color_band_code (\n Code TEXT PRIMARY KEY,\n Color TEXT NOT NULL UNIQUE\n);\n\nCREATE TABLE Personnel (\n Abbreviation TEXT PRIMARY KEY,\n Name TEXT NOT NULL UNIQUE\n);" - }, - { - "objectID": "modules/week03/hw-03-3.html", - "href": "modules/week03/hw-03-3.html", - "title": "Week 3 - SQL problem 3", + "objectID": "modules/week10/index-10.html", + "href": "modules/week10/index-10.html", + "title": "Week 10 - Data licensing and publication", "section": "", - "text": "Your mission is to list the scientific names of bird species in descending order of their maximum average egg volumes. That is, compute the average volume of the eggs in each nest, and then for the nests of each species compute the maximum of those average volumes, and list by species in descending order of maximum volume. You final table should look like:\n┌─────────────────────────┬────────────────────┐\n│ Scientific_name │ Max_avg_volume │\n│ varchar │ double │\n├─────────────────────────┼────────────────────┤\n│ Pluvialis squatarola │ 36541.8525390625 │\n│ Pluvialis dominica │ 33847.853515625 │\n│ Arenaria interpres │ 23338.6220703125 │\n│ Calidris fuscicollis │ 13277.143310546875 │\n│ Calidris alpina │ 12196.237548828125 │\n│ Charadrius semipalmatus │ 11266.974975585938 │\n│ Phalaropus fulicarius │ 8906.775146484375 │\n└─────────────────────────┴────────────────────┘\n(By the way, regarding the leader in egg size above, Birds of the World says that Pluvialis squatarola’s eggs are “Exceptionally large for size of female (ca. 16% weight of female)”.)\nTo calculate the volume of an egg, use the simplified formula\n\\[{\\pi \\over 6} W^2 L\\]\nwhere \\(W\\) is the egg width and \\(L\\) is the egg length. You can use 3.14 for \\(\\pi\\). 
(The real formula takes into account the ovoid shape of eggs, but only width and length are available to us here.)\nA good place to start is just to group bird eggs by nest (i.e., Nest_ID) and compute average volumes:\nCREATE TEMP TABLE Averages AS\n SELECT Nest_ID, AVG(...) AS Avg_volume\n FROM ...\n GROUP BY ...;\nYou can now join that table with Bird_nests, so that you can group by species, and also join with the Species table to pick up scientific names. To do just the first of those joins, you could say something like\nSELECT Species, MAX(...)\n FROM Bird_nests JOIN Averages USING (Nest_ID)\n GROUP BY ...;\n(Notice how, if the joined columns have the same name, you can more compactly say USING (common_column) instead of ON column_a = column_b.)\nThat’s not the whole story, we want scientific names not species codes. Another join is needed. A couple strategies here. One, you can modify the above query to also join with the Species table (you’ll need to replace USING with ON …). Two, you can save the above as another temp table and join it to Species separately.\nDon’t forget to order the results. Here it is convenient to give computed quantities nice names so you can refer to them.\nPlease submit all of the SQL you used to solve the problem. Bonus points if you can do all of the above in one statement." + "text": "Become familiar with data licensing in the academic context.\nDistinguish which research deliverables can or cannot be subject to copyright and explore alternatives to that.\nGain an understanding of the Creative Commons license family and the differences in their applicability.\n\n\n\n\nslides-10-part1.pptx" }, { - "objectID": "modules/week03/hw-03-1.html", - "href": "modules/week03/hw-03-1.html", - "title": "Week 3 - SQL problem 1", + "objectID": "modules/week10/index-10.html#part-i---data-licensing", + "href": "modules/week10/index-10.html#part-i---data-licensing", + "title": "Week 10 - Data licensing and publication", "section": "", - "text": "It’s a useful skill in life (I’m not being rhetorical, I really mean that, it’s a useful skill) to be able to construct an experiment to answer a hypothesis. Suppose you’re not sure what the AVG function returns if there are NULL values in the column being averaged. Suppose you either didn’t have access to any documentation, or didn’t trust it. What experiment could you run to find out what happens?\nThere are two parts to this problem.\n\nPart 1\nConstruct an SQL experiment to determine the answer to the question above. Does SQL abort with some kind of error? Does it ignore NULL values? Do the NULL values somehow factor into the calculation, and if so, how?\nI would suggest you start by creating a table (in the bird database, in a new database, in a transient in-memory database, doesn’t matter) with a single column that has data type REAL (for part 2 below, it must be REAL). You can make your table a temp table or not, your choice.\nCREATE TEMP TABLE mytable... ;\nNow insert some real numbers and at least one NULL into your table.\nINSERT INTO mytable... ;\n(Hmm, can you insert multiple rows at once, or do you have to do a separate INSERT for each row?)\nOnce you have your little table constructed, try doing an AVG on the column and see what is returned. What would the average be if the function ignored NULLs? What would the average be if it somehow factored them in? 
What is actually returned?\nPlease submit both your SQL and your answer to the question about how AVG operates in the presence of NULL values.\n\n\nPart 2\nIf SQL didn’t have an AVG function, you could compute the average value of a column by doing something like this on your table:\nSELECT SUM(mycolumn)/COUNT(*) FROM mytable;\nSELECT SUM(mycolumn)/COUNT(mycolumn) FROM mytable;\nWhich query above is correct? Please explain why.\nNow that you’re done with your table, you can delete it if desired:\nDROP TABLE mytable;" + "text": "Become familiar with data licensing in the academic context.\nDistinguish which research deliverables can or cannot be subject to copyright and explore alternatives to that.\nGain an understanding of the Creative Commons license family and the differences in their applicability.\n\n\n\n\nslides-10-part1.pptx" }, { - "objectID": "modules/week06/python-programming.html", - "href": "modules/week06/python-programming.html", - "title": "eds213", - "section": "", - "text": "import duckdb\n\nExample of Jupyter “magic command”:\n\n%pwd\n\n'/Users/gjanee-local/Desktop/meds/bren-meds213-spring-2024-class-data/week3'\n\n\nTo install DuckDB Python module:\n\n# %pip install duckdb\n\n\nCreate a connection and a cursor\n\n\nconn = duckdb.connect(\"database.db\")\n\n\nconn\n\n<duckdb.duckdb.DuckDBPyConnection at 0x1040abb70>\n\n\n\ncur = conn.cursor()\n\nNow let’s do something with our cursor\n\ncur.execute(\"SELECT * FROM Site LIMIT 5\")\n\n<duckdb.duckdb.DuckDBPyConnection at 0x10406b6b0>\n\n\nNow we want results… three ways of getting them. 1. All results at once\n\ncur.fetchall()\n\n[('barr',\n 'Barrow',\n 'Alaska, USA',\n 71.30000305175781,\n -156.60000610351562,\n 220.39999389648438),\n ('burn',\n 'Burntpoint Creek',\n 'Ontario, Canada',\n 55.20000076293945,\n -84.30000305175781,\n 63.0),\n ('bylo',\n 'Bylot Island',\n 'Nunavut, Canada',\n 73.19999694824219,\n -80.0,\n 723.5999755859375),\n ('cakr',\n 'Cape Krusenstern',\n 'Alaska, USA',\n 67.0999984741211,\n -163.5,\n 54.099998474121094),\n ('cari',\n 'Canning River Delta',\n 'Alaska, USA',\n 70.0999984741211,\n -145.8000030517578,\n 722.0)]\n\n\nCursors don’t store anything, they just transfer queries to the database and get results back.\n\ncur.fetchall()\n\n[]\n\n\nAlways get tuples, even if you only request one column\n\ncur.execute(\"SELECT Nest_ID FROM Bird_nests LIMIT 10\")\n\n<duckdb.duckdb.DuckDBPyConnection at 0x10406b6b0>\n\n\n\ncur.fetchall()\n\n[('14HPE1',),\n ('11eaba',),\n ('11eabaagc01',),\n ('11eabaagv01',),\n ('11eababbc02',),\n ('11eababsv01',),\n ('11eabaduh01',),\n ('11eabaduv01',),\n ('11eabarpc01',),\n ('11eabarpc02',)]\n\n\n\ncur.execute(\"SELECT Nest_ID FROM Bird_nests LIMIT 10\")\n[t[0] for t in cur.fetchall()]\n\n['14HPE1',\n '11eaba',\n '11eabaagc01',\n '11eabaagv01',\n '11eababbc02',\n '11eababsv01',\n '11eabaduh01',\n '11eabaduv01',\n '11eabarpc01',\n '11eabarpc02']\n\n\n\nGet the one result, or the next result\n\n\ncur.execute(\"SELECT COUNT(*) FROM Bird_nests\")\ncur.fetchall()\n\n[(1547,)]\n\n\n\ncur.execute(\"SELECT COUNT(*) FROM Bird_nests\")\ncur.fetchone()\n\n(1547,)\n\n\n\ncur.execute(\"SELECT COUNT(*) FROM Bird_nests\")\ncur.fetchone()[0]\n\n1547\n\n\n\nUsing an iterator - but DuckDB doesn’t support iterators :(\n\n\ncur.execute(\"SELECT Nest_ID FROM Bird_nests LIMIT 10\")\nfor row in cur:\n print(f\"got {row[0]}\")\n\nTypeError: 'duckdb.duckdb.DuckDBPyConnection' object is not iterable\n\n\nA workaround:\n\ncur.execute(\"SELECT Nest_ID FROM Bird_nests LIMIT 10\")\nwhile True:\n row = 
cur.fetchone()\n if row == None:\n break\n # do something with row\n print(f\"got nest ID {row[0]}\")\n\ngot nest ID 14HPE1\ngot nest ID 11eaba\ngot nest ID 11eabaagc01\ngot nest ID 11eabaagv01\ngot nest ID 11eababbc02\ngot nest ID 11eababsv01\ngot nest ID 11eabaduh01\ngot nest ID 11eabaduv01\ngot nest ID 11eabarpc01\ngot nest ID 11eabarpc02\n\n\nCan do things other than SELECT!\n\ncur.execute(\"CREATE TEMP TABLE temp_table AS\n SELECT * FROM Bird_nests LIMIT 10\")\n\nSyntaxError: unterminated string literal (detected at line 1) (1747419494.py, line 1)\n\n\n\ncur.execute(\"\"\"\n CREATE TEMP TABLE temp_table AS\n SELECT * FROM Bird_nests LIMIT 10\n\"\"\")\n\n<duckdb.duckdb.DuckDBPyConnection at 0x10406b6b0>\n\n\n\ncur.execute(\"SELECT * FROM temp_table\")\n\n<duckdb.duckdb.DuckDBPyConnection at 0x10406b6b0>\n\n\n\ncur.fetchall()\n\n[('b14.6',\n 2014,\n 'chur',\n '14HPE1',\n 'sepl',\n 'vloverti',\n datetime.date(2014, 6, 14),\n None,\n 3,\n None,\n None),\n ('b11.7',\n 2011,\n 'eaba',\n '11eaba',\n 'wrsa',\n 'bhill',\n datetime.date(2011, 7, 10),\n 'searcher',\n 4,\n None,\n None),\n ('b11.6',\n 2011,\n 'eaba',\n '11eabaagc01',\n 'amgp',\n 'dkessler',\n datetime.date(2011, 6, 24),\n 'searcher',\n 4,\n 6.0,\n 'float'),\n ('b11.6',\n 2011,\n 'eaba',\n '11eabaagv01',\n 'amgp',\n 'dkessler',\n datetime.date(2011, 6, 25),\n 'searcher',\n 3,\n 3.0,\n 'float'),\n ('b11.6',\n 2011,\n 'eaba',\n '11eababbc02',\n 'bbpl',\n 'dkessler',\n datetime.date(2011, 6, 24),\n 'searcher',\n 4,\n 4.0,\n 'float'),\n ('b11.7',\n 2011,\n 'eaba',\n '11eababsv01',\n 'wrsa',\n 'bhill',\n datetime.date(2011, 7, 7),\n 'searcher',\n 4,\n 2.0,\n 'float'),\n ('b11.6',\n 2011,\n 'eaba',\n '11eabaduh01',\n 'dunl',\n 'dkessler',\n datetime.date(2011, 6, 28),\n 'searcher',\n 3,\n 2.0,\n 'float'),\n ('b11.6',\n 2011,\n 'eaba',\n '11eabaduv01',\n 'dunl',\n 'dkessler',\n datetime.date(2011, 6, 29),\n 'searcher',\n 4,\n 5.0,\n 'float'),\n ('b11.7',\n 2011,\n 'eaba',\n '11eabarpc01',\n 'reph',\n 'bhill',\n datetime.date(2011, 7, 8),\n 'searcher',\n 4,\n 4.0,\n 'float'),\n ('b11.7',\n 2011,\n 'eaba',\n '11eabarpc02',\n 'reph',\n 'bhill',\n datetime.date(2011, 7, 8),\n 'searcher',\n 3,\n 4.0,\n 'float')]\n\n\nA note on fragility\nFor example: INSERT INTO Site VALUES (“abcd”, “Foo”, 35.7, 42.3, “?”)\nA less fragile way of expressing the same thing: INSERT INTO Site (Code, Site_name, Latitude, Longitude, Something_else) VALUES (“abcd”, “Foo”, 35.7, 42.3, “?”)\nIn the same vein: SELECT * is fragile\n\ncur.execute(\"SELECT * FROM Site LIMIT 3\")\ncur.fetchall()\n\n[('barr',\n 'Barrow',\n 'Alaska, USA',\n 71.30000305175781,\n -156.60000610351562,\n 220.39999389648438),\n ('burn',\n 'Burntpoint Creek',\n 'Ontario, Canada',\n 55.20000076293945,\n -84.30000305175781,\n 63.0),\n ('bylo',\n 'Bylot Island',\n 'Nunavut, Canada',\n 73.19999694824219,\n -80.0,\n 723.5999755859375)]\n\n\nA better, more robust way of coding the same thing:\n\ncur.execute(\"SELECT Site_name, Code, Latitude, Longitude FROM Site LIMIT 3\")\ncur.fetchall()\n\n[('Barrow', 'barr', 71.30000305175781, -156.60000610351562),\n ('Burntpoint Creek', 'burn', 55.20000076293945, -84.30000305175781),\n ('Bylot Island', 'bylo', 73.19999694824219, -80.0)]\n\n\nAn extended example: Question we’re trying to answer: How many nests do we have for each species?\nApproach: first get all species. 
Then execute a count query for each species.\nA digression: string interpolation in Python\n\n# The % method\ns = \"My name is %s\"\nprint(s % \"Greg\")\ns = \"My name is %s and the other teacher's name is %s\"\nprint(s % (\"Greg\", \"Julien\"))\n# The new f-string method\nname = \"Greg\"\nprint(f\"My name is {name}\")\n# Third way\nprint(\"My name is {}\".format(\"Greg\"))\n\nMy name is Greg\nMy name is Greg and the other teacher's name is Julien\nMy name is Greg\nMy name is Greg\n\n\n\nquery = \"\"\"\n SELECT COUNT(*) FROM Bird_nests\n WHERE Species = '%s'\n\"\"\"\ncur.execute(\"SELECT Code FROM Species LIMIT 3\")\nfor row in cur.fetchall(): # DuckDB workaround\n code = row[0]\n prepared_query = query % code\n #print(prepared_query)\n cur2 = conn.cursor()\n cur2.execute(prepared_query)\n print(f\"Species {code} has {cur2.fetchone()[0]} nests\")\n cur2.close()\n\nSpecies agsq has 0 nests\nSpecies amcr has 0 nests\nSpecies amgp has 29 nests\n\n\nThe above Python interpolation is dangerous and has caused many database hacks! There’s a better way\n\nquery = \"\"\"\n SELECT COUNT(*) FROM Bird_nests\n WHERE Species = ?\n\"\"\"\ncur.execute(\"SELECT Code FROM Species LIMIT 3\")\nfor row in cur.fetchall(): # DuckDB workaround\n code = row[0]\n # NOT NEEDED! prepared_query = query % code\n #print(prepared_query)\n cur2 = conn.cursor()\n cur2.execute(query, [code]) # <-- added argument here\n print(f\"Species {code} has {cur2.fetchone()[0]} nests\")\n cur2.close()\n\nSpecies agsq has 0 nests\nSpecies amcr has 0 nests\nSpecies amgp has 29 nests\n\n\nLet’s illustrate the danger with a different example\n\nabbrev = \"TS\"\nname = \"Taylor Swift\"\ncur.execute(\"\"\"\n INSERT INTO Personnel (Abbreviation, Name)\n VALUES ('%s', '%s')\n \"\"\" % (abbrev, name)\n )\n\n<duckdb.duckdb.DuckDBPyConnection at 0x10406b6b0>\n\n\n\ncur.execute(\"SELECT * FROM Personnel\")\ncur.fetchall()[-3:]\n\n[('emagnuson', 'Emily Magnuson'),\n ('mcorrell', 'Maureen Correll'),\n ('TS', 'Taylor Swift')]\n\n\n\nabbrev = \"CO\"\nname = \"Conan O'Brien\"\ncur.execute(\"\"\"\n INSERT INTO Personnel (Abbreviation, Name)\n VALUES ('%s', '%s')\n \"\"\" % (abbrev, name)\n )\n\nParserException: Parser Error: syntax error at or near \"Brien\"\n\n\n\n\"\"\"\n INSERT INTO Personnel (Abbreviation, Name)\n VALUES ('%s', '%s')\n \"\"\" % (abbrev, name)\n\n\"\\n INSERT INTO Personnel (Abbreviation, Name)\\n VALUES ('CO', 'Conan O'Brien')\\n \"\n\n\n\nabbrev = \"CO\"\nname = \"Conan O'Brien\"\ncur.execute(\"\"\"\n INSERT INTO Personnel (Abbreviation, Name)\n VALUES (?, ?)\n \"\"\",\n [abbrev, name])\n\n<duckdb.duckdb.DuckDBPyConnection at 0x10406b6b0>\n\n\n\ncur.execute(\"SELECT * FROM Personnel\")\ncur.fetchall()[-3:]\n\n[('mcorrell', 'Maureen Correll'),\n ('TS', 'Taylor Swift'),\n ('CO', \"Conan O'Brien\")]" + "objectID": "modules/week10/index-10.html#part-ii---data-publication", + "href": "modules/week10/index-10.html#part-ii---data-publication", + "title": "Week 10 - Data licensing and publication", + "section": "Part II - Data publication", + "text": "Part II - Data publication\n\nLearning goals\n\nUnderstand the importance of publishing research data.\nIdentify and select appropriate approaches to data publication.\nExplain the role of persistent identifiers.\n\n\n\nSlides\nslides-10-part2.pptx" }, { - "objectID": "modules/week06/hw-06-1.html", - "href": "modules/week06/hw-06-1.html", - "title": "Week 6 - Little Bobby Tables", - "section": "", - "text": "View this classic XKCD cartoon:\nFor the purposes of this problem you may assume 
that at some point the school’s system performs the query\nwhere a student’s name, as input by a user of the system, is directly substituted for the %s. Explain exactly how Little Bobby Tables’ “name” can cause a catastrophe. Also, explain why his name has two dashes (--) at the end." + "objectID": "modules/week10/index-10.html#suggested-readings", + "href": "modules/week10/index-10.html#suggested-readings", + "title": "Week 10 - Data licensing and publication", + "section": "Suggested readings*", + "text": "Suggested readings*\n\nCarroll, M. W. (2015) Sharing Research Data and Intellectual Property Law: A Primer. PLoS Biol 13(8): e1002235. https://doi.org/10.1371/journal.pbio.1002235\nFay, C. (2019). Licensing R. https://thinkr-open.github.io/licensing-r\nReitz, K., & Schlusser, T. (2022). The Hitchhiker’s guide to Python: best practices for development. ” O’Reilly Media, Inc.”. https://docs.python-guide.org/writing/license\n\n*Useful links and other supporting materials are noted in the slides." }, { - "objectID": "modules/week06/hw-06-1.html#bonus-problem", - "href": "modules/week06/hw-06-1.html#bonus-problem", - "title": "Week 6 - Little Bobby Tables", - "section": "Bonus problem!", - "text": "Bonus problem!\nHack your bird database! Let’s imagine that your Shiny application, in response to user input, executes the query\nSELECT * FROM Species WHERE Code = '%s';\nwhere a species code (supplied by the application user) is directly substituted for the query’s %s using Python interpolation. For example, an innocent user might input “wolv”. Craft an input that a devious user could use to:\n\nAdd Taylor Swift to the Personnel table\nYet still return the results of the query SELECT * FROM Species WHERE Code = 'wolv' (devious!)" + "objectID": "modules/week10/index-10.html#homework", + "href": "modules/week10/index-10.html#homework", + "title": "Week 10 - Data licensing and publication", + "section": "Homework", + "text": "Homework\nNo homework this week!" }, { - "objectID": "modules/week06/python-programming-cont.html", - "href": "modules/week06/python-programming-cont.html", - "title": "Pandas", + "objectID": "modules/week01/hw-01-1.html", + "href": "modules/week01/hw-01-1.html", + "title": "Week 1 - Create an ER diagram", "section": "", - "text": "To install the duckdb Python package:\n\n%pip install duckdb\n\nCommon model: connect to database, get a cursor. In Python, all database packages follow the DB-API standard, so they all look the same. See course website for pointer to DB-API.\n\nimport duckdb\n\n\nconn = duckdb.connect(\"database.db\")\n\nCursor mediates access to query, getting results. Can deal with one query at a time.\n\ncur = conn.cursor()\n\nGet all results. 
Cursor is streaming mechanism, does not store results.\n\ncur.execute(\"SELECT * FROM Camp_assignment LIMIT 3\")\ncur.fetchall()\n\n[(2005,\n 'bylo',\n 'lmckinnon',\n datetime.date(2005, 6, 1),\n datetime.date(2005, 8, 5)),\n (2005,\n 'bylo',\n 'blalibert',\n datetime.date(2005, 6, 1),\n datetime.date(2005, 8, 20)),\n (2006,\n 'bylo',\n 'lmckinnon',\n datetime.date(2006, 6, 1),\n datetime.date(2006, 8, 5))]\n\n\n\ncur.fetchall()\n\n[]\n\n\nOr get one row at a time\n\ncur.execute(\"SELECT * FROM Camp_assignment LIMIT 3\")\ncur.fetchone()\n\n(2005,\n 'bylo',\n 'lmckinnon',\n datetime.date(2005, 6, 1),\n datetime.date(2005, 8, 5))\n\n\n\ncur.fetchone()\n\n(2005,\n 'bylo',\n 'blalibert',\n datetime.date(2005, 6, 1),\n datetime.date(2005, 8, 20))\n\n\n\ncur.fetchone()\n\n(2006,\n 'bylo',\n 'lmckinnon',\n datetime.date(2006, 6, 1),\n datetime.date(2006, 8, 5))\n\n\n\ncur.fetchone()\n\nExtended example showing looping over cursor (DuckDB does not support direct iteration over cursor), using second cursor, using parameterized queries.\n\ninner_query = \"\"\"\n SELECT COUNT(*) AS num_nests\n FROM Bird_nests\n WHERE Observer = ?\n\"\"\"\n\nouter_query = \"\"\"\n SELECT DISTINCT Observer FROM Bird_nests\n\"\"\"\nfor row in cur.execute(outer_query).fetchall():\n observer = row[0]\n cur2 = conn.cursor()\n cur2.execute(inner_query, [observer])\n print(f\"Observer {observer} gathered {cur2.fetchone()[0]} nests\")\n\nObserver mballvanzee gathered 2 nests\nObserver dkessler gathered 69 nests\nObserver bharrington gathered 245 nests\nObserver lmckinnon gathered 249 nests\nObserver dhodkinson gathered 15 nests\nObserver mbwunder gathered 4 nests\nObserver None gathered 0 nests\nObserver kkalasz gathered 12 nests\nObserver bhill gathered 55 nests\nObserver ssaalfeld gathered 13 nests\nObserver wenglish gathered 18 nests\nObserver lworing gathered 14 nests\nObserver vloverti gathered 54 nests\nObserver rlanctot gathered 40 nests\nObserver abankert gathered 17 nests\nObserver edastrous gathered 38 nests\nObserver jzamuido gathered 11 nests\nObserver amould gathered 42 nests\nObserver bkaselow gathered 4 nests\nObserver jflamarre gathered 43 nests\n\n\n\nPandas\n\nimport pandas as pd\n\n\ndf = pd.read_sql(\"SELECT * FROM Site\", conn)\n\n/var/folders/rl/j368fbbx25l937pdxgzdpmxm0000gq/T/ipykernel_18456/2832309421.py:1: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. 
Please consider using SQLAlchemy.\n df = pd.read_sql(\"SELECT * FROM Site\", conn)\n\n\n\ndf\n\n\n\n\n\n\n\n\n\nCode\nSite_name\nLocation\nLatitude\nLongitude\nArea\n\n\n\n\n0\nbarr\nBarrow\nAlaska, USA\n71.300003\n-156.600006\n220.399994\n\n\n1\nburn\nBurntpoint Creek\nOntario, Canada\n55.200001\n-84.300003\n63.000000\n\n\n2\nbylo\nBylot Island\nNunavut, Canada\n73.199997\n-80.000000\n723.599976\n\n\n3\ncakr\nCape Krusenstern\nAlaska, USA\n67.099998\n-163.500000\n54.099998\n\n\n4\ncari\nCanning River Delta\nAlaska, USA\n70.099998\n-145.800003\n722.000000\n\n\n5\nchau\nChaun River Delta\nChukotka, Russia\n68.800003\n170.600006\n248.199997\n\n\n6\nchur\nChurchill\nManitoba, Canada\n58.700001\n-93.800003\n866.900024\n\n\n7\ncoat\nCoats Island\nNunavut, Canada\n62.900002\n-82.500000\n1239.099976\n\n\n8\ncolv\nColville River Delta\nAlaska, USA\n70.400002\n-150.699997\n324.799988\n\n\n9\neaba\nEast Bay\nNunavut, Canada\n64.000000\n-81.699997\n1205.500000\n\n\n10\niglo\nIgloolik\nNunavut, Canada\n69.400002\n-81.599998\n59.799999\n\n\n11\nikpi\nIkpikpuk\nAlaska, USA\n70.599998\n-154.699997\n174.100006\n\n\n12\nlkri\nLower Khatanga River\nKrasnoyarsk, Russia\n72.900002\n106.099998\n270.899994\n\n\n13\nmade\nMackenzie River Delta\nNorthwest Territories, Canada\n69.400002\n-135.000000\n667.299988\n\n\n14\nnome\nNome\nAlaska, USA\n64.400002\n-164.899994\n90.099998\n\n\n15\nprba\nPrudhoe Bay\nAlaska, USA\n70.300003\n-148.600006\n120.000000" + "text": "Please use Canvas to return the assignments: https://ucsb.instructure.com/courses/19301/assignments/236835\nCreate a physical ER (entity-relationship) diagram for the Harry Potter tables shown in class. It will be helpful to refer back to the slides.\nAs discussed briefly in class, a logical or conceptual ER diagram focuses on high-level abstractions, and doesn’t address how entities and relationships actually get implemented. In particular, in a logical ER diagram a many-to-many relationship between two entities might be represented by a simple line, even though in implementation a many-to-many relationship requires a separate table to store the relationship tuples. By contrast, a physical ER diagram describes actual tables. You are being asked to create a physical ER diagram.\nRequirements:\n\nYour diagram should include Student, House, Wand, Course, and Enrollment tables.\nEach table should list the name of the entity, any attributes, and which attribute(s) form the primary key, if there is one.\nA foreign key relationship from an attribute in one table to an attribute in another table should be indicated by a line between the two attributes. The ends of the lines should reflect the cardinalities at each end. See the example below.\n\nTwist #1! The slides shown in class demonstrated a many-to-one relationship between wands and students, i.e., one student might own multiple wands, but any given wand has only one owner. However, for this exercise, you are being asked to model a many-to-many relationship between wands and students (it happened in the books that the same wand was used by different students, though at different times, of course). To create a many-to-many relationship, you will need to invent an intermediate table that represents the student-wand ownership relation, in the same way the Enrollment table intermediates between the Student and Course tables.\nTwist #2! You must also store the date range (i.e., begin date and end date) of wand ownership. You will need to think where these date attributes belong. Are they attributes of a student? 
Of a wand? Of something else?\nVarious symbologies have been developed for ER diagrams. For this assignment, represent the “one” side of a many-to-one relationship by a single vertical bar, and represent the “many” side by a so-called crow’s foot. In the end, your diagram should visually resemble something like this:\n\nYou can use a tool like dbdiagram.io as was used to create the above diagram, or any other drawing tool. Or you can just draw it by hand and take a picture with your phone. Regardless of the method, be sure to indicate primary keys somehow (bold text, underlined text, add “PK” next to the attribute, etc., whatever works visually)." }, { - "objectID": "modules/week04/hw-04-3.html", - "href": "modules/week04/hw-04-3.html", - "title": "Week 4 - Who’s the culprit?", + "objectID": "modules/week01/index-01.html", + "href": "modules/week01/index-01.html", + "title": "Week 1 - Relational databases and data modeling", "section": "", - "text": "You’re reading up on how eggs are aged by floating them in water 1:\nwhen you receive an urgent phone call from a colleague who says they just discovered that an observer, who worked at the “nome” site between 1998 and 2008 inclusive, had been floating eggs in salt water and not freshwater. The density of salt water being different, those measurements are incorrect and need to be adjusted. The colleague says that this incorrect technique was used on exactly 36 nests, but before you can ask who the observer was, the phone is disconnected. Who made this error? That is, looking at nest data for “nome” between 1998 and 2008 inclusive, and for which egg age was determined by floating, can you determine the name of the observer who observed exactly 36 nests? Please submit your SQL. Your SQL should return exactly one row, the answer. That is, your query should produce:" - }, - { - "objectID": "modules/week04/hw-04-3.html#footnotes", - "href": "modules/week04/hw-04-3.html#footnotes", - "title": "Week 4 - Who’s the culprit?", - "section": "Footnotes", - "text": "Footnotes\n\n\nLiebezeit, Joseph R., et al. “Assessing the Development of Shorebird Eggs Using the Flotation Method: Species-Specific and Generalized Regression Models.” The Condor, vol. 109, no. 1, 2007, pp. 32–47. JSTOR, http://www.jstor.org/stable/4122529↩︎" + "text": "Benefits of relational databases\nRelational data model and SQL data definition\nData modeling" }, { - "objectID": "modules/week04/index-04.html", - "href": "modules/week04/index-04.html", - "title": "Week 4 - SQL and DuckDB", + "objectID": "modules/week01/index-01.html#learning-objectives", + "href": "modules/week01/index-01.html#learning-objectives", + "title": "Week 1 - Relational databases and data modeling", "section": "", - "text": "Continued exploration of SQL concepts including joins, views, and set operations; and apply it to conduct data analysis." + "text": "Benefits of relational databases\nRelational data model and SQL data definition\nData modeling" }, { - "objectID": "modules/week04/index-04.html#learning-objectives", - "href": "modules/week04/index-04.html#learning-objectives", - "title": "Week 4 - SQL and DuckDB", - "section": "", - "text": "Continued exploration of SQL concepts including joins, views, and set operations; and apply it to conduct data analysis." 
+ "objectID": "modules/week01/index-01.html#slides", + "href": "modules/week01/index-01.html#slides", + "title": "Week 1 - Relational databases and data modeling", + "section": "Slides", + "text": "Slides\nslides-01.pptx" }, { - "objectID": "modules/week04/index-04.html#slides-and-other-materials", - "href": "modules/week04/index-04.html#slides-and-other-materials", - "title": "Week 4 - SQL and DuckDB", - "section": "Slides and other materials", - "text": "Slides and other materials\nslides-04.pptx\nLecture notes:\n\nlecture-notes-04-mon.txt\nclass-script-04-mon.sql\nclass-script-04-wed-empty.sql\nclass-script-04-wed-solution.sql\n\nData:\n\nClass data GitHub repository, week 3 – for Monday\nClass data GitHub repository, week 4 – for Wednesday only\nASDN dataset ER (entity-relationship) diagram" + "objectID": "modules/week01/index-01.html#resources", + "href": "modules/week01/index-01.html#resources", + "title": "Week 1 - Relational databases and data modeling", + "section": "Resources", + "text": "Resources\n\nhttps://learning.nceas.ucsb.edu/2023-06-delta/session_09.html\n\nVery brief introduction to data modeling, ties into “tidy data.”\n\nChristoph Wohner, Johannes Peterseil, and Hermann Klug (2022). Designing and implementing a data model for describing environmental monitoring and research sites. Ecological Informatics 70, 101708.\nhttps://doi.org/10.1016/j.ecoinf.2022.101708\n\nGood case study.\n\nGerald A. Burnette (2022). Managing environmental data: principles, techniques, and best practices. CRC Press.\nAccess via Library Catalog\n\nComprehensive text, specific to environmental sciences.\n\nGraeme C. Simsion and Graham C. Witt (2005). Data Modeling Essentials. 3rd ed. Amsterdam: Morgan Kaufmann.\nAccess via Library Catalog\nGoogle Books\n\nComprehensive text, not specific to the environmental sciences.\n\nHartmut Hebbel (1994). Environmental data modeling. Annals of Operations Research 54, 263-278.\nhttps://doi.org/10.1007/BF02031737\n\nA broader view of data organization.\n\nJeffrey D. Ullman and Jennifer Widom (2008). A First Course in Database Systems. 3rd ed. Upper Saddle River, NJ: Pearson/Prentice Hall.\nAccess via Library Catalog\n\nComplete but theoretical introduction to relational databases, data modeling, and relational algebra." }, { - "objectID": "modules/week04/index-04.html#homework", - "href": "modules/week04/index-04.html#homework", - "title": "Week 4 - SQL and DuckDB", + "objectID": "modules/week01/index-01.html#homework", + "href": "modules/week01/index-01.html#homework", + "title": "Week 1 - Relational databases and data modeling", "section": "Homework", - "text": "Homework\nMissing data\nWho worked with whom?\nWho’s the culprit?" - }, - { - "objectID": "modules/week08/case-a.html", - "href": "modules/week08/case-a.html", - "title": "Case Study A: Containing the flames of bias in machine learning", - "section": "", - "text": "Read the scenario and answer questions based on the weekly readings and the lecture:\nWildfires have become increasingly common and destructive in many regions worldwide, causing significant environmental and social problems. In response, many communities have implemented fire prevention and management strategies, including using machine learning (ML) algorithms to predict and mitigate the risk of wildfires.\nOakdale, located in a densely forested area in British Columbia, Canada, has implemented an ML algorithm to predict the risk of wildfires and prioritize fire prevention resources. 
The algorithm uses a variety of inputs, including historical fire data, weather patterns, topography, and vegetation coverage, to generate a risk score for each area of the city. However, after several months of using the algorithm, city officials noticed that specific neighborhoods with low-income and minority populations consistently receive lower risk scores than other areas with very similar environmental conditions. Upon closer examination of those patterns in the data, they realized that the historical data used to train the algorithm was heavily concentrated on more affluent and predominantly white neighborhoods, resulting in a skewed view of the fire risks for the whole city." - }, - { - "objectID": "modules/week08/case-a.html#instructions", - "href": "modules/week08/case-a.html#instructions", - "title": "Case Study A: Containing the flames of bias in machine learning", - "section": "", - "text": "Read the scenario and answer questions based on the weekly readings and the lecture:\nWildfires have become increasingly common and destructive in many regions worldwide, causing significant environmental and social problems. In response, many communities have implemented fire prevention and management strategies, including using machine learning (ML) algorithms to predict and mitigate the risk of wildfires.\nOakdale, located in a densely forested area in British Columbia, Canada, has implemented an ML algorithm to predict the risk of wildfires and prioritize fire prevention resources. The algorithm uses a variety of inputs, including historical fire data, weather patterns, topography, and vegetation coverage, to generate a risk score for each area of the city. However, after several months of using the algorithm, city officials noticed that specific neighborhoods with low-income and minority populations consistently receive lower risk scores than other areas with very similar environmental conditions. Upon closer examination of those patterns in the data, they realized that the historical data used to train the algorithm was heavily concentrated on more affluent and predominantly white neighborhoods, resulting in a skewed view of the fire risks for the whole city." - }, - { - "objectID": "modules/week08/case-a.html#questions", - "href": "modules/week08/case-a.html#questions", - "title": "Case Study A: Containing the flames of bias in machine learning", - "section": "Questions", - "text": "Questions\n\nQuestion 1\nThis case presents an ethical concern primarily associated with what?\n\n\nQuestion 2\nAccording to McGovern et al. (2022), which AI/ML issues can be identified in this case study? Justify your answer.\n\n\nQuestion 3\nSuppose you were hired as a consultant by Oakdale’s city officials. Which of the following recommendations would you give them to prevent perpetuating bias and inequitable outcomes? (Select all that apply)\n\nImplement transparency measures that make the algorithms’ decision-making processes more visible and understandable to stakeholders. This may include clarifying how decisions are made, sharing data sources, and providing access to model outputs. Fully document any limitations and shortcomings of the model and data.\nInvolve diverse stakeholders in the algorithm development and testing, including individuals from communities whose outputs may disproportionately impact. 
This can help identify and address potential biases and ensure that the algorithm is designed with the interests of all community members in mind.\nContinue using the algorithm as the official decision-making source until the re-training is completed. After all, ML methods are more efficient than traditional fire prevention strategies (e.g., fire breaks and vegetation management)." + "text": "Homework\nCreate an ER diagram\nData modeling exercise" }, { - "objectID": "modules/week08/case-d.html", - "href": "modules/week08/case-d.html", - "title": "eds213", + "objectID": "modules/week03/hw-03-2.html", + "href": "modules/week03/hw-03-2.html", + "title": "Week 3 - SQL problem 2", "section": "", - "text": "Read the scenario and answer the questions based on the weekly readings and the lecture:\nTina, a researcher working on coastal vulnerability analysis in Southern California, acquired LiDAR data from a vendor in 2017. Based on the acquired dataset, she submitted a paper to a high-impact academic journal early this year. The paper was accepted but is pending publication until Tina complies with the mandate of sharing supporting data and associated documentation in an open repository. While inspecting the data documentation, Max, the repository data manager, noticed that the files included raw and processed data from a vendor; however, no explicit declaration of authorization to share the data was included in the submission package. Tina presented an invoice of $20,000 USD certifying that she obtained the data and said she was told verbally that the data was not subject to any use restrictions." + "text": "Part 1\nIf we want to know which site has the largest area, it’s tempting to say\nSELECT Site_name, MAX(Area) FROM Site;\nWouldn’t that be great? But DuckDB gives an error. And right it should! This query is conceptually flawed. Please describe what is wrong with this query. Don’t just quote DuckDB’s error message— explain why DuckDB is objecting to performing this query.\nTo help you answer this question, you may want to consider:\n\nTo the database, the above query is no different from\n\nSELECT Site_name, AVG(Area) FROM Site\nSELECT Site_name, COUNT(*) FROM Site\nSELECT Site_name, SUM(Area) FROM Site\n\nIn all these examples, the database sees that it is being asked to apply an aggregate function to a table column.\nWhen performing an aggregation, SQL wants to collapse the requested columns down to a single row. (For a table-level aggregation such as requested above, it wants to collapse the entire table down to a single row. For a GROUP BY, it wants to collapse each group down to a single row.)\n\n\n\nPart 2\nTime for plan B. Find the site name and area of the site having the largest area. Do so by ordering the rows in a particularly convenient order, and using LIMIT to select just the first row. Your result should look like:\n┌──────────────┬────────┐\n│ Site_name │ Area │\n│ varchar │ float │\n├──────────────┼────────┤\n│ Coats Island │ 1239.1 │\n└──────────────┴────────┘\nPlease submit your SQL.\n\n\nPart 3\nDo the same, but use a nested query. First, create a query that finds the maximum area. Then, create a query that selects the site name and area of the site whose area equals the maximum. 
Your overall query will look something like:\nSELECT Site_name, Area FROM Site WHERE Area = (SELECT ...);" }, { - "objectID": "modules/week08/case-d.html#instructions", - "href": "modules/week08/case-d.html#instructions", - "title": "eds213", + "objectID": "modules/week03/index-03.html", + "href": "modules/week03/index-03.html", + "title": "Week 3 - Structured Query Language (SQL) & DuckDB", "section": "", - "text": "Read the scenario and answer the questions based on the weekly readings and the lecture:\nTina, a researcher working on coastal vulnerability analysis in Southern California, acquired LiDAR data from a vendor in 2017. Based on the acquired dataset, she submitted a paper to a high-impact academic journal early this year. The paper was accepted but is pending publication until Tina complies with the mandate of sharing supporting data and associated documentation in an open repository. While inspecting the data documentation, Max, the repository data manager, noticed that the files included raw and processed data from a vendor; however, no explicit declaration of authorization to share the data was included in the submission package. Tina presented an invoice of $20,000 USD certifying that she obtained the data and said she was told verbally that the data was not subject to any use restrictions." - }, - { - "objectID": "modules/week08/case-d.html#questions", - "href": "modules/week08/case-d.html#questions", - "title": "eds213", - "section": "Questions", - "text": "Questions\n\nQuestion 1\nMax should advise Tina to acquire explicit permission from the data vendor to share the data.\n\nTrue\nFalse\n\n\n\nQuestion 2\nBecause Tina paid for the data, Max can move forward with the data publication without infringing on any legal and ethical aspects.\n\nTrue\nFalse\n\n\n\nQuestion 3\nIf Tina does not acquire explicit permission from the vendor to share the data, Max can’t publish the data in the repository.\n\nTrue\nFalse\n\n\n\nQuestion 4\nIf Tina does not acquire written permission to share the data, Max can suggest Tina share only aggregated data.\n\nTrue\nFalse" + "text": "Understand the relationship of SQL to relational databases\nUnderstand how local databases differ from client/server databases\nUnderstand basic SQL syntax and statements\nBe able to answer basic questions about data" }, { - "objectID": "modules/week08/index-08.html", - "href": "modules/week08/index-08.html", - "title": "Week 8 - Ethical and responsible data management", + "objectID": "modules/week03/index-03.html#learning-objectives", + "href": "modules/week03/index-03.html#learning-objectives", + "title": "Week 3 - Structured Query Language (SQL) & DuckDB", "section": "", - "text": "Understand fundamental ethical and responsible data management principles, focusing on the importance of data documentation, preventing bias and harm, properly handling sensitive data, ownership, and licensing issues\nRelate ethical and responsible data management principles to real-world scenarios" + "text": "Understand the relationship of SQL to relational databases\nUnderstand how local databases differ from client/server databases\nUnderstand basic SQL syntax and statements\nBe able to answer basic questions about data" }, { - "objectID": "modules/week08/index-08.html#learning-objectives", - "href": "modules/week08/index-08.html#learning-objectives", - "title": "Week 8 - Ethical and responsible data management", - "section": "", - "text": "Understand fundamental ethical and responsible data management principles, focusing on the 
importance of data documentation, preventing bias and harm, properly handling sensitive data, ownership, and licensing issues\nRelate ethical and responsible data management principles to real-world scenarios" + "objectID": "modules/week03/index-03.html#slides-and-other-materials", + "href": "modules/week03/index-03.html#slides-and-other-materials", + "title": "Week 3 - Structured Query Language (SQL) & DuckDB", + "section": "Slides and other materials", + "text": "Slides and other materials\nslides-03.pptx\nLecture notes:\n\nlecture-notes-03-mon.txt\nlecture-notes-03-wed.txt\nclass-script-03-wed.sql\n\nASDN dataset ER (entity-relationship) diagram\nClass data GitHub repository, week 3" }, { - "objectID": "modules/week08/index-08.html#slides", - "href": "modules/week08/index-08.html#slides", - "title": "Week 8 - Ethical and responsible data management", - "section": "Slides", - "text": "Slides\nslides-08.pptx\nSuggested readings\n\nBoté, J. J., & Térmens, M. (2019). Reusing data: Technical and ethical challenges. DESIDOC Journal of Library & Information Technology, 39(6) http://hdl.handle.net/2445/151341\nMcGovern, A., Ebert-Uphoff, I., Gagne, D., & Bostrom, A. (2022). Why we need to focus on developing ethical, responsible, and trustworthy artificial intelligence approaches for environmental science. Environmental Data Science, 1, E6. https://doi.org/10.1017/eds.2022.5\n\nAdditional suggested readings are noted in the slides." + "objectID": "modules/week03/index-03.html#resources", + "href": "modules/week03/index-03.html#resources", + "title": "Week 3 - Structured Query Language (SQL) & DuckDB", + "section": "Resources", + "text": "Resources\n\nhttp://swcarpentry.github.io/sql-novice-survey/\n\nGood Carpentry lesson, our lesson is drawn from this.\n\nC.J. Date and Hugh Darwen (1993). A Guide to the SQL Standard. 3rd ed. Reading, MA: Addison-Wesley.\nAccess via Library Catalog\n\nThe ANSI standard.\n\nJoe Celko (1995). Joe Celko’s SQL For Smarties: Advanced SQL Programming. San Francisco, CA: Morgan Kaufmann.\nAccess via Library Catalog\n\nThis guy is an SQL guru. Newer versions of this book are available online, check the Library catalog (a bug is preventing me from linking directly).\n\nGrant Allen and Mike Owens (2010). The Definitive Guide to SQLite. 2nd ed. Berkeley, CA: Apress.\nAccess via Library Catalog\n\nGood reference. Can access online!\n\nJeffrey D. Ullman and Jennifer Widom (2008). A First Course in Database Systems. 3rd ed. Upper Saddle River, NJ: Pearson/Prentice Hall.\nAccess via Library Catalog\n\nComplete but theoretical introduction to relational databases, data modeling, and relational algebra." 
}, { - "objectID": "modules/week08/index-08.html#case-based-discussion", - "href": "modules/week08/index-08.html#case-based-discussion", - "title": "Week 8 - Ethical and responsible data management", - "section": "Case-based discussion:", - "text": "Case-based discussion:\n\nCase Study A: Containing the flames of bias in machine learning\nCase Study B: The caveat is the caviar: navigating ethics to protect endangered river wildlife\nCase Study C: To reuse or not reuse, that is the key question!\nCase Study D: Navigating the complexities of ownership zones" + "objectID": "modules/week03/index-03.html#homework", + "href": "modules/week03/index-03.html#homework", + "title": "Week 3 - Structured Query Language (SQL) & DuckDB", + "section": "Homework", + "text": "Homework\nSQL problem 1\nSQL problem 2\nSQL problem 3" }, { - "objectID": "syllabus.html", - "href": "syllabus.html", - "title": "Course syllabus", + "objectID": "modules/week06/index-06.html", + "href": "modules/week06/index-06.html", + "title": "Week 6 - Programming with databases", "section": "", - "text": "This course will teach students the fundamentals of relational databases and data management. Students will learn the principles of database modeling and design and gain practical experience applying SQL (Structured Query Language) to manage and manipulate relational databases. The course also introduces the role and application of data documentation and metadata standards for interoperability and effective data management. By the end of the course, students will be equipped to make informed decisions about managing databases and data ethically and responsibly, focusing on issues such as bias, data privacy, sharing, ownership, and licensing." + "text": "Understand the basic database programming model\nAccess a DuckDB database from Python and R\nUnderstand how to use the Python/Pandas and R/dbplyr convenience functions" }, { - "objectID": "syllabus.html#overview", - "href": "syllabus.html#overview", - "title": "Course syllabus", + "objectID": "modules/week06/index-06.html#learning-objectives", + "href": "modules/week06/index-06.html#learning-objectives", + "title": "Week 6 - Programming with databases", "section": "", - "text": "This course will teach students the fundamentals of relational databases and data management. Students will learn the principles of database modeling and design and gain practical experience applying SQL (Structured Query Language) to manage and manipulate relational databases. The course also introduces the role and application of data documentation and metadata standards for interoperability and effective data management. By the end of the course, students will be equipped to make informed decisions about managing databases and data ethically and responsibly, focusing on issues such as bias, data privacy, sharing, ownership, and licensing." 
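For the Week 3 objectives indexed above (basic SQL syntax and statements; how a local database such as DuckDB differs from a client/server one), here is a minimal sketch assuming only the duckdb Python package. DuckDB runs in-process: connecting to a file path is the whole setup, with no server to install or start. The table name and data are hypothetical.

import duckdb

# A local DuckDB database is a single file; connecting creates it if absent.
conn = duckdb.connect("demo.db")

# Basic statements: create a table, insert rows, query with WHERE and ORDER BY.
conn.execute("CREATE TABLE IF NOT EXISTS birds (species VARCHAR, n INTEGER)")
conn.execute("INSERT INTO birds VALUES ('dunlin', 12), ('ruddy turnstone', 3)")
print(conn.execute(
    "SELECT species, n FROM birds WHERE n > 5 ORDER BY species"
).fetchall())  # [('dunlin', 12)]

conn.close()  # releases the file; there is no server process to shut down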
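The Week 6 "database programming model" indexed in this hunk follows the PEP 249 shape: open a connection, execute a statement with driver-bound parameters, then fetch rows. Binding values through placeholders instead of pasting them into the SQL string is also the point of the "Little Bobby Tables" homework, since quoted-in values can smuggle SQL. A minimal sketch on a hypothetical table:

import duckdb

conn = duckdb.connect()  # in-memory database, for illustration only
conn.execute("CREATE TABLE people (name VARCHAR, site VARCHAR)")

# '?' placeholders are bound by the driver, so a hostile string stays data.
hostile = "Robert'); DROP TABLE people;--"
conn.execute("INSERT INTO people VALUES (?, ?)", [hostile, "site1"])

for name, site in conn.execute("SELECT name, site FROM people").fetchall():
    print(name, site)  # prints the hostile string verbatim; the table survives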
- }, - { - "objectID": "syllabus.html#learning-objectives", - "href": "syllabus.html#learning-objectives", - "title": "Course syllabus", - "section": "Learning objectives", - "text": "Learning objectives\n\nUnderstand the fundamental principles of relational databases and relational data modeling, including table structures, primary and foreign keys, relationships between tables, and data normalization.\nUnderstand how to use the Unix command line and how manage DuckDB databases from the command line.\nUse SQL to retrieve, manipulate, and manage data stored in a relational database.\nDemonstrate proficiency in querying, filtering, sorting, and programmatically accessing and interacting with relational databases from R and Python.\nBecome familiar with advanced database topics such as concurrency, transactions, indexing, backups, and publication.\nUnderstand the role of good data documentation and metadata standards for interoperability, effective data management, and reproducibility.\nOperationalize the FAIR principles into data management practices.\nProduce a metadata record in EML (Ecological Metadata Language) and apply metadata crosswalks to programmatically convert metadata schemas.\nUnderstand the ethics of sensitive data and how to de-identify sensitive data.\nEvaluate ethical and responsible data management practices, including bias, data privacy, sharing, ownership, and licensing issues." - }, - { - "objectID": "syllabus.html#schedule", - "href": "syllabus.html#schedule", - "title": "Course syllabus", - "section": "Schedule", - "text": "Schedule\n\nClass: Monday & Wednesday 9:30-10:45 am (NCEAS)\nDiscussion - session 1: Thur 1-1:50PM, Bren Hall 1510\nDiscussion - session 2: Thur 2-2:50PM, Bren Hall 1510\nOffice hours: Monday 11-12 pm (NCEAS)" + "text": "Understand the basic database programming model\nAccess a DuckDB database from Python and R\nUnderstand how to use the Python/Pandas and R/dbplyr convenience functions" }, { - "objectID": "syllabus.html#modules", - "href": "syllabus.html#modules", - "title": "Course syllabus", - "section": "Modules", - "text": "Modules\n\n\n\nWeek\nTopic/Content\n\n\n\n\n1\nRelational databases and data modeling\n\n\n2\nAnalyzing & cleaning the bird dataset (from csv)\n\n\n3\nIntroduction to SQL part 1 & DuckDB\n\n\n4\nImporting data in the database + SQL part 2\n\n\n5\nAnalyzing the bird database using SQL + bash programming\n\n\n6\nUsing R & python to query databases\n\n\n7\nDocumenting your work: metdata & computing environment\n\n\n8\nSensitive data\n\n\n9\nEthical & responsible data mgnt\n\n\n10\nData licensing and publication\n\n\n\n*Schedule subject to change." + "objectID": "modules/week06/index-06.html#slides-and-other-materials", + "href": "modules/week06/index-06.html#slides-and-other-materials", + "title": "Week 6 - Programming with databases", + "section": "Slides and other materials", + "text": "Slides and other materials\nslides-06.pptx\nCoding transcripts:\n\npython-programming.ipynb\npython-programming-cont.ipynb\nusing_dbplyr-06-wed-r.qmd" }, { - "objectID": "syllabus.html#course-assessment", - "href": "syllabus.html#course-assessment", - "title": "Course syllabus", - "section": "Course assessment", - "text": "Course assessment\nYour performance in this course will depend 90% on weekly homework assignments and 10% on class participation. There will be no graded exercises or homework the last week. We will use Canvas to manage the assignments." 
+ "objectID": "modules/week06/index-06.html#resources", + "href": "modules/week06/index-06.html#resources", + "title": "Week 6 - Programming with databases", + "section": "Resources", + "text": "Resources\n\nhttps://peps.python.org/pep-0249/\n\nCommon Python-RDBMS API\n\nhttps://duckdb.org/docs/api/python/overview\n\nPython DuckDB module\n\nhttps://dbplyr.tidyverse.org/\n\nR dbplyr documentation" }, { - "objectID": "syllabus.html#attendance-and-homework-policy", - "href": "syllabus.html#attendance-and-homework-policy", - "title": "Course syllabus", - "section": "Attendance and homework policy", - "text": "Attendance and homework policy\nAttendance is required. Material will be given in class that is not covered by slides or background readings.\nHomework is expected to be turned in on time. A generous amount of time will be given to complete assignments. Do not wait until the last minute to work on homework in case something unexpected comes up. Homework turned in late will be docked 20% per day." + "objectID": "modules/week06/index-06.html#homework", + "href": "modules/week06/index-06.html#homework", + "title": "Week 6 - Programming with databases", + "section": "Homework", + "text": "Homework\nLittle Bobby Tables\nCharacterizing egg variation\nWho were the winners?" }, { - "objectID": "syllabus.html#code-of-conduct", - "href": "syllabus.html#code-of-conduct", - "title": "Course syllabus", - "section": "Code of conduct", - "text": "Code of conduct\nAll students are expected to read and comply with the UCSB Student Conduct Code. We expect cooperation from all members to help ensure a welcoming and inclusive environment for everybody. We are determined to make our courses welcoming, inclusive and harassment-free for everyone regardless of gender, gender identity and expression, race, age, sexual orientation, disability, physical appearance, or religion (or lack thereof). We do not tolerate harassment of class participants, teaching assistants, or instructors in any form. Derogatory, abusive, or demeaning language or imagery will not be tolerated." + "objectID": "modules/week06/hw-06-2.html", + "href": "modules/week06/hw-06-2.html", + "title": "Week 6 - Characterizing egg variation", + "section": "", + "text": "You read Egg Dimensions and Neonatal Mass of Shorebirds by Robert E. Ricklefs and want to see how the egg data we’ve been using in class compares to his results. Specifically, Ricklefs reported, “Coefficients of variation were 4 to 9% for egg volume” for shorebird eggs gathered in Manitoba, Canada. What is the range of coefficients of variation in our ASDN dataset?\nThe “coefficient of variation,” or CV, is a unitless measure of the variation of a sample, defined as the standard deviation divided by the mean and multiplied by 100 to express as a percentage. Thus, a CV of 10% means the standard deviation is 10% of the mean value. For the purposes of this computation, we will copy Ricklefs and use as a proxy for egg volume the formula\n\\[ W^2 L \\]\nwhere \\(W\\) is egg width and \\(L\\) is egg length.\nYour task is to create a Python program that reads data from the ASDN database and uses Pandas to compute, for each species in the database (for which there is egg data), the coefficient of variation of volume using the above formula. There are many ways this can be done. Because this assignment is primarily about programming in Python, please follow the steps below. 
Please submit your Python code when done.\n\nStep 1\nCreate a query that will return the distinct species for which there is egg data (not all species and not all nests have egg data), so that you can then loop over those species. Your query should return two columns, species code and scientific name. Please order the results in alphabetic order of scientific name.\n\n\nStep 2\nAfter you’ve connected to the database and created a cursor c, iterate over the species like so:\nspecies_query = \"\"\"SELECT Code, Scientific_name FROM...\"\"\"\nfor row in c.execute(species_query).fetchall(): # DuckDB lame-o workaround\n species_code = row[0]\n scientific_name = row[1]\n # query egg data for that species (step 3)\n # compute statistics and print results (step 4)\n\n\nStep 3\nYou will need to construct a query that gathers egg data for a given species, one species at a time; the species code will be a parameter to that query. You can compute the formula\n\\[ W^2 L \\]\nin SQL or in Pandas. For simplicity, SQL is suggested:\negg_query = \"\"\"SELECT Width*Width*Length AS Volume FROM...\"\"\"\nWithin the loop, you will want to execute the query on the current species in the loop iteration. You may use the Pandas function pd.read_sql to do so and also directly load the results into a dataframe:\ndf = pd.read_sql(egg_query, conn, ...)\nDo a help(pd.read_sql) to figure out how to pass parameters to a query.\nYou may get a bunch of warnings from Pandas about how it “only supports SQLAlchemy…”. Just ignore them. (Sorry about that.)\n\n\nStep 4\nFinally, and still within your loop, you’ll want to compute statistics and print out the results:\ncv = round(df.Volume.std()/df.Volume.mean()*100, 2)\nprint(f\"{scientific_name} {cv}%\")\nYour output should look like this:\nArenaria interpres 21.12%\nCalidris alpina 5.46%\nCalidris fuscicollis 16.77%\nCharadrius semipalmatus 8.99%\nPhalaropus fulicarius 4.65%\nPluvialis dominica 19.88%\nPluvialis squatarola 6.94%\n\n\nAppendix\nIt’s not necessary to use pd.read_sql to get data into a dataframe, it’s just a convenience. To do so manually (and to show you it’s not that hard), imagine that your query returns three columns. Collect the row data into three separate lists, then manually create a dataframe specifying the contents as a dictionary:\nrows = c.execute(\"SELECT Species, Width, Length FROM...\").fetchall()\nspecies_column = [t[0] for t in rows]\nwidth_column = [t[1] for t in rows]\nlength_column = [t[2] for t in rows]\n\ndf = pd.DataFrame(\n {\n \"species\": species_column,\n \"width\": width_column,\n \"length\": length_column\n }\n)" }, { - "objectID": "syllabus.html#student-support", - "href": "syllabus.html#student-support", - "title": "Course syllabus", - "section": "Student support", - "text": "Student support\nWe understand that ongoing crises impact students differently based on experiences, identities, living situations and resources, family responsibilities, and unforeseen challenges. We encourage you to prioritize your well-being. We are here to help you reach your learning and career goals. You are always welcome to reach out to our teaching team so that we can best support you. Please see the UCSB Campus Resource Guide for campus student support and services." 
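For Step 3 above, the detail that help(pd.read_sql) surfaces is the params argument: given a DB-API connection, pandas forwards the list to the driver, and DuckDB's placeholder style is ?. Below is a sketch of just those mechanics on a tiny made-up stand-in table (hypothetical names, invented numbers), so it does not reproduce the assignment's actual query:

import duckdb
import pandas as pd

# Hypothetical stand-in for egg data; values are invented for illustration.
conn = duckdb.connect()
conn.execute("CREATE TABLE eggs (Species VARCHAR, Width DOUBLE, Length DOUBLE)")
conn.execute("INSERT INTO eggs VALUES ('sp1', 2.0, 3.0), ('sp1', 2.1, 3.1), ('sp2', 1.5, 2.4)")

# One parameterized query per species: '?' is the DuckDB placeholder and
# pandas passes params through to the driver (expect the SQLAlchemy warning).
egg_query = "SELECT Width*Width*Length AS Volume FROM eggs WHERE Species = ?"
df = pd.read_sql(egg_query, conn, params=["sp1"])

# Step 4's statistic: coefficient of variation = std/mean, as a percentage.
cv = round(df.Volume.std() / df.Volume.mean() * 100, 2)
print(f"sp1 {cv}%")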
+ "objectID": "modules/week06/hw-06-3.html", + "href": "modules/week06/hw-06-3.html", + "title": "Week 6 - Who were the winners?", + "section": "", + "text": "At the conclusion of the ASDN project the PIs decided to hand out first, second, and third prizes to the observers who measured the most eggs. Who won? Please use R and dbplyr to answer this question, and please submit your R code. Your code should print out:\n# Ordered by: desc(total_eggs)\n Name total_eggs\n <chr> <int>\n1 Vanessa Loverti 163\n2 Dylan Kessler 87\n3 Richard Lanctot 50\nYou’ll want to load database tables using statements such as:\negg_table <- tbl(conn, \"Bird_eggs\")\nand then use tidyverse grouping, summarization, joining, and other functions to compute the desired result.\nAlso, take your final expression and pipe it into show_query(). If you used multiple R statements, did dbplyr create a temporary table, or did it manage to do everything in one query? Did it limit to the first three rows using an R expression or an SQL LIMIT clause?" }, { - "objectID": "syllabus.html#disabled-students-program", - "href": "syllabus.html#disabled-students-program", - "title": "Course syllabus", - "section": "Disabled students program", - "text": "Disabled students program\nStudents with disabilities and/or alternative learning needs are encouraged to work with the Disabled Students Program at UCSB to ensure we can best support your learning and success." + "objectID": "modules/week04/hw-04-1.html", + "href": "modules/week04/hw-04-1.html", + "title": "Week 4 - Missing data", + "section": "", + "text": "Which sites have no egg data? Please answer this question using all three techniques demonstrated in class. In doing so, you will need to work with the Bird_eggs table, the Site table, or both. As a reminder, the techniques are:\n\nUsing a Code NOT IN (subquery) clause.\nUsing an outer join with a WHERE clause that selects the desired rows. Caution: make sure your IS NULL test is performed against a column that is not ordinarily allowed to be NULL. You may want to consult the database schema to remind yourself of column declarations.\nUsing the set operation EXCEPT.\n\nAdd an ORDER BY clause to your queries so that all three produce the exact same result:\n┌─────────┐\n│ Code │\n│ varchar │\n├─────────┤\n│ barr │\n│ burn │\n│ bylo │\n│ cakr │\n│ cari │\n│ chau │\n│ coat │\n│ colv │\n│ iglo │\n│ ikpi │\n│ lkri │\n│ made │\n│ nome │\n│ prba │\n├─────────┤\n│ 14 rows │\n└─────────┘\nSubmit your SQL." }, { - "objectID": "resources.html", - "href": "resources.html", - "title": "Resources", + "objectID": "modules/week04/hw-04-2.html", + "href": "modules/week04/hw-04-2.html", + "title": "Week 4 - Who worked with whom?", "section": "", - "text": "Burnette. (2022). Managing environmental data: principles, techniques, and best practices. CRC Press. https://search.library.ucsb.edu/permalink/01UCSB_INST/1aqck9j/alma9917095897506531\nChapman, AD, & Grafton, O. (2008). Guide to Best Practices for Generalising Sensitive Species-Occurrence Data (v1.0). https://doi.org/10.15468/doc-b02j-gt10\nCrystal-Ornelas, R., Varadharajan, C., O’Ryan, D., Ramírez-Muñoz, J., Jones, M. B., Lehnert, K. A., … & Servilla, M. (2022). Enabling FAIR data in Earth and environmental science with community-centric (meta)data reporting formats. Scientific Data, 9(1), 700. https://doi.org/10.1038/s41597-022-01606-w\nJones, M. B., O’Brien, M., Mecum, B., Boettiger, C., Schildhauer, M., Maier, M., Whiteaker, T., Earl, S., & Chong, S. (2019). 
Ecological Metadata Language version 2.2.0. KNB Data Repository. https://doi.org/10.5063/F11834T2\nLabastida, I., & Margoni, T. (2020). Licensing FAIR Data for Reuse. Data Intelligence, 2(1-2), 199-207. https://doi.org/10.1162/dint_a_00042\nMcGovern, A., Ebert-Uphoff, I., Gagne, D., & Bostrom, A. (2022). Why we need to focus on developing ethical, responsible, and trustworthy artificial intelligence approaches for environmental science. Environmental Data Science, 1, E6. https://doi.org/10.1017/eds.2022.5\nRecknagel, F., & Michener, W. K. (Eds.). (2018). Ecological informatics: Data management and knowledge discovery (3rd ed.). Springer." + "text": "The Camp_assignment table lists where each person worked and when. Your goal is to answer, Who worked with whom? That is, you are to find all pairs of people who worked at the same site, and whose date ranges overlap while at that site. This can be solved using a self-join.\nA self-join of a table is a regular join, but instead of joining two different tables, we join two copies of the same table, which we will call the “A” copy and the “B” copy:\nThe idea is that the above join will give us rows that pair every person/site/date range with every other person/site/date range. With no conditions on the join, since there are \\(441\\) rows in the Camp_assignment table, the join will produce \\(441^2 = 194481\\) rows. But out of all those rows we want only those where the two people worked at the same site. So:\nSubmit your final SQL query." }, { - "objectID": "resources.html#bibliography", - "href": "resources.html#bibliography", - "title": "Resources", - "section": "", - "text": "Burnette. (2022). Managing environmental data: principles, techniques, and best practices. CRC Press. https://search.library.ucsb.edu/permalink/01UCSB_INST/1aqck9j/alma9917095897506531\nChapman, AD, & Grafton, O. (2008). Guide to Best Practices for Generalising Sensitive Species-Occurrence Data (v1.0). https://doi.org/10.15468/doc-b02j-gt10\nCrystal-Ornelas, R., Varadharajan, C., O’Ryan, D., Ramírez-Muñoz, J., Jones, M. B., Lehnert, K. A., … & Servilla, M. (2022). Enabling FAIR data in Earth and environmental science with community-centric (meta)data reporting formats. Scientific Data, 9(1), 700. https://doi.org/10.1038/s41597-022-01606-w\nJones, M. B., O’Brien, M., Mecum, B., Boettiger, C., Schildhauer, M., Maier, M., Whiteaker, T., Earl, S., & Chong, S. (2019). Ecological Metadata Language version 2.2.0. KNB Data Repository. https://doi.org/10.5063/F11834T2\nLabastida, I., & Margoni, T. (2020). Licensing FAIR Data for Reuse. Data Intelligence, 2(1-2), 199-207. https://doi.org/10.1162/dint_a_00042\nMcGovern, A., Ebert-Uphoff, I., Gagne, D., & Bostrom, A. (2022). Why we need to focus on developing ethical, responsible, and trustworthy artificial intelligence approaches for environmental science. Environmental Data Science, 1, E6. https://doi.org/10.1017/eds.2022.5\nRecknagel, F., & Michener, W. K. (Eds.). (2018). Ecological informatics: Data management and knowledge discovery (3rd ed.). Springer." 
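The "Missing data" homework indexed nearby asks for the same rows three ways: NOT IN with a subquery, an outer join filtered with IS NULL, and the set operation EXCEPT. The sketch below shows the three shapes on a made-up parent/child pair of tables rather than the ASDN Site and Bird_eggs tables, so the patterns are visible without giving away the assignment's query:

import duckdb

conn = duckdb.connect()
conn.execute("CREATE TABLE parent (code VARCHAR)")
conn.execute("CREATE TABLE child (code VARCHAR, val INTEGER)")
conn.execute("INSERT INTO parent VALUES ('a'), ('b'), ('c')")
conn.execute("INSERT INTO child VALUES ('a', 1)")

queries = {
    # 1. Subquery: keep parent codes that never appear in child.
    "NOT IN": "SELECT code FROM parent WHERE code NOT IN (SELECT code FROM child) ORDER BY code",
    # 2. Outer join: unmatched parents get NULLs in child's columns, so test
    #    a child column that is otherwise never NULL (here child.val).
    "outer join": "SELECT parent.code FROM parent LEFT JOIN child ON parent.code = child.code WHERE child.val IS NULL ORDER BY parent.code",
    # 3. Set operation: all parent codes minus the child codes.
    "EXCEPT": "SELECT code FROM parent EXCEPT SELECT code FROM child ORDER BY code",
}
for label, q in queries.items():
    print(label, conn.execute(q).fetchall())  # each prints [('b',), ('c',)]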
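Likewise for the self-join homework: joining two aliased copies of one table pairs every row with every other row, and the join conditions then keep only pairs at the same site with overlapping date ranges. A generic sketch under the same caveat (hypothetical table and data, not the assignment's answer); the a.person < b.person test reports each pair once and never pairs a row with itself:

import duckdb

conn = duckdb.connect()
conn.execute("CREATE TABLE camp (person VARCHAR, site VARCHAR, d1 DATE, d2 DATE)")
conn.execute("""INSERT INTO camp VALUES
    ('p1', 's1', DATE '2020-06-01', DATE '2020-06-10'),
    ('p2', 's1', DATE '2020-06-05', DATE '2020-06-20'),
    ('p3', 's2', DATE '2020-06-01', DATE '2020-06-10')""")

# Self-join: copies "a" and "b" of the same table; two date ranges overlap
# exactly when each one starts no later than the other ends.
pairs = conn.execute("""
    SELECT a.site, a.person, b.person
    FROM camp AS a JOIN camp AS b
      ON a.site = b.site
     AND a.person < b.person
     AND a.d1 <= b.d2 AND b.d1 <= a.d2
""").fetchall()
print(pairs)  # [('s1', 'p1', 'p2')]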
+ "objectID": "modules/week04/hw-04-2.html#bonus-problem", + "href": "modules/week04/hw-04-2.html#bonus-problem", + "title": "Week 4 - Who worked with whom?", + "section": "Bonus problem!", + "text": "Bonus problem!\nProduce this much nicer table by joining with the Personnel table:\n┌─────────┬─────────────────────┬───────────────────┐\n│ Site │ Name_1 │ Name_2 │\n│ varchar │ varchar │ varchar │\n├─────────┼─────────────────────┼───────────────────┤\n│ lkri │ Anastasia Popovkina │ Gleb Sedash │\n│ lkri │ Anastasia Popovkina │ Julya Loshchagina │\n└─────────┴─────────────────────┴───────────────────┘\nYou’ll need to join with the Personnel table twice, once for each observer column. You may need give abbreviations to tables (e.g., JOIN Personnel AS p1) to distinguish the tables and columns. You can do it!" }, { - "objectID": "about.html", - "href": "about.html", - "title": "About RDS", + "objectID": "modules/week08/case-b.html", + "href": "modules/week08/case-b.html", + "title": "Case Study B: The caveat of the caviar: navigating ethics to protect endangered river wildlife", "section": "", - "text": "Research Data Services (RDS) helps UCSB researchers manage and preserve their research data through:\n\nConsultations\nLong-term engagements\nInstructional workshops\n\nOur team offers support across the research data lifecycle, from pre-project planning to post-project archival, connecting researchers with both locally- and externally-provided curation services. Our goal is to ensure that all research data is well-described, FAIR (Findable, Accessible, Interoperable, Reusable), and sustainably preservable, and that researchers receive scholarly credit for sharing and publishing data.\nContact us if you have any questions: rds@library.ucsb.edu" + "text": "Read the scenario and answer questions based on the weekly readings and the lecture:\nA team of environmental scientists from the University of Wisconsin is researching the impact of climate change on the Mississippi River wildlife. The research involves collecting data on various environmental factors, including water temperature, salinity, pH, and nutrient levels, as well as data on the abundance and diversity of aquatic species on the lower basins of the river.\nThe team collects the data using various methods, including river sensors, underwater cameras, and traditional sampling techniques. The data is stored on a cloud-based platform, allowing sharing and real-time collaboration with multiple partners. However, the researchers soon realized that some of the collected data might be sensitive, including information on the distribution and abundance of the pallid sturgeon, recently listed by the Wisconsin Department of Natural Resources as one of the endangered species. They also realized that some data points might be of commercial interest, which can threaten the protection of the species and its habitat. For example, exposing the exact locations of the pallid sturgeon and their critical habitats could lead to increased poaching or unauthorized fishing, as the species is highly valued for its caviar.\nResearchers face ethical considerations related to sensitive data sharing. They want to comply with best practices in open science. However, in this process, they must balance the need to share their data openly to advance research and understanding of climate change and the Mississippi River ecosystem while protecting sensitive data and preventing it from being used for commercial purposes that could harm the environment." 
}, { - "objectID": "installing-duckdb.html", - "href": "installing-duckdb.html", - "title": "Installing duckDB", + "objectID": "modules/week08/case-b.html#instructions", + "href": "modules/week08/case-b.html#instructions", + "title": "Case Study B: The caveat of the caviar: navigating ethics to protect endangered river wildlife", "section": "", - "text": "DuckDB has been installed on the MEDS server. We also recommend to install it on your personal machine following those instructions: https://duckdb.org/docs/installation/" + "text": "Read the scenario and answer questions based on the weekly readings and the lecture:\nA team of environmental scientists from the University of Wisconsin is researching the impact of climate change on the Mississippi River wildlife. The research involves collecting data on various environmental factors, including water temperature, salinity, pH, and nutrient levels, as well as data on the abundance and diversity of aquatic species on the lower basins of the river.\nThe team collects the data using various methods, including river sensors, underwater cameras, and traditional sampling techniques. The data is stored on a cloud-based platform, allowing sharing and real-time collaboration with multiple partners. However, the researchers soon realized that some of the collected data might be sensitive, including information on the distribution and abundance of the pallid sturgeon, recently listed by the Wisconsin Department of Natural Resources as one of the endangered species. They also realized that some data points might be of commercial interest, which can threaten the protection of the species and its habitat. For example, exposing the exact locations of the pallid sturgeon and their critical habitats could lead to increased poaching or unauthorized fishing, as the species is highly valued for its caviar.\nResearchers face ethical considerations related to sensitive data sharing. They want to comply with best practices in open science. However, in this process, they must balance the need to share their data openly to advance research and understanding of climate change and the Mississippi River ecosystem while protecting sensitive data and preventing it from being used for commercial purposes that could harm the environment." }, { - "objectID": "installing-duckdb.html#installation", - "href": "installing-duckdb.html#installation", - "title": "Installing duckDB", + "objectID": "modules/week08/case-b.html#questions", + "href": "modules/week08/case-b.html#questions", + "title": "Case Study B: The caveat of the caviar: navigating ethics to protect endangered river wildlife", + "section": "Questions", + "text": "Questions\n\nQuestion 1\nAs a data manager, what recommendations would you offer to the researchers to avoid commercial exploitation of the pallid sturgeon while contributing to advancing the research in the field? You do not need to write a long essay; elaborating your advice in bullet points is enough.\n\n\nQuestion 2\nAs a general rule, the _____________ of endangered, sensitive species should not be shared publicly. (hint: 1-2 words)" + }, + { + "objectID": "modules/week08/case-c.html", + "href": "modules/week08/case-c.html", + "title": "Case Study C: To reuse or not reuse, that is the key question!", "section": "", - "text": "DuckDB has been installed on the MEDS server. 
We also recommend to install it on your personal machine following those instructions: https://duckdb.org/docs/installation/" + "text": "Read the scenario and answer the questions based on the weekly readings and the lecture:\nAdam is a researcher for a non-profit organization dedicated to accelerating the adoption of solar energy. The organization relies on data from various sources, including sensors, satellite imagery, and field measurements, to inform solar energy allocation, usage, and conservation decisions.\nRecently, he identified an available dataset containing data on the solar energy market size, including trends, competition, and customer demand. These data can inform business and policy decisions related to solar energy adoption. Adam is particularly excited because this is a multivariate time series dataset from the past ten years. Also, the data documentation listed many important variables for his project, including the compound annual growth rates (CAGR) for solar energy companies. However, when Adam inspected some of the data files, he noticed a few data points that needed to be corrected. For example, some rows had NAs; others had 000, 999, and -999 or were utterly blank; the documentation does not help him infer those values.\nWhen he contacted the corresponding researcher for clarification, he was told these inconsistencies could have been caused either due to system migration or by human error in inaccurate data entry. The researcher mentioned that his team had multiple contributors throughout the years and noted there were no enforced validation rules or data quality checks. Ultimately, Adam should choose a solution that balances the benefits of using the existing dataset with the potential risks of using incomplete or inaccurate data.\nAdam faces a dilemma. On the one hand, the dataset could provide valuable insights into the solar energy market and inform better policies and management decisions. On the other hand, the missing and anomalous data could affect the dataset’s overall quality and integrity, potentially leading to incorrect conclusions and decisions." }, { - "objectID": "installing-duckdb.html#visual-code-integration-optional", - "href": "installing-duckdb.html#visual-code-integration-optional", - "title": "Installing duckDB", - "section": "Visual code integration (optional)", - "text": "Visual code integration (optional)\nYou can use DuckDB directly from the terminal but if you also want to have the option to have both a SQL script and the terminal open, we recommend to use visual code. There is one setting to be done to link the script and the terminal:\nIn Visual Code:\n\nopen the palette Shift+Cmd+P and search for: Preferences: Open Keyboard Shortcuts (JSON) \nenter the following text & save\n\n// Place your key bindings in this file to override the defaults\n[\n {\n \"key\": \"shift+enter\",\n \"command\": \"workbench.action.terminal.runSelectedText\"\n }\n]\nOR alternative method to set this up:\nUnder the main VS menu, go to Settings -> Keyboard Shortcuts, search for “run selected”, the “run selected text in active terminal” will be one of the options. Set the shift-enter keypress by double-clicking. Right-click on the When column, select Change When Expression, and add editorLangId == 'sql' to restrict to just SQL files and not every kind of file.\nNow you can hit Shift+Return at then end of a line in your SQL script and it should run the command directly in the terminal!" 
+ "objectID": "modules/week08/case-c.html#instructions", + "href": "modules/week08/case-c.html#instructions", + "title": "Case Study C: To reuse or not reuse, that is the key question!", + "section": "", + "text": "Read the scenario and answer the questions based on the weekly readings and the lecture:\nAdam is a researcher for a non-profit organization dedicated to accelerating the adoption of solar energy. The organization relies on data from various sources, including sensors, satellite imagery, and field measurements, to inform solar energy allocation, usage, and conservation decisions.\nRecently, he identified an available dataset containing data on the solar energy market size, including trends, competition, and customer demand. These data can inform business and policy decisions related to solar energy adoption. Adam is particularly excited because this is a multivariate time series dataset from the past ten years. Also, the data documentation listed many important variables for his project, including the compound annual growth rates (CAGR) for solar energy companies. However, when Adam inspected some of the data files, he noticed a few data points that needed to be corrected. For example, some rows had NAs; others had 000, 999, and -999 or were utterly blank; the documentation does not help him infer those values.\nWhen he contacted the corresponding researcher for clarification, he was told these inconsistencies could have been caused either due to system migration or by human error in inaccurate data entry. The researcher mentioned that his team had multiple contributors throughout the years and noted there were no enforced validation rules or data quality checks. Ultimately, Adam should choose a solution that balances the benefits of using the existing dataset with the potential risks of using incomplete or inaccurate data.\nAdam faces a dilemma. On the one hand, the dataset could provide valuable insights into the solar energy market and inform better policies and management decisions. On the other hand, the missing and anomalous data could affect the dataset’s overall quality and integrity, potentially leading to incorrect conclusions and decisions." }, { - "objectID": "installing-duckdb.html#test", - "href": "installing-duckdb.html#test", - "title": "Installing duckDB", - "section": "Test", - "text": "Test\nLet’s test our new installation using visual code\n\nopen a terminal from the terminal menu -> new terminal\nIn the terminal type duckdb, this should start duckdb\nOpen a New text file: file menu -> new text file\nCopy paste the following code:\n\n-- Start the DB at the terminal: duckdb\n\nCREATE TABLE ducks AS SELECT 3 As age, 'mandarin' AS breed;\n\nSHOW TABLES;\n\nFROM ducks SELECT *;\n\nUse Shift+Return to run the SQL code line by line\n\nYou should have something that looks like this:" + "objectID": "modules/week08/case-c.html#questions", + "href": "modules/week08/case-c.html#questions", + "title": "Case Study C: To reuse or not reuse, that is the key question!", + "section": "Questions", + "text": "Questions\n\nQuestion 1\nSuppose Adam is leaning toward reusing the dataset despite the identified problems. What general ethical and responsible steps would you advise him to take moving forward? 
(Select all that apply)\n\nAdam should carefully examine the dataset to map all existing issues to determine the extent of missing and anomalous data and how it could affect the accuracy of his analysis.\nIf Adam collects new data, he should enforce data validation rules to tables and perform quality checks to avoid similar problems.\nIf Adam integrates new data and perform all necessarily adjustments to remove uncertainties and other problems in the data, he no longer needs to attribute original data creators.\nIf the dataset has significant missing or anomalous data, Adam may need to collect additional data to ensure the analysis is accurate and representative. This may involve additional time and resources but could lead to more accurate and reliable insights.\nAdam may be able to proceed with the analysis after carefully documenting and accounting for any uncertainties or limitations in the data.\nAdam should produce new documentation for the dataset based on all improvements he makes to the original data." }, { "objectID": "index.html", @@ -609,242 +497,347 @@ "text": "Modules\n\n\n\nWeek\nTopic/Content\n\n\n\n\n1\nRelational databases and data modeling\n\n\n2\nAnalyzing & cleaning the bird dataset (from csv)\n\n\n3\nIntroduction to SQL (part 1) & DuckDB\n\n\n4\nSQL part 2 + Analyzing the bird database using SQL\n\n\n5\nI/O & data management + Advanced database topics (indexing, triggers, …)\n\n\n6\nUsing R & Python to query databases + bash programming\n\n\n7\nDocumenting your work: metadata & computing environment\n\n\n8\nEthical & responsible data mgnt\n\n\n9\nSensitive data\n\n\n10\nData licensing and publication" }, { - "objectID": "modules/week08/case-c.html", - "href": "modules/week08/case-c.html", - "title": "Case Study C: To reuse or not reuse, that is the key question!", + "objectID": "installing-duckdb.html", + "href": "installing-duckdb.html", + "title": "Installing duckDB", "section": "", - "text": "Read the scenario and answer the questions based on the weekly readings and the lecture:\nAdam is a researcher for a non-profit organization dedicated to accelerating the adoption of solar energy. The organization relies on data from various sources, including sensors, satellite imagery, and field measurements, to inform solar energy allocation, usage, and conservation decisions.\nRecently, he identified an available dataset containing data on the solar energy market size, including trends, competition, and customer demand. These data can inform business and policy decisions related to solar energy adoption. Adam is particularly excited because this is a multivariate time series dataset from the past ten years. Also, the data documentation listed many important variables for his project, including the compound annual growth rates (CAGR) for solar energy companies. However, when Adam inspected some of the data files, he noticed a few data points that needed to be corrected. For example, some rows had NAs; others had 000, 999, and -999 or were utterly blank; the documentation does not help him infer those values.\nWhen he contacted the corresponding researcher for clarification, he was told these inconsistencies could have been caused either due to system migration or by human error in inaccurate data entry. The researcher mentioned that his team had multiple contributors throughout the years and noted there were no enforced validation rules or data quality checks. 
Ultimately, Adam should choose a solution that balances the benefits of using the existing dataset with the potential risks of using incomplete or inaccurate data.\nAdam faces a dilemma. On the one hand, the dataset could provide valuable insights into the solar energy market and inform better policies and management decisions. On the other hand, the missing and anomalous data could affect the dataset's overall quality and integrity, potentially leading to incorrect conclusions and decisions." }, { "objectID": "modules/week08/case-c.html#instructions", "href": "modules/week08/case-c.html#instructions", "title": "Case Study C: To reuse or not reuse, that is the key question!", "section": "", "text": "Read the scenario and answer the questions based on the weekly readings and the lecture:\nAdam is a researcher for a non-profit organization dedicated to accelerating the adoption of solar energy. The organization relies on data from various sources, including sensors, satellite imagery, and field measurements, to inform solar energy allocation, usage, and conservation decisions.\nRecently, he identified an available dataset containing data on the solar energy market size, including trends, competition, and customer demand. These data can inform business and policy decisions related to solar energy adoption. Adam is particularly excited because this is a multivariate time series dataset from the past ten years. Also, the data documentation listed many important variables for his project, including the compound annual growth rates (CAGR) for solar energy companies. However, when Adam inspected some of the data files, he noticed a few data points that needed to be corrected. For example, some rows had NAs; others had 000, 999, and -999 or were utterly blank; the documentation did not help him infer those values.\nWhen he contacted the corresponding researcher for clarification, he was told these inconsistencies could have been caused either by system migration or by human error during data entry. The researcher mentioned that his team had multiple contributors throughout the years and noted there were no enforced validation rules or data quality checks. Ultimately, Adam should choose a solution that balances the benefits of using the existing dataset with the potential risks of using incomplete or inaccurate data.\nAdam faces a dilemma. On the one hand, the dataset could provide valuable insights into the solar energy market and inform better policies and management decisions. On the other hand, the missing and anomalous data could affect the dataset's overall quality and integrity, potentially leading to incorrect conclusions and decisions." }, { "objectID": "modules/week08/case-c.html#questions", "href": "modules/week08/case-c.html#questions", "title": "Case Study C: To reuse or not reuse, that is the key question!", "section": "Questions", "text": "Questions\n\nQuestion 1\nSuppose Adam is leaning toward reusing the dataset despite the identified problems. 
What general ethical and responsible steps would you advise him to take moving forward? (Select all that apply)\n\nAdam should carefully examine the dataset to map all existing issues to determine the extent of missing and anomalous data and how it could affect the accuracy of his analysis.\nIf Adam collects new data, he should enforce data validation rules to tables and perform quality checks to avoid similar problems.\nIf Adam integrates new data and perform all necessarily adjustments to remove uncertainties and other problems in the data, he no longer needs to attribute original data creators.\nIf the dataset has significant missing or anomalous data, Adam may need to collect additional data to ensure the analysis is accurate and representative. This may involve additional time and resources but could lead to more accurate and reliable insights.\nAdam may be able to proceed with the analysis after carefully documenting and accounting for any uncertainties or limitations in the data.\nAdam should produce new documentation for the dataset based on all improvements he makes to the original data." + "objectID": "installing-duckdb.html#visual-code-integration-optional", + "href": "installing-duckdb.html#visual-code-integration-optional", + "title": "Installing duckDB", + "section": "Visual code integration (optional)", + "text": "Visual code integration (optional)\nYou can use DuckDB directly from the terminal but if you also want to have the option to have both a SQL script and the terminal open, we recommend to use visual code. There is one setting to be done to link the script and the terminal:\nIn Visual Code:\n\nopen the palette Shift+Cmd+P and search for: Preferences: Open Keyboard Shortcuts (JSON) \nenter the following text & save\n\n// Place your key bindings in this file to override the defaults\n[\n {\n \"key\": \"shift+enter\",\n \"command\": \"workbench.action.terminal.runSelectedText\"\n }\n]\nOR alternative method to set this up:\nUnder the main VS menu, go to Settings -> Keyboard Shortcuts, search for “run selected”, the “run selected text in active terminal” will be one of the options. Set the shift-enter keypress by double-clicking. Right-click on the When column, select Change When Expression, and add editorLangId == 'sql' to restrict to just SQL files and not every kind of file.\nNow you can hit Shift+Return at then end of a line in your SQL script and it should run the command directly in the terminal!" 
}, { - "objectID": "modules/week08/case-b.html", - "href": "modules/week08/case-b.html", - "title": "Case Study B: The caveat of the caviar: navigating ethics to protect endangered river wildlife", + "objectID": "installing-duckdb.html#test", + "href": "installing-duckdb.html#test", + "title": "Installing duckDB", + "section": "Test", + "text": "Test\nLet’s test our new installation using visual code\n\nopen a terminal from the terminal menu -> new terminal\nIn the terminal type duckdb, this should start duckdb\nOpen a New text file: file menu -> new text file\nCopy paste the following code:\n\n-- Start the DB at the terminal: duckdb\n\nCREATE TABLE ducks AS SELECT 3 As age, 'mandarin' AS breed;\n\nSHOW TABLES;\n\nFROM ducks SELECT *;\n\nUse Shift+Return to run the SQL code line by line\n\nYou should have something that looks like this:" + }, + { + "objectID": "about.html", + "href": "about.html", + "title": "About RDS", "section": "", - "text": "Read the scenario and answer questions based on the weekly readings and the lecture:\nA team of environmental scientists from the University of Wisconsin is researching the impact of climate change on the Mississippi River wildlife. The research involves collecting data on various environmental factors, including water temperature, salinity, pH, and nutrient levels, as well as data on the abundance and diversity of aquatic species on the lower basins of the river.\nThe team collects the data using various methods, including river sensors, underwater cameras, and traditional sampling techniques. The data is stored on a cloud-based platform, allowing sharing and real-time collaboration with multiple partners. However, the researchers soon realized that some of the collected data might be sensitive, including information on the distribution and abundance of the pallid sturgeon, recently listed by the Wisconsin Department of Natural Resources as one of the endangered species. They also realized that some data points might be of commercial interest, which can threaten the protection of the species and its habitat. For example, exposing the exact locations of the pallid sturgeon and their critical habitats could lead to increased poaching or unauthorized fishing, as the species is highly valued for its caviar.\nResearchers face ethical considerations related to sensitive data sharing. They want to comply with best practices in open science. However, in this process, they must balance the need to share their data openly to advance research and understanding of climate change and the Mississippi River ecosystem while protecting sensitive data and preventing it from being used for commercial purposes that could harm the environment." + "text": "Research Data Services (RDS) helps UCSB researchers manage and preserve their research data through:\n\nConsultations\nLong-term engagements\nInstructional workshops\n\nOur team offers support across the research data lifecycle, from pre-project planning to post-project archival, connecting researchers with both locally- and externally-provided curation services. 
Our goal is to ensure that all research data is well-described, FAIR (Findable, Accessible, Interoperable, Reusable), and sustainably preservable, and that researchers receive scholarly credit for sharing and publishing data.\nContact us if you have any questions: rds@library.ucsb.edu" }, { - "objectID": "modules/week08/case-b.html#instructions", - "href": "modules/week08/case-b.html#instructions", - "title": "Case Study B: The caveat of the caviar: navigating ethics to protect endangered river wildlife", + "objectID": "resources.html", + "href": "resources.html", + "title": "Resources", "section": "", - "text": "Read the scenario and answer questions based on the weekly readings and the lecture:\nA team of environmental scientists from the University of Wisconsin is researching the impact of climate change on the Mississippi River wildlife. The research involves collecting data on various environmental factors, including water temperature, salinity, pH, and nutrient levels, as well as data on the abundance and diversity of aquatic species on the lower basins of the river.\nThe team collects the data using various methods, including river sensors, underwater cameras, and traditional sampling techniques. The data is stored on a cloud-based platform, allowing sharing and real-time collaboration with multiple partners. However, the researchers soon realized that some of the collected data might be sensitive, including information on the distribution and abundance of the pallid sturgeon, recently listed by the Wisconsin Department of Natural Resources as one of the endangered species. They also realized that some data points might be of commercial interest, which can threaten the protection of the species and its habitat. For example, exposing the exact locations of the pallid sturgeon and their critical habitats could lead to increased poaching or unauthorized fishing, as the species is highly valued for its caviar.\nResearchers face ethical considerations related to sensitive data sharing. They want to comply with best practices in open science. However, in this process, they must balance the need to share their data openly to advance research and understanding of climate change and the Mississippi River ecosystem while protecting sensitive data and preventing it from being used for commercial purposes that could harm the environment." + "text": "Burnette. (2022). Managing environmental data: principles, techniques, and best practices. CRC Press. https://search.library.ucsb.edu/permalink/01UCSB_INST/1aqck9j/alma9917095897506531\nChapman, AD, & Grafton, O. (2008). Guide to Best Practices for Generalising Sensitive Species-Occurrence Data (v1.0). https://doi.org/10.15468/doc-b02j-gt10\nCrystal-Ornelas, R., Varadharajan, C., O’Ryan, D., Ramírez-Muñoz, J., Jones, M. B., Lehnert, K. A., … & Servilla, M. (2022). Enabling FAIR data in Earth and environmental science with community-centric (meta)data reporting formats. Scientific Data, 9(1), 700. https://doi.org/10.1038/s41597-022-01606-w\nJones, M. B., O’Brien, M., Mecum, B., Boettiger, C., Schildhauer, M., Maier, M., Whiteaker, T., Earl, S., & Chong, S. (2019). Ecological Metadata Language version 2.2.0. KNB Data Repository. https://doi.org/10.5063/F11834T2\nLabastida, I., & Margoni, T. (2020). Licensing FAIR Data for Reuse. Data Intelligence, 2(1-2), 199-207. https://doi.org/10.1162/dint_a_00042\nMcGovern, A., Ebert-Uphoff, I., Gagne, D., & Bostrom, A. (2022). 
Why we need to focus on developing ethical, responsible, and trustworthy artificial intelligence approaches for environmental science. Environmental Data Science, 1, E6. https://doi.org/10.1017/eds.2022.5\nRecknagel, F., & Michener, W. K. (Eds.). (2018). Ecological informatics: Data management and knowledge discovery (3rd ed.). Springer." }, { - "objectID": "modules/week08/case-b.html#questions", - "href": "modules/week08/case-b.html#questions", - "title": "Case Study B: The caveat of the caviar: navigating ethics to protect endangered river wildlife", - "section": "Questions", - "text": "Questions\n\nQuestion 1\nAs a data manager, what recommendations would you offer to the researchers to avoid commercial exploitation of the pallid sturgeon while contributing to advancing the research in the field? You do not need to write a long essay; elaborating your advice in bullet points is enough.\n\n\nQuestion 2\nAs a general rule, the _____________ of endangered, sensitive species should not be shared publicly. (hint: 1-2 words)" + "objectID": "resources.html#bibliography", + "href": "resources.html#bibliography", + "title": "Resources", + "section": "", + "text": "Burnette. (2022). Managing environmental data: principles, techniques, and best practices. CRC Press. https://search.library.ucsb.edu/permalink/01UCSB_INST/1aqck9j/alma9917095897506531\nChapman, AD, & Grafton, O. (2008). Guide to Best Practices for Generalising Sensitive Species-Occurrence Data (v1.0). https://doi.org/10.15468/doc-b02j-gt10\nCrystal-Ornelas, R., Varadharajan, C., O’Ryan, D., Ramírez-Muñoz, J., Jones, M. B., Lehnert, K. A., … & Servilla, M. (2022). Enabling FAIR data in Earth and environmental science with community-centric (meta)data reporting formats. Scientific Data, 9(1), 700. https://doi.org/10.1038/s41597-022-01606-w\nJones, M. B., O’Brien, M., Mecum, B., Boettiger, C., Schildhauer, M., Maier, M., Whiteaker, T., Earl, S., & Chong, S. (2019). Ecological Metadata Language version 2.2.0. KNB Data Repository. https://doi.org/10.5063/F11834T2\nLabastida, I., & Margoni, T. (2020). Licensing FAIR Data for Reuse. Data Intelligence, 2(1-2), 199-207. https://doi.org/10.1162/dint_a_00042\nMcGovern, A., Ebert-Uphoff, I., Gagne, D., & Bostrom, A. (2022). Why we need to focus on developing ethical, responsible, and trustworthy artificial intelligence approaches for environmental science. Environmental Data Science, 1, E6. https://doi.org/10.1017/eds.2022.5\nRecknagel, F., & Michener, W. K. (Eds.). (2018). Ecological informatics: Data management and knowledge discovery (3rd ed.). Springer." + }, + { + "objectID": "syllabus.html", + "href": "syllabus.html", + "title": "Course syllabus", + "section": "", + "text": "This course will teach students the fundamentals of relational databases and data management. Students will learn the principles of database modeling and design and gain practical experience applying SQL (Structured Query Language) to manage and manipulate relational databases. The course also introduces the role and application of data documentation and metadata standards for interoperability and effective data management. By the end of the course, students will be equipped to make informed decisions about managing databases and data ethically and responsibly, focusing on issues such as bias, data privacy, sharing, ownership, and licensing." 
+ }, + { + "objectID": "syllabus.html#overview", + "href": "syllabus.html#overview", + "title": "Course syllabus", + "section": "", + "text": "This course will teach students the fundamentals of relational databases and data management. Students will learn the principles of database modeling and design and gain practical experience applying SQL (Structured Query Language) to manage and manipulate relational databases. The course also introduces the role and application of data documentation and metadata standards for interoperability and effective data management. By the end of the course, students will be equipped to make informed decisions about managing databases and data ethically and responsibly, focusing on issues such as bias, data privacy, sharing, ownership, and licensing." + }, + { + "objectID": "syllabus.html#learning-objectives", + "href": "syllabus.html#learning-objectives", + "title": "Course syllabus", + "section": "Learning objectives", + "text": "Learning objectives\n\nUnderstand the fundamental principles of relational databases and relational data modeling, including table structures, primary and foreign keys, relationships between tables, and data normalization.\nUnderstand how to use the Unix command line and how manage DuckDB databases from the command line.\nUse SQL to retrieve, manipulate, and manage data stored in a relational database.\nDemonstrate proficiency in querying, filtering, sorting, and programmatically accessing and interacting with relational databases from R and Python.\nBecome familiar with advanced database topics such as concurrency, transactions, indexing, backups, and publication.\nUnderstand the role of good data documentation and metadata standards for interoperability, effective data management, and reproducibility.\nOperationalize the FAIR principles into data management practices.\nProduce a metadata record in EML (Ecological Metadata Language) and apply metadata crosswalks to programmatically convert metadata schemas.\nUnderstand the ethics of sensitive data and how to de-identify sensitive data.\nEvaluate ethical and responsible data management practices, including bias, data privacy, sharing, ownership, and licensing issues." + }, + { + "objectID": "syllabus.html#schedule", + "href": "syllabus.html#schedule", + "title": "Course syllabus", + "section": "Schedule", + "text": "Schedule\n\nClass: Monday & Wednesday 9:30-10:45 am (NCEAS)\nDiscussion - session 1: Thur 1-1:50PM, Bren Hall 1510\nDiscussion - session 2: Thur 2-2:50PM, Bren Hall 1510\nOffice hours: Monday 11-12 pm (NCEAS)" + }, + { + "objectID": "syllabus.html#modules", + "href": "syllabus.html#modules", + "title": "Course syllabus", + "section": "Modules", + "text": "Modules\n\n\n\nWeek\nTopic/Content\n\n\n\n\n1\nRelational databases and data modeling\n\n\n2\nAnalyzing & cleaning the bird dataset (from csv)\n\n\n3\nIntroduction to SQL part 1 & DuckDB\n\n\n4\nImporting data in the database + SQL part 2\n\n\n5\nAnalyzing the bird database using SQL + bash programming\n\n\n6\nUsing R & python to query databases\n\n\n7\nDocumenting your work: metdata & computing environment\n\n\n8\nSensitive data\n\n\n9\nEthical & responsible data mgnt\n\n\n10\nData licensing and publication\n\n\n\n*Schedule subject to change." 
+ }, + { + "objectID": "syllabus.html#course-assessment", + "href": "syllabus.html#course-assessment", + "title": "Course syllabus", + "section": "Course assessment", + "text": "Course assessment\nYour performance in this course will depend 90% on weekly homework assignments and 10% on class participation. There will be no graded exercises or homework the last week. We will use Canvas to manage the assignments." + }, + { + "objectID": "syllabus.html#attendance-and-homework-policy", + "href": "syllabus.html#attendance-and-homework-policy", + "title": "Course syllabus", + "section": "Attendance and homework policy", + "text": "Attendance and homework policy\nAttendance is required. Material will be given in class that is not covered by slides or background readings.\nHomework is expected to be turned in on time. A generous amount of time will be given to complete assignments. Do not wait until the last minute to work on homework in case something unexpected comes up. Homework turned in late will be docked 20% per day." + }, + { + "objectID": "syllabus.html#code-of-conduct", + "href": "syllabus.html#code-of-conduct", + "title": "Course syllabus", + "section": "Code of conduct", + "text": "Code of conduct\nAll students are expected to read and comply with the UCSB Student Conduct Code. We expect cooperation from all members to help ensure a welcoming and inclusive environment for everybody. We are determined to make our courses welcoming, inclusive and harassment-free for everyone regardless of gender, gender identity and expression, race, age, sexual orientation, disability, physical appearance, or religion (or lack thereof). We do not tolerate harassment of class participants, teaching assistants, or instructors in any form. Derogatory, abusive, or demeaning language or imagery will not be tolerated." + }, + { + "objectID": "syllabus.html#student-support", + "href": "syllabus.html#student-support", + "title": "Course syllabus", + "section": "Student support", + "text": "Student support\nWe understand that ongoing crises impact students differently based on experiences, identities, living situations and resources, family responsibilities, and unforeseen challenges. We encourage you to prioritize your well-being. We are here to help you reach your learning and career goals. You are always welcome to reach out to our teaching team so that we can best support you. Please see the UCSB Campus Resource Guide for campus student support and services." + }, + { + "objectID": "syllabus.html#disabled-students-program", + "href": "syllabus.html#disabled-students-program", + "title": "Course syllabus", + "section": "Disabled students program", + "text": "Disabled students program\nStudents with disabilities and/or alternative learning needs are encouraged to work with the Disabled Students Program at UCSB to ensure we can best support your learning and success." 
+ }, + { + "objectID": "modules/week08/index-08.html", + "href": "modules/week08/index-08.html", + "title": "Week 8 - Ethical and responsible data management", + "section": "", + "text": "Understand fundamental ethical and responsible data management principles, focusing on the importance of data documentation, preventing bias and harm, properly handling sensitive data, ownership, and licensing issues\nRelate ethical and responsible data management principles to real-world scenarios" + }, + { + "objectID": "modules/week08/index-08.html#learning-objectives", + "href": "modules/week08/index-08.html#learning-objectives", + "title": "Week 8 - Ethical and responsible data management", + "section": "", + "text": "Understand fundamental ethical and responsible data management principles, focusing on the importance of data documentation, preventing bias and harm, properly handling sensitive data, ownership, and licensing issues\nRelate ethical and responsible data management principles to real-world scenarios" + }, + { + "objectID": "modules/week08/index-08.html#slides", + "href": "modules/week08/index-08.html#slides", + "title": "Week 8 - Ethical and responsible data management", + "section": "Slides", + "text": "Slides\nslides-08.pptx\nSuggested readings\n\nBoté, J. J., & Térmens, M. (2019). Reusing data: Technical and ethical challenges. DESIDOC Journal of Library & Information Technology, 39(6) http://hdl.handle.net/2445/151341\nMcGovern, A., Ebert-Uphoff, I., Gagne, D., & Bostrom, A. (2022). Why we need to focus on developing ethical, responsible, and trustworthy artificial intelligence approaches for environmental science. Environmental Data Science, 1, E6. https://doi.org/10.1017/eds.2022.5\n\nAdditional suggested readings are noted in the slides." + }, + { + "objectID": "modules/week08/index-08.html#case-based-discussion", + "href": "modules/week08/index-08.html#case-based-discussion", + "title": "Week 8 - Ethical and responsible data management", + "section": "Case-based discussion:", + "text": "Case-based discussion:\n\nCase Study A: Containing the flames of bias in machine learning\nCase Study B: The caveat is the caviar: navigating ethics to protect endangered river wildlife\nCase Study C: To reuse or not reuse, that is the key question!\nCase Study D: Navigating the complexities of ownership zones" }, { - "objectID": "modules/week04/hw-04-2.html", - "href": "modules/week04/hw-04-2.html", - "title": "Week 4 - Who worked with whom?", + "objectID": "modules/week08/case-d.html", + "href": "modules/week08/case-d.html", + "title": "eds213", "section": "", - "text": "The Camp_assignment table lists where each person worked and when. Your goal is to answer, Who worked with whom? That is, you are to find all pairs of people who worked at the same site, and whose date ranges overlap while at that site. This can be solved using a self-join.\nA self-join of a table is a regular join, but instead of joining two different tables, we join two copies of the same table, which we will call the “A” copy and the “B” copy:\nThe idea is that the above join will give us rows that pair every person/site/date range with every other person/site/date range. With no conditions on the join, since there are \\(441\\) rows in the Camp_assignment table, the join will produce \\(441^2 = 194481\\) rows. But out of all those rows we want only those where the two people worked at the same site. So:\nSubmit your final SQL query." 
+ "text": "Read the scenario and answer the questions based on the weekly readings and the lecture:\nTina, a researcher working on coastal vulnerability analysis in Southern California, acquired LiDAR data from a vendor in 2017. Based on the acquired dataset, she submitted a paper to a high-impact academic journal early this year. The paper was accepted but is pending publication until Tina complies with the mandate of sharing supporting data and associated documentation in an open repository. While inspecting the data documentation, Max, the repository data manager, noticed that the files included raw and processed data from a vendor; however, no explicit declaration of authorization to share the data was included in the submission package. Tina presented an invoice of $20,000 USD certifying that she obtained the data and said she was told verbally that the data was not subject to any use restrictions." }, { - "objectID": "modules/week04/hw-04-2.html#bonus-problem", - "href": "modules/week04/hw-04-2.html#bonus-problem", - "title": "Week 4 - Who worked with whom?", - "section": "Bonus problem!", - "text": "Bonus problem!\nProduce this much nicer table by joining with the Personnel table:\n┌─────────┬─────────────────────┬───────────────────┐\n│ Site │ Name_1 │ Name_2 │\n│ varchar │ varchar │ varchar │\n├─────────┼─────────────────────┼───────────────────┤\n│ lkri │ Anastasia Popovkina │ Gleb Sedash │\n│ lkri │ Anastasia Popovkina │ Julya Loshchagina │\n└─────────┴─────────────────────┴───────────────────┘\nYou’ll need to join with the Personnel table twice, once for each observer column. You may need give abbreviations to tables (e.g., JOIN Personnel AS p1) to distinguish the tables and columns. You can do it!" + "objectID": "modules/week08/case-d.html#instructions", + "href": "modules/week08/case-d.html#instructions", + "title": "eds213", + "section": "", + "text": "Read the scenario and answer the questions based on the weekly readings and the lecture:\nTina, a researcher working on coastal vulnerability analysis in Southern California, acquired LiDAR data from a vendor in 2017. Based on the acquired dataset, she submitted a paper to a high-impact academic journal early this year. The paper was accepted but is pending publication until Tina complies with the mandate of sharing supporting data and associated documentation in an open repository. While inspecting the data documentation, Max, the repository data manager, noticed that the files included raw and processed data from a vendor; however, no explicit declaration of authorization to share the data was included in the submission package. Tina presented an invoice of $20,000 USD certifying that she obtained the data and said she was told verbally that the data was not subject to any use restrictions." }, { - "objectID": "modules/week04/hw-04-1.html", - "href": "modules/week04/hw-04-1.html", - "title": "Week 4 - Missing data", - "section": "", - "text": "Which sites have no egg data? Please answer this question using all three techniques demonstrated in class. In doing so, you will need to work with the Bird_eggs table, the Site table, or both. As a reminder, the techniques are:\n\nUsing a Code NOT IN (subquery) clause.\nUsing an outer join with a WHERE clause that selects the desired rows. Caution: make sure your IS NULL test is performed against a column that is not ordinarily allowed to be NULL. 
You may want to consult the database schema to remind yourself of column declarations.\nUsing the set operation EXCEPT.\n\nAdd an ORDER BY clause to your queries so that all three produce the exact same result:\n┌─────────┐\n│ Code │\n│ varchar │\n├─────────┤\n│ barr │\n│ burn │\n│ bylo │\n│ cakr │\n│ cari │\n│ chau │\n│ coat │\n│ colv │\n│ iglo │\n│ ikpi │\n│ lkri │\n│ made │\n│ nome │\n│ prba │\n├─────────┤\n│ 14 rows │\n└─────────┘\nSubmit your SQL." + "objectID": "modules/week08/case-d.html#questions", + "href": "modules/week08/case-d.html#questions", + "title": "eds213", + "section": "Questions", + "text": "Questions\n\nQuestion 1\nMax should advise Tina to acquire explicit permission from the data vendor to share the data.\n\nTrue\nFalse\n\n\n\nQuestion 2\nBecause Tina paid for the data, Max can move forward with the data publication without infringing on any legal and ethical aspects.\n\nTrue\nFalse\n\n\n\nQuestion 3\nIf Tina does not acquire explicit permission from the vendor to share the data, Max can’t publish the data in the repository.\n\nTrue\nFalse\n\n\n\nQuestion 4\nIf Tina does not acquire written permission to share the data, Max can suggest Tina share only aggregated data.\n\nTrue\nFalse" }, { - "objectID": "modules/week06/hw-06-3.html", - "href": "modules/week06/hw-06-3.html", - "title": "Week 6 - Who were the winners?", + "objectID": "modules/week08/case-a.html", + "href": "modules/week08/case-a.html", + "title": "Case Study A: Containing the flames of bias in machine learning", "section": "", - "text": "At the conclusion of the ASDN project the PIs decided to hand out first, second, and third prizes to the observers who measured the most eggs. Who won? Please use R and dbplyr to answer this question, and please submit your R code. Your code should print out:\n# Ordered by: desc(total_eggs)\n Name total_eggs\n <chr> <int>\n1 Vanessa Loverti 163\n2 Dylan Kessler 87\n3 Richard Lanctot 50\nYou’ll want to load database tables using statements such as:\negg_table <- tbl(conn, \"Bird_eggs\")\nand then use tidyverse grouping, summarization, joining, and other functions to compute the desired result.\nAlso, take your final expression and pipe it into show_query(). If you used multiple R statements, did dbplyr create a temporary table, or did it manage to do everything in one query? Did it limit to the first three rows using an R expression or an SQL LIMIT clause?" + "text": "Read the scenario and answer questions based on the weekly readings and the lecture:\nWildfires have become increasingly common and destructive in many regions worldwide, causing significant environmental and social problems. In response, many communities have implemented fire prevention and management strategies, including using machine learning (ML) algorithms to predict and mitigate the risk of wildfires.\nOakdale, located in a densely forested area in British Columbia, Canada, has implemented an ML algorithm to predict the risk of wildfires and prioritize fire prevention resources. The algorithm uses a variety of inputs, including historical fire data, weather patterns, topography, and vegetation coverage, to generate a risk score for each area of the city. However, after several months of using the algorithm, city officials noticed that specific neighborhoods with low-income and minority populations consistently receive lower risk scores than other areas with very similar environmental conditions. 
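Returning to the "Missing data" exercise above: the three techniques it lists could be sketched as below, again via DuckDB's Python API. The Bird_eggs columns Site and Egg_num are assumptions drawn from the ASDN metadata, so verify them against the schema before leaning on the IS NULL test.

```python
import duckdb

conn = duckdb.connect("database.db")

# Technique 1: NOT IN with a subquery.
q1 = """
    SELECT Code FROM Site
    WHERE Code NOT IN (SELECT Site FROM Bird_eggs)
    ORDER BY Code
"""

# Technique 2: outer join, keeping rows where the egg side is NULL.
# Egg_num is assumed to be a NOT NULL column of Bird_eggs; the test must
# be against a column like that, not one that may legitimately be NULL.
q2 = """
    SELECT s.Code FROM Site AS s
    LEFT JOIN Bird_eggs AS e ON s.Code = e.Site
    WHERE e.Egg_num IS NULL
    ORDER BY s.Code
"""

# Technique 3: set difference.
q3 = """
    SELECT Code FROM Site
    EXCEPT SELECT Site FROM Bird_eggs
    ORDER BY Code
"""

for q in (q1, q2, q3):
    print(conn.execute(q).fetchall())  # all three should agree
```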
Upon closer examination of those patterns in the data, they realized that the historical data used to train the algorithm was heavily concentrated on more affluent and predominantly white neighborhoods, resulting in a skewed view of the fire risks for the whole city." }, { - "objectID": "modules/week06/hw-06-2.html", - "href": "modules/week06/hw-06-2.html", - "title": "Week 6 - Characterizing egg variation", + "objectID": "modules/week08/case-a.html#instructions", + "href": "modules/week08/case-a.html#instructions", + "title": "Case Study A: Containing the flames of bias in machine learning", "section": "", - "text": "You read Egg Dimensions and Neonatal Mass of Shorebirds by Robert E. Ricklefs and want to see how the egg data we’ve been using in class compares to his results. Specifically, Ricklefs reported, “Coefficients of variation were 4 to 9% for egg volume” for shorebird eggs gathered in Manitoba, Canada. What is the range of coefficients of variation in our ASDN dataset?\nThe “coefficient of variation,” or CV, is a unitless measure of the variation of a sample, defined as the standard deviation divided by the mean and multiplied by 100 to express as a percentage. Thus, a CV of 10% means the standard deviation is 10% of the mean value. For the purposes of this computation, we will copy Ricklefs and use as a proxy for egg volume the formula\n\\[ W^2 L \\]\nwhere \\(W\\) is egg width and \\(L\\) is egg length.\nYour task is to create a Python program that reads data from the ASDN database and uses Pandas to compute, for each species in the database (for which there is egg data), the coefficient of variation of volume using the above formula. There are many ways this can be done. Because this assignment is primarily about programming in Python, please follow the steps below. Please submit your Python code when done.\n\nStep 1\nCreate a query that will return the distinct species for which there is egg data (not all species and not all nests have egg data), so that you can then loop over those species. Your query should return two columns, species code and scientific name. Please order the results in alphabetic order of scientific name.\n\n\nStep 2\nAfter you’ve connected to the database and created a cursor c, iterate over the species like so:\nspecies_query = \"\"\"SELECT Code, Scientific_name FROM...\"\"\"\nfor row in c.execute(species_query).fetchall(): # DuckDB lame-o workaround\n species_code = row[0]\n scientific_name = row[1]\n # query egg data for that species (step 3)\n # compute statistics and print results (step 4)\n\n\nStep 3\nYou will need to construct a query that gathers egg data for a given species, one species at a time; the species code will be a parameter to that query. You can compute the formula\n\\[ W^2 L \\]\nin SQL or in Pandas. For simplicity, SQL is suggested:\negg_query = \"\"\"SELECT Width*Width*Length AS Volume FROM...\"\"\"\nWithin the loop, you will want to execute the query on the current species in the loop iteration. You may use the Pandas function pd.read_sql to do so and also directly load the results into a dataframe:\ndf = pd.read_sql(egg_query, conn, ...)\nDo a help(pd.read_sql) to figure out how to pass parameters to a query.\nYou may get a bunch of warnings from Pandas about how it “only supports SQLAlchemy…”. Just ignore them. 
(Sorry about that.)\n\n\nStep 4\nFinally, and still within your loop, you’ll want to compute statistics and print out the results:\ncv = round(df.Volume.std()/df.Volume.mean()*100, 2)\nprint(f\"{scientific_name} {cv}%\")\nYour output should look like this:\nArenaria interpres 21.12%\nCalidris alpina 5.46%\nCalidris fuscicollis 16.77%\nCharadrius semipalmatus 8.99%\nPhalaropus fulicarius 4.65%\nPluvialis dominica 19.88%\nPluvialis squatarola 6.94%\n\n\nAppendix\nIt’s not necessary to use pd.read_sql to get data into a dataframe, it’s just a convenience. To do so manually (and to show you it’s not that hard), imagine that your query returns three columns. Collect the row data into three separate lists, then manually create a dataframe specifying the contents as a dictionary:\nrows = c.execute(\"SELECT Species, Width, Length FROM...\").fetchall()\nspecies_column = [t[0] for t in rows]\nwidth_column = [t[1] for t in rows]\nlength_column = [t[2] for t in rows]\n\ndf = pd.DataFrame(\n {\n \"species\": species_column,\n \"width\": width_column,\n \"length\": length_column\n }\n)" + "text": "Read the scenario and answer questions based on the weekly readings and the lecture:\nWildfires have become increasingly common and destructive in many regions worldwide, causing significant environmental and social problems. In response, many communities have implemented fire prevention and management strategies, including using machine learning (ML) algorithms to predict and mitigate the risk of wildfires.\nOakdale, located in a densely forested area in British Columbia, Canada, has implemented an ML algorithm to predict the risk of wildfires and prioritize fire prevention resources. The algorithm uses a variety of inputs, including historical fire data, weather patterns, topography, and vegetation coverage, to generate a risk score for each area of the city. However, after several months of using the algorithm, city officials noticed that specific neighborhoods with low-income and minority populations consistently receive lower risk scores than other areas with very similar environmental conditions. Upon closer examination of those patterns in the data, they realized that the historical data used to train the algorithm was heavily concentrated on more affluent and predominantly white neighborhoods, resulting in a skewed view of the fire risks for the whole city." }, { - "objectID": "modules/week06/index-06.html", - "href": "modules/week06/index-06.html", - "title": "Week 6 - Programming with databases", - "section": "", - "text": "Understand the basic database programming model\nAccess a DuckDB database from Python and R\nUnderstand how to use the Python/Pandas and R/dbplyr convenience functions" + "objectID": "modules/week08/case-a.html#questions", + "href": "modules/week08/case-a.html#questions", + "title": "Case Study A: Containing the flames of bias in machine learning", + "section": "Questions", + "text": "Questions\n\nQuestion 1\nThis case presents an ethical concern primarily associated with what?\n\n\nQuestion 2\nAccording to McGovern et al. (2022), which AI/ML issues can be identified in this case study? Justify your answer.\n\n\nQuestion 3\nSuppose you were hired as a consultant by Oakdale’s city officials. Which of the following recommendations would you give them to prevent perpetuating bias and inequitable outcomes? (Select all that apply)\n\nImplement transparency measures that make the algorithms’ decision-making processes more visible and understandable to stakeholders. 
This may include clarifying how decisions are made, sharing data sources, and providing access to model outputs. Fully document any limitations and shortcomings of the model and data.\nInvolve diverse stakeholders in the algorithm development and testing, including individuals from communities whose outputs may disproportionately impact. This can help identify and address potential biases and ensure that the algorithm is designed with the interests of all community members in mind.\nContinue using the algorithm as the official decision-making source until the re-training is completed. After all, ML methods are more efficient than traditional fire prevention strategies (e.g., fire breaks and vegetation management)." }, { - "objectID": "modules/week06/index-06.html#learning-objectives", - "href": "modules/week06/index-06.html#learning-objectives", - "title": "Week 6 - Programming with databases", + "objectID": "modules/week04/index-04.html", + "href": "modules/week04/index-04.html", + "title": "Week 4 - SQL and DuckDB", "section": "", - "text": "Understand the basic database programming model\nAccess a DuckDB database from Python and R\nUnderstand how to use the Python/Pandas and R/dbplyr convenience functions" + "text": "Continued exploration of SQL concepts including joins, views, and set operations; and apply it to conduct data analysis." }, { - "objectID": "modules/week06/index-06.html#slides-and-other-materials", - "href": "modules/week06/index-06.html#slides-and-other-materials", - "title": "Week 6 - Programming with databases", - "section": "Slides and other materials", - "text": "Slides and other materials\nslides-06.pptx\nCoding transcripts:\n\npython-programming.ipynb\npython-programming-cont.ipynb\nusing_dbplyr-06-wed-r.qmd" + "objectID": "modules/week04/index-04.html#learning-objectives", + "href": "modules/week04/index-04.html#learning-objectives", + "title": "Week 4 - SQL and DuckDB", + "section": "", + "text": "Continued exploration of SQL concepts including joins, views, and set operations; and apply it to conduct data analysis." }, { - "objectID": "modules/week06/index-06.html#resources", - "href": "modules/week06/index-06.html#resources", - "title": "Week 6 - Programming with databases", - "section": "Resources", - "text": "Resources\n\nhttps://peps.python.org/pep-0249/\n\nCommon Python-RDBMS API\n\nhttps://duckdb.org/docs/api/python/overview\n\nPython DuckDB module\n\nhttps://dbplyr.tidyverse.org/\n\nR dbplyr documentation" + "objectID": "modules/week04/index-04.html#slides-and-other-materials", + "href": "modules/week04/index-04.html#slides-and-other-materials", + "title": "Week 4 - SQL and DuckDB", + "section": "Slides and other materials", + "text": "Slides and other materials\nslides-04.pptx\nLecture notes:\n\nlecture-notes-04-mon.txt\nclass-script-04-mon.sql\nclass-script-04-wed-empty.sql\nclass-script-04-wed-solution.sql\n\nData:\n\nClass data GitHub repository, week 3 – for Monday\nClass data GitHub repository, week 4 – for Wednesday only\nASDN dataset ER (entity-relationship) diagram" }, { - "objectID": "modules/week06/index-06.html#homework", - "href": "modules/week06/index-06.html#homework", - "title": "Week 6 - Programming with databases", + "objectID": "modules/week04/index-04.html#homework", + "href": "modules/week04/index-04.html#homework", + "title": "Week 4 - SQL and DuckDB", "section": "Homework", - "text": "Homework\nLittle Bobby Tables\nCharacterizing egg variation\nWho were the winners?" 
+ "text": "Homework\nMissing data\nWho worked with whom?\nWho’s the culprit?" }, { - "objectID": "modules/week03/index-03.html", - "href": "modules/week03/index-03.html", - "title": "Week 3 - Structured Query Language (SQL) & DuckDB", + "objectID": "modules/week04/hw-04-3.html", + "href": "modules/week04/hw-04-3.html", + "title": "Week 4 - Who’s the culprit?", "section": "", - "text": "Understand the relationship of SQL to relational databases\nUnderstand how local databases differ from client/server databases\nUnderstand basic SQL syntax and statements\nBe able to answer basic questions about data" + "text": "You’re reading up on how eggs are aged by floating them in water 1:\nwhen you receive an urgent phone call from a colleague who says they just discovered that an observer, who worked at the “nome” site between 1998 and 2008 inclusive, had been floating eggs in salt water and not freshwater. The density of salt water being different, those measurements are incorrect and need to be adjusted. The colleague says that this incorrect technique was used on exactly 36 nests, but before you can ask who the observer was, the phone is disconnected. Who made this error? That is, looking at nest data for “nome” between 1998 and 2008 inclusive, and for which egg age was determined by floating, can you determine the name of the observer who observed exactly 36 nests? Please submit your SQL. Your SQL should return exactly one row, the answer. That is, your query should produce:" }, { - "objectID": "modules/week03/index-03.html#learning-objectives", - "href": "modules/week03/index-03.html#learning-objectives", - "title": "Week 3 - Structured Query Language (SQL) & DuckDB", - "section": "", - "text": "Understand the relationship of SQL to relational databases\nUnderstand how local databases differ from client/server databases\nUnderstand basic SQL syntax and statements\nBe able to answer basic questions about data" + "objectID": "modules/week04/hw-04-3.html#footnotes", + "href": "modules/week04/hw-04-3.html#footnotes", + "title": "Week 4 - Who’s the culprit?", + "section": "Footnotes", + "text": "Footnotes\n\n\nLiebezeit, Joseph R., et al. “Assessing the Development of Shorebird Eggs Using the Flotation Method: Species-Specific and Generalized Regression Models.” The Condor, vol. 109, no. 1, 2007, pp. 32–47. JSTOR, http://www.jstor.org/stable/4122529↩︎" }, { - "objectID": "modules/week03/index-03.html#slides-and-other-materials", - "href": "modules/week03/index-03.html#slides-and-other-materials", - "title": "Week 3 - Structured Query Language (SQL) & DuckDB", - "section": "Slides and other materials", - "text": "Slides and other materials\nslides-03.pptx\nLecture notes:\n\nlecture-notes-03-mon.txt\nlecture-notes-03-wed.txt\nclass-script-03-wed.sql\n\nASDN dataset ER (entity-relationship) diagram\nClass data GitHub repository, week 3" + "objectID": "modules/week06/python-programming-cont.html", + "href": "modules/week06/python-programming-cont.html", + "title": "Pandas", + "section": "", + "text": "To install the duckdb Python package:\n\n%pip install duckdb\n\nCommon model: connect to database, get a cursor. In Python, all database packages follow the DB-API standard, so they all look the same. See course website for pointer to DB-API.\n\nimport duckdb\n\n\nconn = duckdb.connect(\"database.db\")\n\nCursor mediates access to query, getting results. Can deal with one query at a time.\n\ncur = conn.cursor()\n\nGet all results. 
Cursor is streaming mechanism, does not store results.\n\ncur.execute(\"SELECT * FROM Camp_assignment LIMIT 3\")\ncur.fetchall()\n\n[(2005,\n 'bylo',\n 'lmckinnon',\n datetime.date(2005, 6, 1),\n datetime.date(2005, 8, 5)),\n (2005,\n 'bylo',\n 'blalibert',\n datetime.date(2005, 6, 1),\n datetime.date(2005, 8, 20)),\n (2006,\n 'bylo',\n 'lmckinnon',\n datetime.date(2006, 6, 1),\n datetime.date(2006, 8, 5))]\n\n\n\ncur.fetchall()\n\n[]\n\n\nOr get one row at a time\n\ncur.execute(\"SELECT * FROM Camp_assignment LIMIT 3\")\ncur.fetchone()\n\n(2005,\n 'bylo',\n 'lmckinnon',\n datetime.date(2005, 6, 1),\n datetime.date(2005, 8, 5))\n\n\n\ncur.fetchone()\n\n(2005,\n 'bylo',\n 'blalibert',\n datetime.date(2005, 6, 1),\n datetime.date(2005, 8, 20))\n\n\n\ncur.fetchone()\n\n(2006,\n 'bylo',\n 'lmckinnon',\n datetime.date(2006, 6, 1),\n datetime.date(2006, 8, 5))\n\n\n\ncur.fetchone()\n\nExtended example showing looping over cursor (DuckDB does not support direct iteration over cursor), using second cursor, using parameterized queries.\n\ninner_query = \"\"\"\n SELECT COUNT(*) AS num_nests\n FROM Bird_nests\n WHERE Observer = ?\n\"\"\"\n\nouter_query = \"\"\"\n SELECT DISTINCT Observer FROM Bird_nests\n\"\"\"\nfor row in cur.execute(outer_query).fetchall():\n observer = row[0]\n cur2 = conn.cursor()\n cur2.execute(inner_query, [observer])\n print(f\"Observer {observer} gathered {cur2.fetchone()[0]} nests\")\n\nObserver mballvanzee gathered 2 nests\nObserver dkessler gathered 69 nests\nObserver bharrington gathered 245 nests\nObserver lmckinnon gathered 249 nests\nObserver dhodkinson gathered 15 nests\nObserver mbwunder gathered 4 nests\nObserver None gathered 0 nests\nObserver kkalasz gathered 12 nests\nObserver bhill gathered 55 nests\nObserver ssaalfeld gathered 13 nests\nObserver wenglish gathered 18 nests\nObserver lworing gathered 14 nests\nObserver vloverti gathered 54 nests\nObserver rlanctot gathered 40 nests\nObserver abankert gathered 17 nests\nObserver edastrous gathered 38 nests\nObserver jzamuido gathered 11 nests\nObserver amould gathered 42 nests\nObserver bkaselow gathered 4 nests\nObserver jflamarre gathered 43 nests\n\n\n\nPandas\n\nimport pandas as pd\n\n\ndf = pd.read_sql(\"SELECT * FROM Site\", conn)\n\n/var/folders/rl/j368fbbx25l937pdxgzdpmxm0000gq/T/ipykernel_18456/2832309421.py:1: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. 
Please consider using SQLAlchemy.\n df = pd.read_sql(\"SELECT * FROM Site\", conn)\n\n\n\ndf\n\n\n\n\n\n\n\n\n\nCode\nSite_name\nLocation\nLatitude\nLongitude\nArea\n\n\n\n\n0\nbarr\nBarrow\nAlaska, USA\n71.300003\n-156.600006\n220.399994\n\n\n1\nburn\nBurntpoint Creek\nOntario, Canada\n55.200001\n-84.300003\n63.000000\n\n\n2\nbylo\nBylot Island\nNunavut, Canada\n73.199997\n-80.000000\n723.599976\n\n\n3\ncakr\nCape Krusenstern\nAlaska, USA\n67.099998\n-163.500000\n54.099998\n\n\n4\ncari\nCanning River Delta\nAlaska, USA\n70.099998\n-145.800003\n722.000000\n\n\n5\nchau\nChaun River Delta\nChukotka, Russia\n68.800003\n170.600006\n248.199997\n\n\n6\nchur\nChurchill\nManitoba, Canada\n58.700001\n-93.800003\n866.900024\n\n\n7\ncoat\nCoats Island\nNunavut, Canada\n62.900002\n-82.500000\n1239.099976\n\n\n8\ncolv\nColville River Delta\nAlaska, USA\n70.400002\n-150.699997\n324.799988\n\n\n9\neaba\nEast Bay\nNunavut, Canada\n64.000000\n-81.699997\n1205.500000\n\n\n10\niglo\nIgloolik\nNunavut, Canada\n69.400002\n-81.599998\n59.799999\n\n\n11\nikpi\nIkpikpuk\nAlaska, USA\n70.599998\n-154.699997\n174.100006\n\n\n12\nlkri\nLower Khatanga River\nKrasnoyarsk, Russia\n72.900002\n106.099998\n270.899994\n\n\n13\nmade\nMackenzie River Delta\nNorthwest Territories, Canada\n69.400002\n-135.000000\n667.299988\n\n\n14\nnome\nNome\nAlaska, USA\n64.400002\n-164.899994\n90.099998\n\n\n15\nprba\nPrudhoe Bay\nAlaska, USA\n70.300003\n-148.600006\n120.000000" }, { - "objectID": "modules/week03/index-03.html#resources", - "href": "modules/week03/index-03.html#resources", - "title": "Week 3 - Structured Query Language (SQL) & DuckDB", - "section": "Resources", - "text": "Resources\n\nhttp://swcarpentry.github.io/sql-novice-survey/\n\nGood Carpentry lesson, our lesson is drawn from this.\n\nC.J. Date and Hugh Darwen (1993). A Guide to the SQL Standard. 3rd ed. Reading, MA: Addison-Wesley.\nAccess via Library Catalog\n\nThe ANSI standard.\n\nJoe Celko (1995). Joe Celko’s SQL For Smarties: Advanced SQL Programming. San Francisco, CA: Morgan Kaufmann.\nAccess via Library Catalog\n\nThis guy is an SQL guru. Newer versions of this book are available online, check the Library catalog (a bug is preventing me from linking directly).\n\nGrant Allen and Mike Owens (2010). The Definitive Guide to SQLite. 2nd ed. Berkeley, CA: Apress.\nAccess via Library Catalog\n\nGood reference. Can access online!\n\nJeffrey D. Ullman and Jennifer Widom (2008). A First Course in Database Systems. 3rd ed. Upper Saddle River, NJ: Pearson/Prentice Hall.\nAccess via Library Catalog\n\nComplete but theoretical introduction to relational databases, data modeling, and relational algebra." + "objectID": "modules/week06/hw-06-1.html", + "href": "modules/week06/hw-06-1.html", + "title": "Week 6 - Little Bobby Tables", + "section": "", + "text": "View this classic XKCD cartoon:\nFor the purposes of this problem you may assume that at some point the school’s system performs the query\nwhere a student’s name, as input by a user of the system, is directly substituted for the %s. Explain exactly how Little Bobby Tables’ “name” can cause a catastrophe. Also, explain why his name has two dashes (--) at the end." 
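To make the Little Bobby Tables failure mode concrete, here is a sketch using plain Python string formatting and a hypothetical Students insert (the strip never shows the school's actual query), demonstrating how one crafted "name" becomes two statements plus a comment:

```python
# Pure string handling, no real database: the template and table name
# Students are hypothetical stand-ins for the school's system.
template = "INSERT INTO Students (name) VALUES ('%s');"
name = "Robert'); DROP TABLE Students;--"

print(template % name)
# INSERT INTO Students (name) VALUES ('Robert'); DROP TABLE Students;--');
# The quote and parenthesis close the INSERT early, the semicolon starts
# a second statement (the DROP TABLE), and the trailing -- comments out
# the leftover ');' so the whole thing still parses as valid SQL.
```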
}, { - "objectID": "modules/week03/index-03.html#homework", - "href": "modules/week03/index-03.html#homework", - "title": "Week 3 - Structured Query Language (SQL) & DuckDB", - "section": "Homework", - "text": "Homework\nSQL problem 1\nSQL problem 2\nSQL problem 3" + "objectID": "modules/week06/hw-06-1.html#bonus-problem", + "href": "modules/week06/hw-06-1.html#bonus-problem", + "title": "Week 6 - Little Bobby Tables", + "section": "Bonus problem!", + "text": "Bonus problem!\nHack your bird database! Let’s imagine that your Shiny application, in response to user input, executes the query\nSELECT * FROM Species WHERE Code = '%s';\nwhere a species code (supplied by the application user) is directly substituted for the query’s %s using Python interpolation. For example, an innocent user might input “wolv”. Craft an input that a devious user could use to:\n\nAdd Taylor Swift to the Personnel table\nYet still return the results of the query SELECT * FROM Species WHERE Code = 'wolv' (devious!)" }, { - "objectID": "modules/week03/hw-03-2.html", - "href": "modules/week03/hw-03-2.html", - "title": "Week 3 - SQL problem 2", + "objectID": "modules/week06/python-programming.html", + "href": "modules/week06/python-programming.html", + "title": "eds213", "section": "", - "text": "Part 1\nIf we want to know which site has the largest area, it’s tempting to say\nSELECT Site_name, MAX(Area) FROM Site;\nWouldn’t that be great? But DuckDB gives an error. And right it should! This query is conceptually flawed. Please describe what is wrong with this query. Don’t just quote DuckDB’s error message— explain why DuckDB is objecting to performing this query.\nTo help you answer this question, you may want to consider:\n\nTo the database, the above query is no different from\n\nSELECT Site_name, AVG(Area) FROM Site\nSELECT Site_name, COUNT(*) FROM Site\nSELECT Site_name, SUM(Area) FROM Site\n\nIn all these examples, the database sees that it is being asked to apply an aggregate function to a table column.\nWhen performing an aggregation, SQL wants to collapse the requested columns down to a single row. (For a table-level aggregation such as requested above, it wants to collapse the entire table down to a single row. For a GROUP BY, it wants to collapse each group down to a single row.)\n\n\n\nPart 2\nTime for plan B. Find the site name and area of the site having the largest area. Do so by ordering the rows in a particularly convenient order, and using LIMIT to select just the first row. Your result should look like:\n┌──────────────┬────────┐\n│ Site_name │ Area │\n│ varchar │ float │\n├──────────────┼────────┤\n│ Coats Island │ 1239.1 │\n└──────────────┴────────┘\nPlease submit your SQL.\n\n\nPart 3\nDo the same, but use a nested query. First, create a query that finds the maximum area. Then, create a query that selects the site name and area of the site whose area equals the maximum. 
Your overall query will look something like:\nSELECT Site_name, Area FROM Site WHERE Area = (SELECT ...);" + "text": "import duckdb\n\nExample of Jupyter “magic command”:\n\n%pwd\n\n'/Users/gjanee-local/Desktop/meds/bren-meds213-spring-2024-class-data/week3'\n\n\nTo install DuckDB Python module:\n\n# %pip install duckdb\n\n\nCreate a connection and a cursor\n\n\nconn = duckdb.connect(\"database.db\")\n\n\nconn\n\n<duckdb.duckdb.DuckDBPyConnection at 0x1040abb70>\n\n\n\ncur = conn.cursor()\n\nNow let’s do something with our cursor\n\ncur.execute(\"SELECT * FROM Site LIMIT 5\")\n\n<duckdb.duckdb.DuckDBPyConnection at 0x10406b6b0>\n\n\nNow we want results… three ways of getting them. 1. All results at once\n\ncur.fetchall()\n\n[('barr',\n 'Barrow',\n 'Alaska, USA',\n 71.30000305175781,\n -156.60000610351562,\n 220.39999389648438),\n ('burn',\n 'Burntpoint Creek',\n 'Ontario, Canada',\n 55.20000076293945,\n -84.30000305175781,\n 63.0),\n ('bylo',\n 'Bylot Island',\n 'Nunavut, Canada',\n 73.19999694824219,\n -80.0,\n 723.5999755859375),\n ('cakr',\n 'Cape Krusenstern',\n 'Alaska, USA',\n 67.0999984741211,\n -163.5,\n 54.099998474121094),\n ('cari',\n 'Canning River Delta',\n 'Alaska, USA',\n 70.0999984741211,\n -145.8000030517578,\n 722.0)]\n\n\nCursors don’t store anything, they just transfer queries to the database and get results back.\n\ncur.fetchall()\n\n[]\n\n\nAlways get tuples, even if you only request one column\n\ncur.execute(\"SELECT Nest_ID FROM Bird_nests LIMIT 10\")\n\n<duckdb.duckdb.DuckDBPyConnection at 0x10406b6b0>\n\n\n\ncur.fetchall()\n\n[('14HPE1',),\n ('11eaba',),\n ('11eabaagc01',),\n ('11eabaagv01',),\n ('11eababbc02',),\n ('11eababsv01',),\n ('11eabaduh01',),\n ('11eabaduv01',),\n ('11eabarpc01',),\n ('11eabarpc02',)]\n\n\n\ncur.execute(\"SELECT Nest_ID FROM Bird_nests LIMIT 10\")\n[t[0] for t in cur.fetchall()]\n\n['14HPE1',\n '11eaba',\n '11eabaagc01',\n '11eabaagv01',\n '11eababbc02',\n '11eababsv01',\n '11eabaduh01',\n '11eabaduv01',\n '11eabarpc01',\n '11eabarpc02']\n\n\n\nGet the one result, or the next result\n\n\ncur.execute(\"SELECT COUNT(*) FROM Bird_nests\")\ncur.fetchall()\n\n[(1547,)]\n\n\n\ncur.execute(\"SELECT COUNT(*) FROM Bird_nests\")\ncur.fetchone()\n\n(1547,)\n\n\n\ncur.execute(\"SELECT COUNT(*) FROM Bird_nests\")\ncur.fetchone()[0]\n\n1547\n\n\n\nUsing an iterator - but DuckDB doesn’t support iterators :(\n\n\ncur.execute(\"SELECT Nest_ID FROM Bird_nests LIMIT 10\")\nfor row in cur:\n print(f\"got {row[0]}\")\n\nTypeError: 'duckdb.duckdb.DuckDBPyConnection' object is not iterable\n\n\nA workaround:\n\ncur.execute(\"SELECT Nest_ID FROM Bird_nests LIMIT 10\")\nwhile True:\n row = cur.fetchone()\n if row == None:\n break\n # do something with row\n print(f\"got nest ID {row[0]}\")\n\ngot nest ID 14HPE1\ngot nest ID 11eaba\ngot nest ID 11eabaagc01\ngot nest ID 11eabaagv01\ngot nest ID 11eababbc02\ngot nest ID 11eababsv01\ngot nest ID 11eabaduh01\ngot nest ID 11eabaduv01\ngot nest ID 11eabarpc01\ngot nest ID 11eabarpc02\n\n\nCan do things other than SELECT!\n\ncur.execute(\"CREATE TEMP TABLE temp_table AS\n SELECT * FROM Bird_nests LIMIT 10\")\n\nSyntaxError: unterminated string literal (detected at line 1) (1747419494.py, line 1)\n\n\n\ncur.execute(\"\"\"\n CREATE TEMP TABLE temp_table AS\n SELECT * FROM Bird_nests LIMIT 10\n\"\"\")\n\n<duckdb.duckdb.DuckDBPyConnection at 0x10406b6b0>\n\n\n\ncur.execute(\"SELECT * FROM temp_table\")\n\n<duckdb.duckdb.DuckDBPyConnection at 0x10406b6b0>\n\n\n\ncur.fetchall()\n\n[('b14.6',\n 2014,\n 'chur',\n '14HPE1',\n 
'sepl',\n 'vloverti',\n datetime.date(2014, 6, 14),\n None,\n 3,\n None,\n None),\n ('b11.7',\n 2011,\n 'eaba',\n '11eaba',\n 'wrsa',\n 'bhill',\n datetime.date(2011, 7, 10),\n 'searcher',\n 4,\n None,\n None),\n ('b11.6',\n 2011,\n 'eaba',\n '11eabaagc01',\n 'amgp',\n 'dkessler',\n datetime.date(2011, 6, 24),\n 'searcher',\n 4,\n 6.0,\n 'float'),\n ('b11.6',\n 2011,\n 'eaba',\n '11eabaagv01',\n 'amgp',\n 'dkessler',\n datetime.date(2011, 6, 25),\n 'searcher',\n 3,\n 3.0,\n 'float'),\n ('b11.6',\n 2011,\n 'eaba',\n '11eababbc02',\n 'bbpl',\n 'dkessler',\n datetime.date(2011, 6, 24),\n 'searcher',\n 4,\n 4.0,\n 'float'),\n ('b11.7',\n 2011,\n 'eaba',\n '11eababsv01',\n 'wrsa',\n 'bhill',\n datetime.date(2011, 7, 7),\n 'searcher',\n 4,\n 2.0,\n 'float'),\n ('b11.6',\n 2011,\n 'eaba',\n '11eabaduh01',\n 'dunl',\n 'dkessler',\n datetime.date(2011, 6, 28),\n 'searcher',\n 3,\n 2.0,\n 'float'),\n ('b11.6',\n 2011,\n 'eaba',\n '11eabaduv01',\n 'dunl',\n 'dkessler',\n datetime.date(2011, 6, 29),\n 'searcher',\n 4,\n 5.0,\n 'float'),\n ('b11.7',\n 2011,\n 'eaba',\n '11eabarpc01',\n 'reph',\n 'bhill',\n datetime.date(2011, 7, 8),\n 'searcher',\n 4,\n 4.0,\n 'float'),\n ('b11.7',\n 2011,\n 'eaba',\n '11eabarpc02',\n 'reph',\n 'bhill',\n datetime.date(2011, 7, 8),\n 'searcher',\n 3,\n 4.0,\n 'float')]\n\n\nA note on fragility\nFor example: INSERT INTO Site VALUES (“abcd”, “Foo”, 35.7, 42.3, “?”)\nA less fragile way of expressing the same thing: INSERT INTO Site (Code, Site_name, Latitude, Longitude, Something_else) VALUES (“abcd”, “Foo”, 35.7, 42.3, “?”)\nIn the same vein: SELECT * is fragile\n\ncur.execute(\"SELECT * FROM Site LIMIT 3\")\ncur.fetchall()\n\n[('barr',\n 'Barrow',\n 'Alaska, USA',\n 71.30000305175781,\n -156.60000610351562,\n 220.39999389648438),\n ('burn',\n 'Burntpoint Creek',\n 'Ontario, Canada',\n 55.20000076293945,\n -84.30000305175781,\n 63.0),\n ('bylo',\n 'Bylot Island',\n 'Nunavut, Canada',\n 73.19999694824219,\n -80.0,\n 723.5999755859375)]\n\n\nA better, more robust way of coding the same thing:\n\ncur.execute(\"SELECT Site_name, Code, Latitude, Longitude FROM Site LIMIT 3\")\ncur.fetchall()\n\n[('Barrow', 'barr', 71.30000305175781, -156.60000610351562),\n ('Burntpoint Creek', 'burn', 55.20000076293945, -84.30000305175781),\n ('Bylot Island', 'bylo', 73.19999694824219, -80.0)]\n\n\nAn extended example: Question we’re trying to answer: How many nests do we have for each species?\nApproach: first get all species. Then execute a count query for each species.\nA digression: string interpolation in Python\n\n# The % method\ns = \"My name is %s\"\nprint(s % \"Greg\")\ns = \"My name is %s and the other teacher's name is %s\"\nprint(s % (\"Greg\", \"Julien\"))\n# The new f-string method\nname = \"Greg\"\nprint(f\"My name is {name}\")\n# Third way\nprint(\"My name is {}\".format(\"Greg\"))\n\nMy name is Greg\nMy name is Greg and the other teacher's name is Julien\nMy name is Greg\nMy name is Greg\n\n\n\nquery = \"\"\"\n SELECT COUNT(*) FROM Bird_nests\n WHERE Species = '%s'\n\"\"\"\ncur.execute(\"SELECT Code FROM Species LIMIT 3\")\nfor row in cur.fetchall(): # DuckDB workaround\n code = row[0]\n prepared_query = query % code\n #print(prepared_query)\n cur2 = conn.cursor()\n cur2.execute(prepared_query)\n print(f\"Species {code} has {cur2.fetchone()[0]} nests\")\n cur2.close()\n\nSpecies agsq has 0 nests\nSpecies amcr has 0 nests\nSpecies amgp has 29 nests\n\n\nThe above Python interpolation is dangerous and has caused many database hacks! 
There’s a better way\n\nquery = \"\"\"\n SELECT COUNT(*) FROM Bird_nests\n WHERE Species = ?\n\"\"\"\ncur.execute(\"SELECT Code FROM Species LIMIT 3\")\nfor row in cur.fetchall(): # DuckDB workaround\n code = row[0]\n # NOT NEEDED! prepared_query = query % code\n #print(prepared_query)\n cur2 = conn.cursor()\n cur2.execute(query, [code]) # <-- added argument here\n print(f\"Species {code} has {cur2.fetchone()[0]} nests\")\n cur2.close()\n\nSpecies agsq has 0 nests\nSpecies amcr has 0 nests\nSpecies amgp has 29 nests\n\n\nLet’s illustrate the danger with a different example\n\nabbrev = \"TS\"\nname = \"Taylor Swift\"\ncur.execute(\"\"\"\n INSERT INTO Personnel (Abbreviation, Name)\n VALUES ('%s', '%s')\n \"\"\" % (abbrev, name)\n )\n\n<duckdb.duckdb.DuckDBPyConnection at 0x10406b6b0>\n\n\n\ncur.execute(\"SELECT * FROM Personnel\")\ncur.fetchall()[-3:]\n\n[('emagnuson', 'Emily Magnuson'),\n ('mcorrell', 'Maureen Correll'),\n ('TS', 'Taylor Swift')]\n\n\n\nabbrev = \"CO\"\nname = \"Conan O'Brien\"\ncur.execute(\"\"\"\n INSERT INTO Personnel (Abbreviation, Name)\n VALUES ('%s', '%s')\n \"\"\" % (abbrev, name)\n )\n\nParserException: Parser Error: syntax error at or near \"Brien\"\n\n\n\n\"\"\"\n INSERT INTO Personnel (Abbreviation, Name)\n VALUES ('%s', '%s')\n \"\"\" % (abbrev, name)\n\n\"\\n INSERT INTO Personnel (Abbreviation, Name)\\n VALUES ('CO', 'Conan O'Brien')\\n \"\n\n\n\nabbrev = \"CO\"\nname = \"Conan O'Brien\"\ncur.execute(\"\"\"\n INSERT INTO Personnel (Abbreviation, Name)\n VALUES (?, ?)\n \"\"\",\n [abbrev, name])\n\n<duckdb.duckdb.DuckDBPyConnection at 0x10406b6b0>\n\n\n\ncur.execute(\"SELECT * FROM Personnel\")\ncur.fetchall()[-3:]\n\n[('mcorrell', 'Maureen Correll'),\n ('TS', 'Taylor Swift'),\n ('CO', \"Conan O'Brien\")]" }, { - "objectID": "modules/week01/index-01.html", - "href": "modules/week01/index-01.html", - "title": "Week 1 - Relational databases and data modeling", + "objectID": "modules/week03/hw-03-1.html", + "href": "modules/week03/hw-03-1.html", + "title": "Week 3 - SQL problem 1", "section": "", - "text": "Benefits of relational databases\nRelational data model and SQL data definition\nData modeling" + "text": "It’s a useful skill in life (I’m not being rhetorical, I really mean that, it’s a useful skill) to be able to construct an experiment to answer a hypothesis. Suppose you’re not sure what the AVG function returns if there are NULL values in the column being averaged. Suppose you either didn’t have access to any documentation, or didn’t trust it. What experiment could you run to find out what happens?\nThere are two parts to this problem.\n\nPart 1\nConstruct an SQL experiment to determine the answer to the question above. Does SQL abort with some kind of error? Does it ignore NULL values? Do the NULL values somehow factor into the calculation, and if so, how?\nI would suggest you start by creating a table (in the bird database, in a new database, in a transient in-memory database, doesn’t matter) with a single column that has data type REAL (for part 2 below, it must be REAL). You can make your table a temp table or not, your choice.\nCREATE TEMP TABLE mytable... ;\nNow insert some real numbers and at least one NULL into your table.\nINSERT INTO mytable... ;\n(Hmm, can you insert multiple rows at once, or do you have to do a separate INSERT for each row?)\nOnce you have your little table constructed, try doing an AVG on the column and see what is returned. What would the average be if the function ignored NULLs? 
What would the average be if it somehow factored them in? What is actually returned?\nPlease submit both your SQL and your answer to the question about how AVG operates in the presence of NULL values.\n\n\nPart 2\nIf SQL didn’t have an AVG function, you could compute the average value of a column by doing something like this on your table:\nSELECT SUM(mycolumn)/COUNT(*) FROM mytable;\nSELECT SUM(mycolumn)/COUNT(mycolumn) FROM mytable;\nWhich query above is correct? Please explain why.\nNow that you’re done with your table, you can delete it if desired:\nDROP TABLE mytable;" }, { - "objectID": "modules/week01/index-01.html#learning-objectives", - "href": "modules/week01/index-01.html#learning-objectives", - "title": "Week 1 - Relational databases and data modeling", + "objectID": "modules/week03/hw-03-3.html", + "href": "modules/week03/hw-03-3.html", + "title": "Week 3 - SQL problem 3", "section": "", - "text": "Benefits of relational databases\nRelational data model and SQL data definition\nData modeling" - }, - { - "objectID": "modules/week01/index-01.html#slides", - "href": "modules/week01/index-01.html#slides", - "title": "Week 1 - Relational databases and data modeling", - "section": "Slides", - "text": "Slides\nslides-01.pptx" + "text": "Your mission is to list the scientific names of bird species in descending order of their maximum average egg volumes. That is, compute the average volume of the eggs in each nest, and then for the nests of each species compute the maximum of those average volumes, and list by species in descending order of maximum volume. You final table should look like:\n┌─────────────────────────┬────────────────────┐\n│ Scientific_name │ Max_avg_volume │\n│ varchar │ double │\n├─────────────────────────┼────────────────────┤\n│ Pluvialis squatarola │ 36541.8525390625 │\n│ Pluvialis dominica │ 33847.853515625 │\n│ Arenaria interpres │ 23338.6220703125 │\n│ Calidris fuscicollis │ 13277.143310546875 │\n│ Calidris alpina │ 12196.237548828125 │\n│ Charadrius semipalmatus │ 11266.974975585938 │\n│ Phalaropus fulicarius │ 8906.775146484375 │\n└─────────────────────────┴────────────────────┘\n(By the way, regarding the leader in egg size above, Birds of the World says that Pluvialis squatarola’s eggs are “Exceptionally large for size of female (ca. 16% weight of female)”.)\nTo calculate the volume of an egg, use the simplified formula\n\\[{\\pi \\over 6} W^2 L\\]\nwhere \\(W\\) is the egg width and \\(L\\) is the egg length. You can use 3.14 for \\(\\pi\\). (The real formula takes into account the ovoid shape of eggs, but only width and length are available to us here.)\nA good place to start is just to group bird eggs by nest (i.e., Nest_ID) and compute average volumes:\nCREATE TEMP TABLE Averages AS\n SELECT Nest_ID, AVG(...) AS Avg_volume\n FROM ...\n GROUP BY ...;\nYou can now join that table with Bird_nests, so that you can group by species, and also join with the Species table to pick up scientific names. To do just the first of those joins, you could say something like\nSELECT Species, MAX(...)\n FROM Bird_nests JOIN Averages USING (Nest_ID)\n GROUP BY ...;\n(Notice how, if the joined columns have the same name, you can more compactly say USING (common_column) instead of ON column_a = column_b.)\nThat’s not the whole story, we want scientific names not species codes. Another join is needed. A couple strategies here. One, you can modify the above query to also join with the Species table (you’ll need to replace USING with ON …). 
Two, you can save the above as another temp table and join it to Species separately.\nDon’t forget to order the results. Here it is convenient to give computed quantities nice names so you can refer to them.\nPlease submit all of the SQL you used to solve the problem. Bonus points if you can do all of the above in one statement." }, { - "objectID": "modules/week01/index-01.html#resources", - "href": "modules/week01/index-01.html#resources", - "title": "Week 1 - Relational databases and data modeling", - "section": "Resources", - "text": "Resources\n\nhttps://learning.nceas.ucsb.edu/2023-06-delta/session_09.html\n\nVery brief introduction to data modeling, ties into “tidy data.”\n\nChristoph Wohner, Johannes Peterseil, and Hermann Klug (2022). Designing and implementing a data model for describing environmental monitoring and research sites. Ecological Informatics 70, 101708.\nhttps://doi.org/10.1016/j.ecoinf.2022.101708\n\nGood case study.\n\nGerald A. Burnette (2022). Managing environmental data: principles, techniques, and best practices. CRC Press.\nAccess via Library Catalog\n\nComprehensive text, specific to environmental sciences.\n\nGraeme C. Simsion and Graham C. Witt (2005). Data Modeling Essentials. 3rd ed. Amsterdam: Morgan Kaufmann.\nAccess via Library Catalog\nGoogle Books\n\nComprehensive text, not specific to the environmental sciences.\n\nHartmut Hebbel (1994). Environmental data modeling. Annals of Operations Research 54, 263-278.\nhttps://doi.org/10.1007/BF02031737\n\nA broader view of data organization.\n\nJeffrey D. Ullman and Jennifer Widom (2008). A First Course in Database Systems. 3rd ed. Upper Saddle River, NJ: Pearson/Prentice Hall.\nAccess via Library Catalog\n\nComplete but theoretical introduction to relational databases, data modeling, and relational algebra." + "objectID": "modules/week01/hw-01-2.html", + "href": "modules/week01/hw-01-2.html", + "title": "Week 1 - Data modeling", + "section": "", + "text": "Please use Canvas to return the assignments: https://ucsb.instructure.com/courses/19301/assignments/224311\nCreate a table definition for the Snow_survey table that is maximally expressive, that is, that captures as much of the semantics and characteristics of the data using SQL’s data definition language as is possible.\nIn the class data GitHub repository, week 1 directory you will find the table described in the metadata (consult 01_ASDN_Readme.txt) and the data can be found in ASDN_Snow_survey.csv. You will want to look at the values that occur in the data using a tool like R, Python, or OpenRefine.\nPlease consider:\nYou may (or may not) want to take advantage of the Species, Site, Color_band_code, and Personnel supporting tables. These are also documented in the metadata, and SQL table definitions for them have already been created and are included below.\nPlease express your table definition in SQL, but don’t worry about getting the SQL syntax exactly correct. This assignment is just a thought exercise. If you do want to try to write correct SQL, though, your may find it helpful to consult the DuckDB CREATE TABLE documentation.\nFinally, please provide some explanation for why you made the choices you did, and any questions or uncertainties you have. Don’t write an essay! Bullet points are sufficient. But do please explain your thought process." 
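For the Snow_survey assignment above, a table definition might begin along the following lines. Every column and constraint here is an illustrative guess at the kinds of choices the assignment is probing (the authoritative column list is in 01_ASDN_Readme.txt), not a model answer.

```python
import duckdb

conn = duckdb.connect("database.db")
# Illustrative guesses only: column names, types, ranges, and the
# foreign keys to Site and Personnel are assumptions to revisit once
# you have inspected the actual data and metadata.
conn.execute("""
    CREATE TABLE Snow_survey (
        Site TEXT NOT NULL REFERENCES Site (Code),
        Year INTEGER NOT NULL CHECK (Year BETWEEN 1993 AND 2014),
        Survey_date DATE NOT NULL,
        Snow_cover REAL CHECK (Snow_cover BETWEEN 0 AND 100),
        Observer TEXT REFERENCES Personnel (Abbreviation)
    )
""")
```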
}, { - "objectID": "modules/week01/index-01.html#homework", - "href": "modules/week01/index-01.html#homework", - "title": "Week 1 - Relational databases and data modeling", - "section": "Homework", - "text": "Homework\nCreate an ER diagram\nData modeling exercise" + "objectID": "modules/week01/hw-01-2.html#appendix", + "href": "modules/week01/hw-01-2.html#appendix", + "title": "Week 1 - Data modeling", + "section": "Appendix", + "text": "Appendix\nCREATE TABLE Species (\n Code TEXT PRIMARY KEY,\n Common_name TEXT UNIQUE NOT NULL,\n Scientific_name TEXT,\n Relevance TEXT\n);\n\nCREATE TABLE Site (\n Code TEXT PRIMARY KEY,\n Site_name TEXT UNIQUE NOT NULL,\n Location TEXT NOT NULL,\n Latitude REAL NOT NULL CHECK (Latitude BETWEEN -90 AND 90),\n Longitude REAL NOT NULL CHECK (Longitude BETWEEN -180 AND 180),\n \"Total_Study_Plot_Area_(ha)\" REAL NOT NULL\n CHECK (\"Total_Study_Plot_Area_(ha)\" > 0),\n UNIQUE (Latitude, Longitude)\n);\n\nCREATE TABLE Color_band_code (\n Code TEXT PRIMARY KEY,\n Color TEXT NOT NULL UNIQUE\n);\n\nCREATE TABLE Personnel (\n Abbreviation TEXT PRIMARY KEY,\n Name TEXT NOT NULL UNIQUE\n);" }, { - "objectID": "modules/week01/hw-01-1.html", - "href": "modules/week01/hw-01-1.html", - "title": "Week 1 - Create an ER diagram", + "objectID": "modules/hw_bonus.html", + "href": "modules/hw_bonus.html", + "title": "Bonus Homework", "section": "", - "text": "Please use Canvas to return the assignments: https://ucsb.instructure.com/courses/19301/assignments/236835\nCreate a physical ER (entity-relationship) diagram for the Harry Potter tables shown in class. It will be helpful to refer back to the slides.\nAs discussed briefly in class, a logical or conceptual ER diagram focuses on high-level abstractions, and doesn’t address how entities and relationships actually get implemented. In particular, in a logical ER diagram a many-to-many relationship between two entities might be represented by a simple line, even though in implementation a many-to-many relationship requires a separate table to store the relationship tuples. By contrast, a physical ER diagram describes actual tables. You are being asked to create a physical ER diagram.\nRequirements:\n\nYour diagram should include Student, House, Wand, Course, and Enrollment tables.\nEach table should list the name of the entity, any attributes, and which attribute(s) form the primary key, if there is one.\nA foreign key relationship from an attribute in one table to an attribute in another table should be indicated by a line between the two attributes. The ends of the lines should reflect the cardinalities at each end. See the example below.\n\nTwist #1! The slides shown in class demonstrated a many-to-one relationship between wands and students, i.e., one student might own multiple wands, but any given wand has only one owner. However, for this exercise, you are being asked to model a many-to-many relationship between wands and students (it happened in the books that the same wand was used by different students, though at different times, of course). To create a many-to-many relationship, you will need to invent an intermediate table that represents the student-wand ownership relation, in the same way the Enrollment table intermediates between the Student and Course tables.\nTwist #2! You must also store the date range (i.e., begin date and end date) of wand ownership. You will need to think where these date attributes belong. Are they attributes of a student? Of a wand? 
Of something else?\nVarious symbologies have been developed for ER diagrams. For this assignment, represent the “one” side of a many-to-one relationship by a single vertical bar, and represent the “many” side by a so-called crow’s foot. In the end, your diagram should visually resemble something like this:\n\nYou can use a tool like dbdiagram.io as was used to create the above diagram, or any other drawing tool. Or you can just draw it by hand and take a picture with your phone. Regardless of the method, be sure to indicate primary keys somehow (bold text, underlined text, add “PK” next to the attribute, etc., whatever works visually)." + "text": "One might wonder if egg volumes are larger during warmer months. Indeed, in Egg size variation within passerine clutches: effects of ambient temperature and laying sequence 1, the authors report that:\n\nSlight but statistically significant positive correlations were detected between daily temperatures (mostly mean and minimum) and egg size. The first eggs of the clutch were often affected by the temperatures occurring about a week before they were laid. These temperatures probably influence the development of the insects from eggs and pupae providing protein for the egg-forming female. The last eggs of the clutch tended to be affected by the temperatures prevailing one to three days before laying, i.e.occurring in the most intensive period of egg formation.\n\nThere are multiple factors at play here, including clutch size and laying order, and we don’t have much data to work with using our class database, but still, we can investigate if there is any change in average egg volume between the months of June and July, hypothesizing that July is warmer than June.\nPlease submit your SQL code and your Python or R notebook." }, { - "objectID": "modules/week10/index-10.html", - "href": "modules/week10/index-10.html", - "title": "Week 10 - Data licensing and publication", + "objectID": "modules/hw_bonus.html#relationship-between-egg-volume-and-time-of-the-year", + "href": "modules/hw_bonus.html#relationship-between-egg-volume-and-time-of-the-year", + "title": "Bonus Homework", "section": "", - "text": "Become familiar with data licensing in the academic context.\nDistinguish which research deliverables can or cannot be subject to copyright and explore alternatives to that.\nGain an understanding of the Creative Commons license family and the differences in their applicability.\n\n\n\n\nslides-10-part1.pptx" + "text": "One might wonder if egg volumes are larger during warmer months. Indeed, in Egg size variation within passerine clutches: effects of ambient temperature and laying sequence 1, the authors report that:\n\nSlight but statistically significant positive correlations were detected between daily temperatures (mostly mean and minimum) and egg size. The first eggs of the clutch were often affected by the temperatures occurring about a week before they were laid. These temperatures probably influence the development of the insects from eggs and pupae providing protein for the egg-forming female. 
The last eggs of the clutch tended to be affected by the temperatures prevailing one to three days before laying, i.e.occurring in the most intensive period of egg formation.\n\nThere are multiple factors at play here, including clutch size and laying order, and we don’t have much data to work with using our class database, but still, we can investigate if there is any change in average egg volume between the months of June and July, hypothesizing that July is warmer than June.\nPlease submit your SQL code and your Python or R notebook." }, { - "objectID": "modules/week10/index-10.html#part-i---data-licensing", - "href": "modules/week10/index-10.html#part-i---data-licensing", - "title": "Week 10 - Data licensing and publication", - "section": "", - "text": "Become familiar with data licensing in the academic context.\nDistinguish which research deliverables can or cannot be subject to copyright and explore alternatives to that.\nGain an understanding of the Creative Commons license family and the differences in their applicability.\n\n\n\n\nslides-10-part1.pptx" + "objectID": "modules/hw_bonus.html#step-1", + "href": "modules/hw_bonus.html#step-1", + "title": "Bonus Homework", + "section": "Step 1", + "text": "Step 1\nCreate a query to compute and group average egg volume by species and month. As before, use for volume the formula\n\\[{\\pi \\over 6} W^2 L\\]\nwhere \\(W\\) is egg width and \\(L\\) is egg length, and use 3.14 for \\(\\pi\\). Call this table T." }, { - "objectID": "modules/week10/index-10.html#part-ii---data-publication", - "href": "modules/week10/index-10.html#part-ii---data-publication", - "title": "Week 10 - Data licensing and publication", - "section": "Part II - Data publication", - "text": "Part II - Data publication\n\nLearning goals\n\nUnderstand the importance of publishing research data.\nIdentify and select appropriate approaches to data publication.\nExplain the role of persistent identifiers.\n\n\n\nSlides\nslides-10-part2.pptx" + "objectID": "modules/hw_bonus.html#step-2", + "href": "modules/hw_bonus.html#step-2", + "title": "Bonus Homework", + "section": "Step 2", + "text": "Step 2\nLooking at table T, you’ll notice that we have egg data for months 6 and 7 for most species, but there is one species for which there is only data for month 6. We want to exclude all such species since there will be nothing to plot for them. How to do that? Here’s a hint. First, create a query that identifies the set of species having 2 rows in T. Then, select the rows from T where the species is in the aforementioned set.\nJoin this reduced table with the Species table to grab scientific names, and write out to a CSV file." }, { - "objectID": "modules/week10/index-10.html#suggested-readings", - "href": "modules/week10/index-10.html#suggested-readings", - "title": "Week 10 - Data licensing and publication", - "section": "Suggested readings*", - "text": "Suggested readings*\n\nCarroll, M. W. (2015) Sharing Research Data and Intellectual Property Law: A Primer. PLoS Biol 13(8): e1002235. https://doi.org/10.1371/journal.pbio.1002235\nFay, C. (2019). Licensing R. https://thinkr-open.github.io/licensing-r\nReitz, K., & Schlusser, T. (2022). The Hitchhiker’s guide to Python: best practices for development. ” O’Reilly Media, Inc.”. https://docs.python-guide.org/writing/license\n\n*Useful links and other supporting materials are noted in the slides." 
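Steps 1 and 2 of the bonus homework above can be sketched in one pass as follows. Assumptions to verify against the schema: eggs join to Bird_nests via Nest_ID, the month comes from the nest's Date_found, and Species is a column of Bird_nests.

```python
import duckdb

conn = duckdb.connect("database.db")
# Step 1: average egg volume per species and month, as temp table T.
conn.execute("""
    CREATE TEMP TABLE T AS
    SELECT n.Species,
           MONTH(n.Date_found) AS Month_num,
           AVG(3.14 / 6 * e.Width * e.Width * e.Length) AS Avg_volume
    FROM Bird_eggs AS e JOIN Bird_nests AS n USING (Nest_ID)
    GROUP BY n.Species, MONTH(n.Date_found)
""")
# Step 2: keep only species having rows for both months (2 rows in T),
# pick up scientific names, and write the result to CSV for plotting.
df = conn.execute("""
    SELECT s.Scientific_name, t.Month_num, t.Avg_volume
    FROM T AS t JOIN Species AS s ON t.Species = s.Code
    WHERE t.Species IN
        (SELECT Species FROM T GROUP BY Species HAVING COUNT(*) = 2)
    ORDER BY s.Scientific_name, t.Month_num
""").df()
df.to_csv("egg_volume_by_month.csv", index=False)
```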
+ "objectID": "modules/hw_bonus.html#step-3", + "href": "modules/hw_bonus.html#step-3", + "title": "Bonus Homework", + "section": "Step 3", + "text": "Step 3\nUse R or Python to plot average egg volume as a function of month, by species. An example is shown below." }, { - "objectID": "modules/week10/index-10.html#homework", - "href": "modules/week10/index-10.html#homework", - "title": "Week 10 - Data licensing and publication", - "section": "Homework", - "text": "Homework\nNo homework this week!" + "objectID": "modules/hw_bonus.html#footnotes", + "href": "modules/hw_bonus.html#footnotes", + "title": "Bonus Homework", + "section": "Footnotes", + "text": "Footnotes\n\n\nMikko Ojanen, Markku Orell, and Risto A. Väisänen (1981). Egg size variation within passerine clutches: effects of ambient temperature and laying sequence. Ornis Fennica 58:93-108. https://ornisfennica.journal.fi/article/view/133071↩︎" }, { - "objectID": "modules/week09/whale-sdcexercise.html", - "href": "modules/week09/whale-sdcexercise.html", - "title": "sdcmicro-exercise", + "objectID": "modules/week09/hw-09.html", + "href": "modules/week09/hw-09.html", + "title": "Week 9 - Whale entanglement sdcMicro exercise", "section": "", - "text": "Whale Entanglement sdcMicro Exercise\nYour team acquired a dataset* whale-sdc.csv from researchers working with whale entanglement data on the West Coast. The dataset contains both direct and indirect identifiers. Your task is to assess the risk of re-identification of the fisheries associated with the cases before considering public release. Then, you should test one technique and apply k-anonymization to help lower the disclosure risk as well as compute the information loss.\nPlease complete this exercise in pairs or groups of three. Each group should download the dataset and complete the rmd file, including the code and answering the questions. Remember to include your names in the YAML.\n*This dataset was purposefully adapted exclusively for instruction use.\n\nSetup\n\n\nPackage & Data\n\n\nInspect the Dataset\n\n\nQ1. How many direct identifiers are present in this dataset? What are they?\nA:\n\n\nQ2. What attributes would you consider quasi-identifiers? Why?\nA:\n\n\nQ3. What types of variables are they? Define them. (numeric, integer, factor or string)\nMake sure to have them set correctly.\n\n\n4 Considering your answers to questions 1, 2 and 3 create a SDC problem.\n\n\nQ4.1 What is the risk of re-identification for this dataset?\n\n\nQ4.2 To what extent does this dataset violate k-anonymity?\n\n\n5. Consider techniques that could reduce the risk of re-identification.\n\n\nQ5.1 Apply one non-perturbative method to a variable of your choice. How effective was it in lowering the disclosure risk?\n\n\nQ5.2 Apply ( k-3) anonymization to this dataset.\n\n\nQ6. Compute the information loss for the de-identified version of the dataset." + "text": "Your team has successfully obtained a dataset1 that encompasses whale entanglement data associated with specific fisheries along the West Coast. This dataset, named whale-sdc.csv, and an accompanying file called whale-exercise.Rmd.\nIn groups of two or three, your task is to thoroughly examine the dataset and complete the provided R Markdown file. This entails implementing the necessary code and addressing the given questions. To ensure proper identification, please include the names of all participating members in the YAML header before submitting the modified R Markdown file." 
+ }, + { + "objectID": "modules/week09/hw-09.html#footnotes", + "href": "modules/week09/hw-09.html#footnotes", + "title": "Week 9 - Whale entanglement sdcMicro exercise", + "section": "Footnotes", + "text": "Footnotes\n\n\nThis dataset was purposefully adapted exclusively for instruction use.↩︎" }, { "objectID": "modules/week05/hw-05-3.html", diff --git a/sitemap.xml b/sitemap.xml index 3f4b424..f80213b 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,178 +2,174 @@ https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week02/index-02.html - 2024-05-22T00:51:14.754Z + 2024-05-22T00:52:09.937Z https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week07/index-07.html - 2024-05-22T00:51:14.778Z + 2024-05-22T00:52:09.961Z https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week07/hw-07-2.html - 2024-05-22T00:51:14.778Z + 2024-05-22T00:52:09.961Z https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week05/bash-essentials.html - 2024-05-22T00:51:14.766Z + 2024-05-22T00:52:09.949Z https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week05/hw-05-4.html - 2024-05-22T00:51:14.766Z + 2024-05-22T00:52:09.949Z https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week05/hw-05-2.html - 2024-05-22T00:51:14.766Z + 2024-05-22T00:52:09.949Z https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week09/index-09.html - 2024-05-22T00:51:14.862Z + 2024-05-22T00:52:10.049Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week09/hw-09.html - 2024-05-22T00:51:14.862Z - - - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/hw_bonus.html - 2024-05-22T00:51:14.746Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week10/index-10.html + 2024-05-22T00:52:10.069Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week01/hw-01-2.html - 2024-05-22T00:51:14.750Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week01/hw-01-1.html + 2024-05-22T00:52:09.933Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week03/hw-03-3.html - 2024-05-22T00:51:14.754Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week01/index-01.html + 2024-05-22T00:52:09.933Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week03/hw-03-1.html - 2024-05-22T00:51:14.754Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week03/hw-03-2.html + 2024-05-22T00:52:09.937Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week06/python-programming.html - 2024-05-22T00:51:14.774Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week03/index-03.html + 2024-05-22T00:52:09.937Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week06/hw-06-1.html - 2024-05-22T00:51:14.774Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week06/index-06.html + 2024-05-22T00:52:09.957Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week06/python-programming-cont.html - 2024-05-22T00:51:14.774Z + 
https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week06/hw-06-2.html + 2024-05-22T00:52:09.957Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week04/hw-04-3.html - 2024-05-22T00:51:14.766Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week06/hw-06-3.html + 2024-05-22T00:52:09.957Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week04/index-04.html - 2024-05-22T00:51:14.766Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week04/hw-04-1.html + 2024-05-22T00:52:09.949Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week08/case-a.html - 2024-05-22T00:51:14.846Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week04/hw-04-2.html + 2024-05-22T00:52:09.949Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week08/case-d.html - 2024-05-22T00:51:14.846Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week08/case-b.html + 2024-05-22T00:52:10.029Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week08/index-08.html - 2024-05-22T00:51:14.846Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week08/case-c.html + 2024-05-22T00:52:10.029Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/syllabus.html - 2024-05-22T00:51:14.938Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/index.html + 2024-05-22T00:52:09.929Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/resources.html - 2024-05-22T00:51:14.938Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/installing-duckdb.html + 2024-05-22T00:52:09.929Z https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/about.html - 2024-05-22T00:51:14.742Z + 2024-05-22T00:52:09.929Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/installing-duckdb.html - 2024-05-22T00:51:14.746Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/resources.html + 2024-05-22T00:52:10.125Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/index.html - 2024-05-22T00:51:14.746Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/syllabus.html + 2024-05-22T00:52:10.125Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week08/case-c.html - 2024-05-22T00:51:14.846Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week08/index-08.html + 2024-05-22T00:52:10.029Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week08/case-b.html - 2024-05-22T00:51:14.846Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week08/case-d.html + 2024-05-22T00:52:10.029Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week04/hw-04-2.html - 2024-05-22T00:51:14.766Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week08/case-a.html + 2024-05-22T00:52:10.029Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week04/hw-04-1.html - 
2024-05-22T00:51:14.766Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week04/index-04.html + 2024-05-22T00:52:09.949Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week06/hw-06-3.html - 2024-05-22T00:51:14.774Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week04/hw-04-3.html + 2024-05-22T00:52:09.949Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week06/hw-06-2.html - 2024-05-22T00:51:14.774Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week06/python-programming-cont.html + 2024-05-22T00:52:09.957Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week06/index-06.html - 2024-05-22T00:51:14.774Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week06/hw-06-1.html + 2024-05-22T00:52:09.957Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week03/index-03.html - 2024-05-22T00:51:14.754Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week06/python-programming.html + 2024-05-22T00:52:09.957Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week03/hw-03-2.html - 2024-05-22T00:51:14.754Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week03/hw-03-1.html + 2024-05-22T00:52:09.937Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week01/index-01.html - 2024-05-22T00:51:14.750Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week03/hw-03-3.html + 2024-05-22T00:52:09.937Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week01/hw-01-1.html - 2024-05-22T00:51:14.750Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week01/hw-01-2.html + 2024-05-22T00:52:09.933Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week10/index-10.html - 2024-05-22T00:51:14.886Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/hw_bonus.html + 2024-05-22T00:52:09.929Z - https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week09/whale-sdcexercise.html - 2024-05-22T00:51:14.886Z + https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week09/hw-09.html + 2024-05-22T00:52:10.049Z https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week05/hw-05-3.html - 2024-05-22T00:51:14.766Z + 2024-05-22T00:52:09.949Z https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week05/hw-05-1.html - 2024-05-22T00:51:14.766Z + 2024-05-22T00:52:09.949Z https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week05/index-05.html - 2024-05-22T00:51:14.766Z + 2024-05-22T00:52:09.949Z https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week07/virtual-environments-notes.html - 2024-05-22T00:51:14.846Z + 2024-05-22T00:52:10.029Z https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week07/hw-07-1.html - 2024-05-22T00:51:14.778Z + 2024-05-22T00:52:09.961Z 
https://UCSB-Library-Research-Data-Services.github.io/bren-meds213-spring-2024/modules/week02/hw-02.html - 2024-05-22T00:51:14.754Z + 2024-05-22T00:52:09.937Z