Commit

Updated README and examples
davidmezzetti committed Aug 11, 2020
1 parent ff6c0aa commit 8ca4a1f
Showing 3 changed files with 106 additions and 106 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -23,7 +23,7 @@ The easiest way to install is via pip and PyPI

pip install txtai

-You can also install txtai directly from GitHub using pip. Using a Python Virtual Environment is recommended.
+You can also install txtai directly from GitHub. Using a Python Virtual Environment is recommended.

pip install git+https://github.com/neuml/txtai

104 changes: 52 additions & 52 deletions examples/03_Build_an_Embeddings_index_from_a_data_source.ipynb
@@ -78,7 +78,7 @@
"base_uri": "https://localhost:8080/",
"height": 228
},
-"outputId": "6bbacf83-d695-42ec-cce0-0f92b1534ca4"
+"outputId": "5088c2f3-e47b-4026-a306-519b51858be8"
},
"source": [
"!wget https://www.kaggleusercontent.com/kf/40510829/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..FX7Ote_I-Y88MBPQHRIdUQ.tTr7P3B_eUL_yWN33Usz0Rk1KXtc4DjT_cdjkl5W4WbEcZ-0FJX2jSWTHYMVACtLYuMJJrf6eJN28OzWhDMnTysBu3wfDrd4ly5bu_wJKCnZajICQgQHs_b8hbRVMOzfdG6xEyl9CVYnZNU2cI3QuOshcWxoB0skdKD4d26O_Q4e_nrd8DqEixP47tI2Hu1F00w0vMykzgNwp7SwQ2Z9HoNCO8HtmcjEHq0A4lZ4303YkpjORtZQEO3S-j54fFlIAahT-9VvsFNofitK5VAlR0EyG9r3cOqh2LQDCL7kj5p3MxG8dvHmrTqggLVOwiuKHUIH8u59TemSMLsNRS29W-5fFlHfaItV4dEuiBxCIgQXHcKUDCDGEjeFcPgqpJnNHsnh0pebWDuRQR_fdQ-r8mWgN9qLnosrFBak9tM25G7gqxyUI90GMWAUyP4yj2EAEc8asX9rUsirC8QDHmrmOCUe0cmZvodRUi0ss7lTiLTwm55d9VPXjQn4jQ6tFs-dmjXEx0AwF2Mw1c1jhgzCXwgQj6ybUKemr_6wj1VFYj3VVvCXpk1nZObl-IB6-m7v5CIoXGLot_KFsVtyItRk-wX-B_L3W3aS9dOIfb7bX4s5_aNzXaDKvxrcafwlOQui.vS_FL4EArO8rkBo3xpDF2w/articles.sqlite"
@@ -88,16 +88,16 @@
{
"output_type": "stream",
"text": [
-"--2020-08-11 15:42:03-- https://www.kaggleusercontent.com/kf/40510829/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..FX7Ote_I-Y88MBPQHRIdUQ.tTr7P3B_eUL_yWN33Usz0Rk1KXtc4DjT_cdjkl5W4WbEcZ-0FJX2jSWTHYMVACtLYuMJJrf6eJN28OzWhDMnTysBu3wfDrd4ly5bu_wJKCnZajICQgQHs_b8hbRVMOzfdG6xEyl9CVYnZNU2cI3QuOshcWxoB0skdKD4d26O_Q4e_nrd8DqEixP47tI2Hu1F00w0vMykzgNwp7SwQ2Z9HoNCO8HtmcjEHq0A4lZ4303YkpjORtZQEO3S-j54fFlIAahT-9VvsFNofitK5VAlR0EyG9r3cOqh2LQDCL7kj5p3MxG8dvHmrTqggLVOwiuKHUIH8u59TemSMLsNRS29W-5fFlHfaItV4dEuiBxCIgQXHcKUDCDGEjeFcPgqpJnNHsnh0pebWDuRQR_fdQ-r8mWgN9qLnosrFBak9tM25G7gqxyUI90GMWAUyP4yj2EAEc8asX9rUsirC8QDHmrmOCUe0cmZvodRUi0ss7lTiLTwm55d9VPXjQn4jQ6tFs-dmjXEx0AwF2Mw1c1jhgzCXwgQj6ybUKemr_6wj1VFYj3VVvCXpk1nZObl-IB6-m7v5CIoXGLot_KFsVtyItRk-wX-B_L3W3aS9dOIfb7bX4s5_aNzXaDKvxrcafwlOQui.vS_FL4EArO8rkBo3xpDF2w/articles.sqlite\n",
+"--2020-08-11 16:29:37-- https://www.kaggleusercontent.com/kf/40510829/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..FX7Ote_I-Y88MBPQHRIdUQ.tTr7P3B_eUL_yWN33Usz0Rk1KXtc4DjT_cdjkl5W4WbEcZ-0FJX2jSWTHYMVACtLYuMJJrf6eJN28OzWhDMnTysBu3wfDrd4ly5bu_wJKCnZajICQgQHs_b8hbRVMOzfdG6xEyl9CVYnZNU2cI3QuOshcWxoB0skdKD4d26O_Q4e_nrd8DqEixP47tI2Hu1F00w0vMykzgNwp7SwQ2Z9HoNCO8HtmcjEHq0A4lZ4303YkpjORtZQEO3S-j54fFlIAahT-9VvsFNofitK5VAlR0EyG9r3cOqh2LQDCL7kj5p3MxG8dvHmrTqggLVOwiuKHUIH8u59TemSMLsNRS29W-5fFlHfaItV4dEuiBxCIgQXHcKUDCDGEjeFcPgqpJnNHsnh0pebWDuRQR_fdQ-r8mWgN9qLnosrFBak9tM25G7gqxyUI90GMWAUyP4yj2EAEc8asX9rUsirC8QDHmrmOCUe0cmZvodRUi0ss7lTiLTwm55d9VPXjQn4jQ6tFs-dmjXEx0AwF2Mw1c1jhgzCXwgQj6ybUKemr_6wj1VFYj3VVvCXpk1nZObl-IB6-m7v5CIoXGLot_KFsVtyItRk-wX-B_L3W3aS9dOIfb7bX4s5_aNzXaDKvxrcafwlOQui.vS_FL4EArO8rkBo3xpDF2w/articles.sqlite\n",
"Resolving www.kaggleusercontent.com (www.kaggleusercontent.com)... 35.190.26.106\n",
"Connecting to www.kaggleusercontent.com (www.kaggleusercontent.com)|35.190.26.106|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 8065024 (7.7M) [application/octet-stream]\n",
"Saving to: ‘articles.sqlite’\n",
"\n",
-"\rarticles.sqlite 0%[ ] 0 --.-KB/s \rarticles.sqlite 100%[===================>] 7.69M --.-KB/s in 0.07s \n",
+"\rarticles.sqlite 0%[ ] 0 --.-KB/s \rarticles.sqlite 100%[===================>] 7.69M --.-KB/s in 0.08s \n",
"\n",
-"2020-08-11 15:42:03 (109 MB/s) - ‘articles.sqlite’ saved [8065024/8065024]\n",
+"2020-08-11 16:29:38 (101 MB/s) - ‘articles.sqlite’ saved [8065024/8065024]\n",
"\n"
],
"name": "stdout"
@@ -130,7 +130,7 @@
"base_uri": "https://localhost:8080/",
"height": 156
},
-"outputId": "67406a51-0819-4845-84da-63d0ad6677f1"
+"outputId": "6bb1973b-b8c3-483c-83ce-e234360f48df"
},
"source": [
"import os\n",
@@ -175,9 +175,9 @@
"Building 300 dimension model\n",
"Converting vectors to magnitude format\n",
"total 9024\n",
-"-rw-r--r-- 1 root root 8065024 Aug 11 15:42 articles.sqlite\n",
-"-rw-r--r-- 1 root root 360448 Aug 11 15:43 cord19-300d.magnitude\n",
-"-rw-r--r-- 1 root root 807886 Aug 11 15:43 cord19-300d.txt\n",
+"-rw-r--r-- 1 root root 8065024 Aug 11 16:29 articles.sqlite\n",
+"-rw-r--r-- 1 root root 360448 Aug 11 16:30 cord19-300d.magnitude\n",
+"-rw-r--r-- 1 root root 807886 Aug 11 16:30 cord19-300d.txt\n",
"drwxr-xr-x 1 root root 4096 Jul 30 16:30 sample_data\n"
],
"name": "stdout"
@@ -205,7 +205,7 @@
"base_uri": "https://localhost:8080/",
"height": 52
},
-"outputId": "f00ead87-aeaa-4243-8b68-663c01f30520"
+"outputId": "c0ed68b5-0ae2-4d05-eb49-7d02abc978e5"
},
"source": [
"import sqlite3\n",
@@ -281,7 +281,7 @@
"source": [
"# Query data\n",
"\n",
-"The following runs a query against the embeddings index for the terms \"comorbidities risk factors\". It finds the top 5 matches and returns the corresponding documents associated with each match."
+"The following runs a query against the embeddings index for the terms \"risk factors\". It finds the top 5 matches and returns the corresponding documents associated with each match."
]
},
{
@@ -293,7 +293,7 @@
"base_uri": "https://localhost:8080/",
"height": 293
},
-"outputId": "d0e9afef-f761-450c-ec9b-2724b6aa0922"
+"outputId": "77225b0a-0bed-42f9-a850-ebc58629fc95"
},
"source": [
"import pandas as pd\n",
@@ -306,7 +306,7 @@
"cur = db.cursor()\n",
"\n",
"results = []\n",
-"for uid, score in embeddings.search(\"comorbidities risk factors\", 5):\n",
+"for uid, score in embeddings.search(\"risk factors\", 5):\n",
" cur.execute(\"SELECT article, text FROM sections WHERE id = ?\", [uid])\n",
" uid, text = cur.fetchone()\n",
"\n",
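
Review note: the loop in this hunk pairs each search hit with its row in SQLite. The sketch below reproduces that pattern in isolation so it can be run without an embeddings index. The `search` function is a hypothetical stand-in for `embeddings.search` and returns hard-coded `(id, score)` pairs; the `lookup` helper, the in-memory table, and its rows are likewise assumptions made for illustration, not part of the notebook.

```python
import sqlite3


def search(query, limit):
    """Hypothetical stand-in for embeddings.search: returns (id, score) pairs."""
    ranked = [(2, 0.91), (0, 0.78), (1, 0.40)]
    return ranked[:limit]


def lookup(db, query, limit):
    """Resolve each (id, score) hit to its article id and section text."""
    cur = db.cursor()
    results = []
    for uid, score in search(query, limit):
        # Same parameterised query shape as the notebook cell
        cur.execute("SELECT article, text FROM sections WHERE id = ?", [uid])
        article, text = cur.fetchone()
        results.append((article, text, score))
    return results


# Illustrative in-memory table; the real notebook reads articles.sqlite
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sections (id INTEGER PRIMARY KEY, article TEXT, text TEXT)")
db.executemany(
    "INSERT INTO sections VALUES (?, ?, ?)",
    [
        (0, "a1", "Diabetes was reported as a risk factor."),
        (1, "a2", "Samples were stored at room temperature."),
        (2, "a3", "Risk factors included age and hypertension."),
    ],
)

rows = lookup(db, "risk factors", 2)
```

Results come back in the order the search ranks them, so the highest-scoring section is first.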
@@ -349,22 +349,22 @@
" <td>The identification of risk factors for contracting COVID-19 is crucial, to inform public health policy and to facilitate the appropriate distribution of healthcare resources.</td>\n",
" </tr>\n",
" <tr>\n",
- <td>Does apolipoprotein E genotype predict COVID-19 severity?</td>\n",
- <td>2020-04-27 00:00:00</td>\n",
- <td>https://doi.org/10.1093/qjmed/hcaa142</td>\n",
- <td>Of interest, apoE4 has also been associated with some of the comorbid risk factors associated with severe COVID-19, such as atherosclerosis and hypertension .</td>\n",
+ <td>Quantitative evaluation of olfactory dysfunction in hospitalized patients with Coronavirus [2] (COVID-19)</td>\n",
+ <td>2020-05-25 00:00:00</td>\n",
+ <td>https://www.ncbi.nlm.nih.gov/pubmed/32451613/</td>\n",
+ <td>In addition, these reports included patients with minor COVID-19 symptoms and low-risk factor burden.</td>\n",
 </tr>\n",
 <tr>\n",
- <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n",
- <td>2020-07-23 00:00:00</td>\n",
- <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n",
- <td>Number of comorbidity groupings were then summed, and categorised (0–1; 2; 3 or ≥ 4 comorbidity groupings).</td>\n",
+ <td>COVID-19 from the perspective of urban and rural general adult mental health services</td>\n",
+ <td>2020-05-21 00:00:00</td>\n",
+ <td>https://doi.org/10.1017/ipm.2020.62</td>\n",
+ <td>At-risk groups among staff members and service users were identified early and prioritised in service changes.</td>\n",
 </tr>\n",
 <tr>\n",
- <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n",
- <td>2020-07-23 00:00:00</td>\n",
- <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n",
- <td>Number of comorbidity groupings were then summed, and categorised (0-1; 2; 3 or ≥ 4 comorbidity groupings).</td>\n",
+ <td>Management of osteoarthritis during COVID‐19 pandemic</td>\n",
+ <td>2020-05-21 00:00:00</td>\n",
+ <td>https://doi.org/10.1002/cpt.1910</td>\n",
+ <td>Consistently, a recent report indicated diabetes as a risk factor significantly associated with COVID-19 unfavourable clinical outcomes (37) .</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>"
@@ -417,30 +417,30 @@
"base_uri": "https://localhost:8080/",
"height": 293
},
-"outputId": "15c337b5-7c6d-4b5e-98aa-e3996a799737"
+"outputId": "d01afea0-63a1-4e8e-806b-c9f6702b43b9"
},
"source": [
"db = sqlite3.connect(\"articles.sqlite\")\n",
"cur = db.cursor()\n",
"\n",
"results = []\n",
-"for uid, score in embeddings.search(\"comorbidities risk factors\", 5):\n",
+"for uid, score in embeddings.search(\"risk factors\", 5):\n",
" cur.execute(\"SELECT article, text FROM sections WHERE id = ?\", [uid])\n",
" uid, text = cur.fetchone()\n",
"\n",
" # Get list of document text sections to use for the context\n",
" cur.execute(\"SELECT Id, Name, Text FROM sections WHERE (labels is null or labels NOT IN ('FRAGMENT', 'QUESTION')) AND article = ?\", [uid])\n",
" sections = []\n",
-" for sid, name, text in cur.fetchall():\n",
+" for sid, name, txt in cur.fetchall():\n",
" if not name or not re.search(r\"background|(?<!.*?results.*?)discussion|introduction|reference\", name.lower()):\n",
-" sections.append((sid, text))\n",
+" sections.append((sid, txt))\n",
"\n",
" cur.execute(\"SELECT Title, Published, Reference from articles where id = ?\", [uid])\n",
" article = cur.fetchone()\n",
"\n",
" # Use QA extractor to derive additional columns\n",
-" answers = extractor(sections, [(\"Risk Factors\", \"risk factor\", \"What are names of risk factors?\", False),\n",
-" (\"Locations\", \"city country state\", \"What locations?\", False)])\n",
+" answers = extractor(sections, [(\"Risk Factors\", \"risk factors\", \"What risk factors?\", False),\n",
+" (\"Locations\", \"hospital country\", \"What locations?\", False)])\n",
"\n",
" results.append(article + (text,) + tuple([answer[1] for answer in answers]))\n",
"\n",
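
Review note: the `text` to `txt` rename in this hunk looks like more than a cosmetic change. Before the commit, the inner loop reused the name `text`, so by the time `results.append(article + (text,) + ...)` ran, `text` held the last section fetched rather than the matched sentence. That appears to be why the text column changes in the output hunk that follows. The sketch below isolates the effect; `collect` and its `shadow` flag are hypothetical names introduced only for this illustration.

```python
def collect(match_text, section_texts, shadow):
    """Return the text that would end up in the results row, plus the sections.

    shadow=True mimics the pre-commit loop, which reused the name `text`.
    shadow=False mimics the post-commit loop, which uses `txt` instead.
    """
    text = match_text
    sections = []
    if shadow:
        # Loop variable rebinds `text`, overwriting the matched sentence
        for sid, text in enumerate(section_texts):
            sections.append((sid, text))
    else:
        # Distinct loop variable leaves `text` untouched
        for sid, txt in enumerate(section_texts):
            sections.append((sid, txt))
    return text, sections


match = "matched sentence"
body = ["first section", "last section"]

before, _ = collect(match, body, shadow=True)
after, _ = collect(match, body, shadow=False)
```

With the shadowed name, `before` ends up as the last section's text; with the renamed variable, `after` is still the matched sentence.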
@@ -472,40 +472,40 @@
" <td>Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection</td>\n",
" <td>2020-04-24 00:00:00</td>\n",
" <td>http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1</td>\n",
- <td>Stratified by Troponin Levels, N = 2736 All Patients All rights reserved.</td>\n",
- <td>no CVD, and neither CVD nor risk factors</td>\n",
- <td>Mount Sinai Health System</td>\n",
+ <td>This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.</td>\n",
+ <td>neither CVD nor risk factors</td>\n",
+ <td>New York City hospitals</td>\n",
" </tr>\n",
" <tr>\n",
" <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n",
" <td>2020-07-23 00:00:00</td>\n",
" <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n",
- <td>The age range at baseline within our sample was 40–69 years, but it is important to note that this was during 2006–2010, and the age range at COVID-19 diagnosis was substantially older (50–84 years).</td>\n",
+ <td>The identification of risk factors for contracting COVID-19 is crucial, to inform public health policy and to facilitate the appropriate distribution of healthcare resources.</td>\n",
 <td>Frailty and multimorbidity</td>\n",
- <td>None</td>\n",
+ <td>hospital settings</td>\n",
" </tr>\n",
" <tr>\n",
- <td>Does apolipoprotein E genotype predict COVID-19 severity?</td>\n",
- <td>2020-04-27 00:00:00</td>\n",
- <td>https://doi.org/10.1093/qjmed/hcaa142</td>\n",
- <td>If so, this group could be targeted more aggressively from the outset of the disease.</td>\n",
- <td>None</td>\n",
- <td>None</td>\n",
+ <td>Quantitative evaluation of olfactory dysfunction in hospitalized patients with Coronavirus [2] (COVID-19)</td>\n",
+ <td>2020-05-25 00:00:00</td>\n",
+ <td>https://www.ncbi.nlm.nih.gov/pubmed/32451613/</td>\n",
+ <td>In addition, these reports included patients with minor COVID-19 symptoms and low-risk factor burden.</td>\n",
+ <td>patients with minor COVID-19 symptoms and low-risk factor burden</td>\n",
+ <td>COVID-19 wards</td>\n",
" </tr>\n",
" <tr>\n",
- <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n",
- <td>2020-07-23 00:00:00</td>\n",
- <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n",
- <td>The age range at baseline within our sample was 40–69 years, but it is important to note that this was during 2006–2010, and the age range at COVID-19 diagnosis was substantially older (50–84 years).</td>\n",
- <td>Frailty and multimorbidity</td>\n",
- <td>None</td>\n",
+ <td>COVID-19 from the perspective of urban and rural general adult mental health services</td>\n",
+ <td>2020-05-21 00:00:00</td>\n",
+ <td>https://doi.org/10.1017/ipm.2020.62</td>\n",
+ <td>At-risk groups among staff members and service users were identified early and prioritised in service changes.</td>\n",
+ <td>At-risk groups among staff members and service users</td>\n",
+ <td>rural regions</td>\n",
" </tr>\n",
" <tr>\n",
- <td>COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants</td>\n",
- <td>2020-07-23 00:00:00</td>\n",
- <td>https://www.ncbi.nlm.nih.gov/pubmed/32705587/</td>\n",
- <td>The age range at baseline within our sample was 40–69 years, but it is important to note that this was during 2006–2010, and the age range at COVID-19 diagnosis was substantially older (50–84 years).</td>\n",
- <td>Frailty and multimorbidity</td>\n",
+ <td>Management of osteoarthritis during COVID‐19 pandemic</td>\n",
+ <td>2020-05-21 00:00:00</td>\n",
+ <td>https://doi.org/10.1002/cpt.1910</td>\n",
+ <td>Consistently, a recent report indicated diabetes as a risk factor significantly associated with COVID-19 unfavourable clinical outcomes (37) .</td>\n",
+ <td>sex, obesity, genetic factors and mechanical factors</td>\n",
" <td>None</td>\n",
" </tr>\n",
" </tbody>\n",
