Explorer revamp #428

sal-uva · 2024-04-23T14:56:25Z

This PR revamps how the Explorer works and looks. It specifically does the following:

Adds a new OPTION_DATASOURCES_TABLE user input that creates a table with dynamic columns for each enabled dataset. Input fields per row can be text, dropdown, or checkbox.
Uses this table to create a new Settings page where the Explorer can be enabled per data source (more table options can be added later).
Simplifies how custom data source templates for the Explorer are handled: they are now composed of CSS files (in static/css/explorer/) and Jinja2 templates (in webtool/templates/explorer/datasource-templates/) instead of CSS and JSON files in the data source folders that need to be verified and parsed.
Integrates the Explorer with the UI of 4CAT.
Makes the Explorer use iterate_items.
Re-integrates sorting so that all dataset columns can be used for sorting in the Explorer.
Enables new annotation columns for sorting, filtering, and other features.
Adds Explorer templates for Twitter and Instagram (other data sources will soon follow).
Deletes much unnecessary code.

Note that some unused code is still present for future updates with respect to 4CAT scrapers and database-accessible data sources generally.

dale-wahl

I looked over the backend code and only noticed one real issue (in an edge case). I ran this version and tested out the Explorer on a number of datasets (instagram, custom, telegram, tumblr, tiktok, youtube, reddit). It looks good! Sort works well. Reddit was missing the "subject" field (it's probably the only dataset that uses subject anymore). Telegram has an issue which I will post separately.

I tested saving annotations and writing them to datasets. This worked for me (and broke one with my edge case 😬; see comment). I did notice that the new fields show up in the Dataset preview view, but the values saved to the database do not show up in preview. The values do show up after you have run "write annotations".

Deactivating and activating settings seems to work fine. There is an explorerflask settings group that could probably be merged with the Explorer group.

If you want to merge now, I would deactivate Telegram as a default (till addressed) and consider how to address my comment re: field names for annotations.

backend/lib/processor.py

dale-wahl · 2024-04-23T18:40:32Z

datasources/ninegag/search_9gag.py

@@ -8,6 +8,7 @@

 from backend.lib.search import Search
 from common.lib.item_mapping import MappedItem
+from common.lib.helpers import UserInput


We're importing UserInput in a few datasources unnecessarily. Probably an oversight and otherwise has no effect.

True, we should do some cleanup.

dale-wahl · 2024-04-23T18:43:06Z

processors/filtering/write_annotations.py

@@ -101,7 +102,7 @@ def process(self):

 		# Write to top dataset
 		for label, values in new_data.items():
-			self.add_field_to_parent("annotation_" + label, values, which_parent=self.source_dataset, update_existing=True)
+			self.add_field_to_parent(label, values, which_parent=self.source_dataset, update_existing=True)


The add_field_to_parent function does not check for existing fields. If a user creates a field called "username", they will overwrite an existing field with the same name. If I recall, I could not figure out how to check whether an existing column already had that name, because the add_field_to_parent function needs to be able to update existing annotation fields. This is just a bit dangerous.

Tested on a dataset by creating a field called "author", adding some values, and writing to the dataset. I was able to overwrite the original "author" field (which in my case was actually a dictionary of author-related data, which caused map item to break). I recommend reverting this change. We could even add 4CAT_annotation_ or something so that it would be virtually impossible for raw data to contain that field name.

This is indeed an oversight for now, though I would like to have the option for annotation fields to have a 'clean' name; long names are quickly unreadable in spreadsheet software. This can be resolved by initially checking whether an annotation field key already exists in the dataset columns or, when annotated datasets are filtered and create a new dataset, if it is not a field registered in the annotations table for the parent dataset.

This is a sort-of edge case for now, but I'll try to resolve this next week!

Fixed this with a back-end and front-end check
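
The back-end check mentioned above could look roughly like the sketch below. It is not the code from this PR: `validate_annotation_label`, `existing_columns`, and `registered_annotation_labels` are hypothetical names, and the rule (reject a label that matches a dataset column unless it is already a registered annotation field) only follows the approach described in the reply above.

```python
def validate_annotation_label(label, existing_columns, registered_annotation_labels):
    """
    Decide whether `label` is safe to write to a dataset as an annotation field.

    All three parameters are hypothetical stand-ins:
    - existing_columns: column names currently present in the dataset
    - registered_annotation_labels: labels already stored as annotation fields
    """
    label = label.strip()
    if not label:
        # Empty labels cannot become column names
        return False

    if label in registered_annotation_labels:
        # Updating an annotation field that was written earlier is allowed
        return True

    # A new label must not shadow a column from the original data
    return label not in existing_columns


columns = {"id", "author", "body", "timestamp", "sentiment"}
annotation_labels = {"sentiment"}

for candidate in ("author", "sentiment", "stance"):
    print(candidate, validate_annotation_label(candidate, columns, annotation_labels))
```

With a check like this, a clean label such as "stance" is accepted without a prefix, while "author" is refused because it would overwrite source data.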

dale-wahl · 2024-04-24T09:01:22Z

And this is the issue with a Telegram dataset I ran into:

File "/opt/venv/lib/python3.8/site-packages/flask/templating.py", line 151, in render_template
2024-04-23 23:46:05     return _render(app, template, context)
2024-04-23 23:46:05   File "/opt/venv/lib/python3.8/site-packages/flask/templating.py", line 132, in _render
2024-04-23 23:46:05     rv = template.render(context)
2024-04-23 23:46:05   File "/opt/venv/lib/python3.8/site-packages/jinja2/environment.py", line 1301, in render
2024-04-23 23:46:05     self.environment.handle_exception()
2024-04-23 23:46:05   File "/opt/venv/lib/python3.8/site-packages/jinja2/environment.py", line 936, in handle_exception
2024-04-23 23:46:05     raise rewrite_traceback_stack(source=source)
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/explorer/explorer.html", line 1, in top-level template code
2024-04-23 23:46:05     {% extends "layout.html" %}
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/layout.html", line 71, in top-level template code
2024-04-23 23:46:05     {% block body %}
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/explorer/explorer.html", line 40, in block 'body'
2024-04-23 23:46:05     {% include "explorer/post.html" %}
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/explorer/post.html", line 12, in top-level template code
2024-04-23 23:46:05     {% include "explorer/datasource-templates/generic.html" %}
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/explorer/datasource-templates/generic.html", line 122, in top-level template code
2024-04-23 23:46:05     <i class="fa-solid fa-comment"></i> {{ fields.comments | commafy }}
2024-04-23 23:46:05   File "/usr/src/app/webtool/lib/template_filters.py", line 64, in _jinja2_filter_commafy
2024-04-23 23:46:05     number = int(number)
2024-04-23 23:46:05 ValueError: invalid literal for int() with base 10: '👍👍👍👍👍❤🏆🆒'

Looks like perhaps the emojis are killing the template. Telegram, I think, is the only datasource using them.
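
One way to keep the template from crashing on values like the emoji string above is a filter that falls back to the raw value when it cannot be parsed as an integer. This is a minimal sketch, not the actual `_jinja2_filter_commafy` from `webtool/lib/template_filters.py`; the name `commafy_safe` and the fallback behaviour are assumptions. The commit list in this thread shows the PR took a different route ("Don't commafy post body in generic Explorer template").

```python
def commafy_safe(value):
    """
    Format integer-like values with thousands separators.

    Anything that int() rejects (emoji strings, None, "12.5") is returned
    unchanged, so the template renders it as-is instead of raising.
    """
    try:
        return f"{int(value):,}"
    except (ValueError, TypeError):
        return value


print(commafy_safe("1234567"))
print(commafy_safe("👍👍👍👍👍❤🏆🆒"))
```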

sal-uva · 2024-10-15T09:53:32Z

@stijn-uva this should be mergeable! Perhaps we want the OpenAI processor on a different branch for now since I'd like it to have some more features. But it works in its current state, so it also doesn't hurt to have it on master.

stijn-uva

Got an error while trying to save annotations, so I couldn't test everything:

ERROR:webtool:Exception on /explorer/save_annotations/a565f95306bcf5182adc023aa82a8d59 [POST]
Traceback (most recent call last):
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask_limiter/extension.py", line 544, in __inner
    return obj(*a, **k)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask_login/utils.py", line 290, in decorated_view
    return current_app.ensure_sync(func)(*args, **kwargs)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/webtool/lib/helpers.py", line 332, in decorated_view
    return func(*args, **kwargs)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/webtool/lib/helpers.py", line 332, in decorated_view
    return func(*args, **kwargs)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/webtool/views/views_explorer.py", line 214, in explorer_save_annotations
    annotations_saved = dataset.save_annotations(annotations, overwrite=True)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/common/lib/dataset.py", line 1787, in save_annotations
    annotation = Annotation(data=annotation_data, db=self.db)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/common/lib/annotation.py", line 148, in __init__
    self.write_to_db()
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/common/lib/annotation.py", line 223, in write_to_db
    return self.db.upsert("annotations", data=db_data, constraints=["label", "dataset", "item_id"])
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/common/lib/database.py", line 270, in upsert
    cursor.execute(query, replacements)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/psycopg2/extras.py", line 236, in execute
    return super().execute(query, vars)
psycopg2.errors.InvalidColumnReference: there is no unique or exclusion constraint matching the ON CONFLICT specification

Some other things I noted:

I kind of liked the always-in-view toolbar in the old annotations layout, so I could scroll down the page and then open annotations, instead of having to scroll back up to do so
The 'save annotations' button saves so quickly that it's sometimes hard to know if the annotations were actually saved. Maybe an 'Annotations saved!' tooltip-like notice that fades away after a second would be useful
Maybe annotation fields should be shown by default after saving/editing fields? You would usually want to see them anyway. I could even see a case for not hiding them, ever, and just making them always visible (since annotation is one of the main purposes of the explorer view anyway)
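
On the `InvalidColumnReference` error in the traceback above: PostgreSQL raises it when an `ON CONFLICT (...)` target has no unique index or constraint on exactly those columns, so the annotations table in this environment likely lacked a unique index on (label, dataset, item_id), for example because a migration had not run. The sketch below reproduces the same requirement with Python's built-in `sqlite3` rather than PostgreSQL and psycopg2; the `value` column and the index name are assumptions, and only the three constraint columns come from the traceback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotations (label TEXT, dataset TEXT, item_id TEXT, value TEXT)"
)

upsert = (
    "INSERT INTO annotations (label, dataset, item_id, value) VALUES (?, ?, ?, ?) "
    "ON CONFLICT (label, dataset, item_id) DO UPDATE SET value = excluded.value"
)

# Without a matching unique index the upsert is rejected outright
try:
    conn.execute(upsert, ("sentiment", "abc", "1", "positive"))
    failed_without_index = False
except sqlite3.OperationalError:
    failed_without_index = True

# With the unique index in place, the second call updates the existing row
conn.execute(
    "CREATE UNIQUE INDEX annotations_unique ON annotations (label, dataset, item_id)"
)
conn.execute(upsert, ("sentiment", "abc", "1", "positive"))
conn.execute(upsert, ("sentiment", "abc", "1", "negative"))

rows = conn.execute("SELECT label, dataset, item_id, value FROM annotations").fetchall()
print(failed_without_index, rows)
```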

stijn-uva · 2024-12-11T12:03:02Z

datasources/instagram/search_instagram.py

+        # Instagram posts also allow 'Collabs' with up to one co-author
+        coauthor = {"coauthor": "", "coauthor_fullname": "", "coauthor_id": ""}
+        if node.get("coauthor_producers"):
+            coauthor_node = node["coauthor_producers"][0]


Instagram posts can have multiple co-authors, could we e.g. make this a comma-separated list? Else we give the impression that there is less data than there really is
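
A comma-separated mapping along the lines suggested here might look like the sketch below. The function name `map_coauthors` and the per-producer keys `username`, `full_name`, and `id` are assumptions about the node structure; only `coauthor_producers` appears in the diff above.

```python
def map_coauthors(node):
    """
    Collapse all co-authors of a post into comma-separated strings.

    Returns empty strings when the post has no co-authors, so the output
    columns are always present.
    """
    producers = node.get("coauthor_producers") or []
    return {
        "coauthors": ",".join(p.get("username", "") for p in producers),
        "coauthor_fullnames": ",".join(p.get("full_name", "") for p in producers),
        "coauthor_ids": ",".join(str(p.get("id", "")) for p in producers),
    }


example_node = {
    "coauthor_producers": [
        {"username": "first_user", "full_name": "First User", "id": "1"},
        {"username": "second_user", "full_name": "Second User", "id": "2"},
    ]
}
print(map_coauthors(example_node))
```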

stijn-uva · 2024-12-11T12:05:26Z

processors/machine_learning/gpt.py

+		except ValueError:
+			self.dataset.finish_with_error("Max tokens must be a number")
+
+		self.dataset.delete_parameter("api_key")  # sensitive, delete after use


Better to set "sensitive": True for this option, then this is taken care of internally
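
To illustrate the idea behind this suggestion, the sketch below marks an option as sensitive in its definition and drops such values in one central place. It is not 4CAT's internal implementation: the plain-string `type` values, the help texts, and the helper `strip_sensitive` are hypothetical, and only the `"sensitive": True` flag and the `api_key` option name come from this review.

```python
# Hypothetical option definitions; the real processor uses UserInput types
options = {
    "api_key": {"type": "string", "help": "API key", "sensitive": True},
    "max_tokens": {"type": "string", "help": "Max tokens", "default": "50"},
}


def strip_sensitive(parameters, option_definitions):
    """Return a copy of `parameters` without values whose option is flagged sensitive."""
    return {
        name: value
        for name, value in parameters.items()
        if not option_definitions.get(name, {}).get("sensitive")
    }


submitted = {"api_key": "secret-value", "max_tokens": "100"}
print(strip_sensitive(submitted, options))
```

Handling the flag centrally means individual processors no longer need their own `delete_parameter` call for every secret.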

stijn-uva · 2024-12-11T12:22:28Z

common/lib/dataset.py

+		annotations = []
+
+		# Get annotation IDs first
+		if item_id:


I see that the Annotation constructor can also take a data dictionary to avoid having to re-query the database - could that be used here instead of first querying only the id and then again the full annotation data when the object is instantiated on line 1681?

stijn-uva · 2024-12-11T12:33:55Z

common/lib/dataset.py

+
+		# We're saving the new annotation fields as-is.
+		# Ordering of fields is preserved this way.
+		self.db.execute("UPDATE datasets SET annotation_fields = %s WHERE key = %s;", (json.dumps(new_fields), self.key))


Consider using self.db.update or relying on DataSet.__setattr__ i.e. just set self.annotation_fields and let the class handle the rest.

stijn-uva · 2024-12-11T12:34:41Z

webtool/lib/template_filters.py

+	# Base URLs after which tags and @-mentions follow, per platform
+	base_urls = {
+		"twitter": {
+			"hashtag": "https://twitter.com/hashtag/",


x.com? Seems to be the canonical domain now

sal-uva added 23 commits April 8, 2024 12:46

First commit :)

1d3f587

Use regular iterate_items method when looping through dataset + min…

c4a4606

…or changes

Change wording in Explorer settings

cac644e

Allow Explorer CSS to be inserted and changed in Settings

8b78452

Move around Explorer CSS files

0fe3ea6

Edit custom Explorer CSS options

e06760a

Forgot to save these

a921967

Typozzz

e37ebc9

First setup for dynamic Explorer options in Settings

a7668f0

First steps to datasource table user input

59e33b0

Merge branch 'explorer-improvements' of https://github.com/digitalmet…

712aff1

…hodsinitiative/4cat into explorer-improvements # Conflicts: # common/lib/config_definition.py

Add basic UserInput.DATASOURCES_TABLE functionality, and use in Explo…

46628c6

…rer settings page

Simplify config setting name

340d1ff

Only show Explorer when enabled per data source

dfbe5f3

First steps in integrating the Explorer more with the main interface

28abb42

First steps in bringing back sorting

70d00b1

More sorting stuff

e937362

Fix and simplify sorting, control box styling

11eaaf9

Style and fix annotation field editor, enable config settings for CSS

c33fd72

Fix annotation saving, improve CSS inclusions

00993ed

Make sure annotations are kept in NDJSON and CSV, change custom field…

7149a6d

…s functionality, start Instagram template

Improve Instagram template, add location fields to Instagram search

93460ba

Simplify template settings, add Twitter and Instagram template

6929986

sal-uva requested a review from dale-wahl April 23, 2024 14:59

sal-uva added 3 commits April 23, 2024 17:57

Merge remote-tracking branch 'origin/master' into explorer-improvements

7d81a79

Remove prints

08735b7

Don't prepend 'annotations'

759b36a

dale-wahl reviewed Apr 24, 2024

Don't commafy post body in generic Explorer template

0cf2ccd

sal-uva and others added 27 commits August 30, 2024 19:28

Don't print in perspective processor

4fcafb9

Move perspective processor

9d322df

Typo in Tumblr search

057dd14

Add GPT processor

05e52c0

Fix GPT processor and make it compatible with any NDJSON or CSV file

cfc1e0e

Don't fail migrate script in edge case when new annotations table is …

b7ba19a

…already created

Improve GPT Prompting processor, add some more error handling and fri…

94f415d

…ction

Don't save annotations when no changes are made.

a75c64d

Space out tweets better in Explorer template

ae0712b

Merge branch 'explorer-improvements' of https://github.com/digitalmet…

8b63121

…hodsinitiative/4cat into explorer-improvements

No spacy

2bbb83c

Merge remote-tracking branch 'origin/explorer-improvements' into expl…

e0bbb83

…orer-improvements

Rename Explorer migrate script

141b7af

Merge remote-tracking branch 'origin/master' into explorer-improvements

20925a9

# Conflicts: # setup.py # webtool/__init__.py # webtool/lib/template_filters.py # webtool/templates/components/result-result-row.html

Merge branch 'refs/heads/master' into explorer-improvements

1f9e1c3

Include openai library in setup

09d1551

Fix merge errors in result-result-row.html

b914687

Add option to add a custom (fine-tuned) model to GPT processor.

4d3cc1c

Change wording and compatibility of Annotation metadata processor

e07d748

Change settings and wording for OpenAI LLM processor

fa4d236

Allow Google API key in config

754b339

Better error handling for Perspective processor

e37db4f

Strong no longer

c6f587a

Get rid of some unncessary code in Perspective processor

f4b108e

Merge branch 'master' into explorer-improvements

024875e

# Conflicts: # common/lib/dataset.py # webtool/views/api_explorer.py

Fix merge issues in Explorer API and Telegram search

b90bb3e

Merge branch 'master' into explorer-improvements

e5dba0e

Merge branch 'master' into explorer-improvements

ba0bd6e

stijn-uva reviewed Dec 11, 2024
