Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explorer revamp #428

Open
wants to merge 219 commits into
base: master
Choose a base branch
from
Open

Explorer revamp #428

wants to merge 219 commits into from

Conversation

sal-uva
Copy link
Collaborator

@sal-uva sal-uva commented Apr 23, 2024

This PR revamps how the Explorer works and looks. It specifically does the following:

  • Adds a new OPTION_DATASOURCES_TABLE user input that creates a table with dynamic columns for each enabled dataset. Input fields per row can be text, dropdown, and checkbox
  • Uses this table to create a new Settings page where the Explorer can be enabled per data source (more table options can be added later).
  • Simplifies how custom data source templates for the Explorer are handled: they are now composed of CSS files (in static/css/explorer/) and Jinja2 templates (in webtool/templates/explorer/datasource-templates/) instead of CSS and JSON files in the data source folders that need to be verified and parsed.
  • Integrates the Explorer with the UI of 4CAT.
  • Makes the Explorer use iterate_items.
  • Re-integrate sorting so that all dataset columns can be used for sorting in the Explorer.
  • Enable new annotation columns for for sorting, filtering, and other features.
  • Adds Explorer templates for Twitter and Instagram (other data sources will soon follow).
  • Deletes much unnecessary code.

Note that some unused code is still present for future updates with respect to 4CAT scrapers and database-accessible data sources generally.

sal-uva added 23 commits April 8, 2024 12:46
…hodsinitiative/4cat into explorer-improvements

# Conflicts:
#	common/lib/config_definition.py
@sal-uva sal-uva requested a review from dale-wahl April 23, 2024 14:59
Copy link
Member

@dale-wahl dale-wahl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked over the backend code and only noticed one real issue (in an edge case). I ran this version and tested out the Explorer on a number of datasets (instagram, custom, telegram, tumblr, tiktok, youtube, reddit). It looks good! Sort works well. Reddit was missing the "subject" field (it's probably the only dataset that uses subject anymore). Telegram has an issue which I will post separately.

I tested saving annotations and writing them to datasets. This worked for me (and broke one with my edge case 😬; see comment). I did notice that the new fields show up in the Dataset preview view, but the values saved to the database do not show up in preview. The values do show up after you have run "write annotations".

Changing deactivating/activating settings seem to work fine. There is an explorerflask settings group that could probably be merged with the Explorer group.

If you want to merge now, I would deactivate Telegram as a default (till addressed) and consider how to address my comment re: field names for annotations.

backend/lib/processor.py Outdated Show resolved Hide resolved
@@ -8,6 +8,7 @@

from backend.lib.search import Search
from common.lib.item_mapping import MappedItem
from common.lib.helpers import UserInput
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're importing UserInput in a few datasources unnecessarily. Probably an oversight and otherwise has no effect.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, we should do some cleanup..

@@ -101,7 +102,7 @@ def process(self):

# Write to top dataset
for label, values in new_data.items():
self.add_field_to_parent("annotation_" + label, values, which_parent=self.source_dataset, update_existing=True)
self.add_field_to_parent(label, values, which_parent=self.source_dataset, update_existing=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The add_field_to_parent function does not check for existing fields. If a User creates a field called "username" they will overwrite an existing field with the same name. If I recall, I could not figure out how to check that an existing column had the name because the add_field_to_parent function needs to be able to update existing annotation fields. This is just a bit dangerous.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested on a dataset by creating a field called "author", adding some values, and writing to dataset. I was able to overwrite the original "author" field (which in my case was actually a dictionary of author related data which caused map item to break). I recommend reverting this change. We could even add 4CAT_annotation_ or something so that it would be virtually impossible for raw data to contain that fieldname.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed an oversight for now, though I would like to have the option for annotation fields to have a 'clean' name; long names are quickly unreadable in spreadsheet software. This can be resolved by initially checking whether an annotation field key already exists in the dataset columns or, when annotated datasets are filtered and create a new dataset, if it is not a field registered in the annotations table for the parent dataset.

This is a sort-of edge case for now, but I'll try to resolve this next week!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this with a back-end and front-end check

@dale-wahl
Copy link
Member

And this is the issue with a Telegram dataset I ran into:

File "/opt/venv/lib/python3.8/site-packages/flask/templating.py", line 151, in render_template
2024-04-23 23:46:05     return _render(app, template, context)
2024-04-23 23:46:05   File "/opt/venv/lib/python3.8/site-packages/flask/templating.py", line 132, in _render
2024-04-23 23:46:05     rv = template.render(context)
2024-04-23 23:46:05   File "/opt/venv/lib/python3.8/site-packages/jinja2/environment.py", line 1301, in render
2024-04-23 23:46:05     self.environment.handle_exception()
2024-04-23 23:46:05   File "/opt/venv/lib/python3.8/site-packages/jinja2/environment.py", line 936, in handle_exception
2024-04-23 23:46:05     raise rewrite_traceback_stack(source=source)
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/explorer/explorer.html", line 1, in top-level template code
2024-04-23 23:46:05     {% extends "layout.html" %}
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/layout.html", line 71, in top-level template code
2024-04-23 23:46:05     {% block body %}
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/explorer/explorer.html", line 40, in block 'body'
2024-04-23 23:46:05     {% include "explorer/post.html" %}
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/explorer/post.html", line 12, in top-level template code
2024-04-23 23:46:05     {% include "explorer/datasource-templates/generic.html" %}
2024-04-23 23:46:05   File "/usr/src/app/webtool/templates/explorer/datasource-templates/generic.html", line 122, in top-level template code
2024-04-23 23:46:05     <i class="fa-solid fa-comment"></i> {{ fields.comments | commafy }}
2024-04-23 23:46:05   File "/usr/src/app/webtool/lib/template_filters.py", line 64, in _jinja2_filter_commafy
2024-04-23 23:46:05     number = int(number)
2024-04-23 23:46:05 ValueError: invalid literal for int() with base 10: '👍👍👍👍👍❤🏆🆒'

Looks like perhaps the emojis are killing the template. Telegram, I think, is the only datasource using them.

sal-uva and others added 27 commits August 30, 2024 19:28
# Conflicts:
#	setup.py
#	webtool/__init__.py
#	webtool/lib/template_filters.py
#	webtool/templates/components/result-result-row.html
# Conflicts:
#	common/lib/dataset.py
#	webtool/views/api_explorer.py
@sal-uva
Copy link
Collaborator Author

sal-uva commented Oct 15, 2024

@stijn-uva this should be mergeable! Perhaps we want the OpenAI processor on a different branch for now since I'd like it to have some more features. But it works in its current state so also doesn't hurt to have it on master..

Copy link
Member

@stijn-uva stijn-uva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got an error while trying to save annotations, so I couldn't test everything:

ERROR:webtool:Exception on /explorer/save_annotations/a565f95306bcf5182adc023aa82a8d59 [POST]
Traceback (most recent call last):
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask/app.py", line 2190, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask/app.py", line 1486, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask_limiter/extension.py", line 544, in __inner
    return obj(*a, **k)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/flask_login/utils.py", line 290, in decorated_view
    return current_app.ensure_sync(func)(*args, **kwargs)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/webtool/lib/helpers.py", line 332, in decorated_view
    return func(*args, **kwargs)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/webtool/lib/helpers.py", line 332, in decorated_view
    return func(*args, **kwargs)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/webtool/views/views_explorer.py", line 214, in explorer_save_annotations
    annotations_saved = dataset.save_annotations(annotations, overwrite=True)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/common/lib/dataset.py", line 1787, in save_annotations
    annotation = Annotation(data=annotation_data, db=self.db)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/common/lib/annotation.py", line 148, in __init__
    self.write_to_db()
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/common/lib/annotation.py", line 223, in write_to_db
    return self.db.upsert("annotations", data=db_data, constraints=["label", "dataset", "item_id"])
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/common/lib/database.py", line 270, in upsert
    cursor.execute(query, replacements)
  File "/Users/stijn/surfdrive/PycharmProjects/4cat/venv/lib/python3.9/site-packages/psycopg2/extras.py", line 236, in execute
    return super().execute(query, vars)
psycopg2.errors.InvalidColumnReference: there is no unique or exclusion constraint matching the ON CONFLICT specification

Some other things I noted:

  • I kind of liked the always-in-view toolbar in the old annotations layout, so I could scroll down the page and then open annotations, instead of having to scroll back up to do so
  • The 'save annotations' button saves so quickly that it's sometimes hard to know if the annotations were actually saved. Maybe a 'Annotations saved!' tooltip-like notice that fades away after a second would be useful
  • Maybe annotation fields should be shown by default after saving/editing fields? You would usually want to see them anyway. I could even see a case for not hiding them, ever, and just making them always visible (since annotation is one of the main purposes of the explorer view anyway)

# Instagram posts also allow 'Collabs' with up to one co-author
coauthor = {"coauthor": "", "coauthor_fullname": "", "coauthor_id": ""}
if node.get("coauthor_producers"):
coauthor_node = node["coauthor_producers"][0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instagram posts can have multiple co-authors, could we e.g. make this a comma-separated list? Else we give the impression that there is less data than there really is

except ValueError:
self.dataset.finish_with_error("Max tokens must be a number")

self.dataset.delete_parameter("api_key") # sensitive, delete after use
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to set "sensitive": True for this option, then this is taken care of internally

annotations = []

# Get annotation IDs first
if item_id:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the Annotation constructor can also take a data dictionary to avoid having to re-query the database - could that be used here instead of first querying only the id and then again the full annotation data when the object is instantiated on line 1681?


# We're saving the new annotation fields as-is.
# Ordering of fields is preserved this way.
self.db.execute("UPDATE datasets SET annotation_fields = %s WHERE key = %s;", (json.dumps(new_fields), self.key))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using self.db.update or relying on DataSet.__setattr__ i.e. just set self.annotation_fields and let the class handle the rest.

# Base URLs after which tags and @-mentions follow, per platform
base_urls = {
"twitter": {
"hashtag": "https://twitter.com/hashtag/",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x.com? Seems to be the canonical domain now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants