Getting Started

David Jurgens edited this page Jun 20, 2021 · 22 revisions

Step 0: Download Potato

git clone https://github.com/davidjurgens/potato.git

Step 1: Set up your data for annotation

Potato supports JSON, CSV, and TSV files as input. JSON input must be in JSON Lines format (one instance per line). Potato needs two specific things for each item to annotate:

  • Some column/key that can serve as a unique id for the instance
  • Some column/key with the text for the instance

If you have data already prepared, you do not need to change the column or key names in your input files.
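
As a rough illustration of the JSON Lines input format described above, the sketch below serializes a list of instances with one JSON object per line and checks that each instance has an id and a text field. The helper name `to_jsonlines`, the key names `id` and `text`, and the sample instances are all made up for this example; they are not part of Potato.

```python
import json

def to_jsonlines(instances, id_key="id", text_key="text"):
    """Serialize instances as JSON Lines: one JSON object per line."""
    lines = []
    for instance in instances:
        # Every instance needs both an id field and a text field.
        for key in (id_key, text_key):
            if key not in instance:
                raise KeyError(f"instance is missing required key: {key!r}")
        lines.append(json.dumps(instance, ensure_ascii=False) + "\n")
    return "".join(lines)

# Hypothetical sample data, for illustration only.
sample = [
    {"id": "1", "text": "First message to annotate"},
    {"id": "2", "text": "Second message to annotate"},
]
jsonl = to_jsonlines(sample)
print(jsonl, end="")
```

Writing `jsonl` to a file would give an input file in the one-instance-per-line shape described in this step.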

Step 2: Configure your Potato instance

Potato is completely driven by a config file that you will design. We've provided an example yaml file with some sensible defaults to get you started. Copy that file and we'll fill it in:

cd $POTATO_HOME # wherever you've cloned potato
cp templates/examples/empty-to-copy.yaml myconfig.yaml

There are many configuration options (described on the Configuration File wiki) but we'll go through the minimum needed to get up and running.

Specify your data

In your myconfig.yaml file, under item_properties, you will specify which columns or keys in the input file hold each item's id and text. For example, if annotating Twitter messages from the streaming API (one JSON tweet per line), our config file might have the following:

"item_properties": {
   "id_key": "id_str",
   "text_key": "text",
}

Note that the id_key value must be a string (e.g., you can't use int values as ids).

Once you have your data specified, update the data_files field in your myconfig.yaml to contain the file(s) with your data. For example, our config might have:

"data_files": [
    "my-data/tweets-from-tues.json",
    "some-other-data/bonus-data/tweets-from-mon.tsv"
],

All data files need to have the same keys for their id and text, but you can mix formats (e.g., json/tsv).
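
Before launching, it can help to confirm that every data file really does share the same id and text keys and that ids are strings. The following is a minimal sketch of such a check, not part of Potato: the function names `read_instances` and `check_data_files` are hypothetical, it assumes files are told apart by their extension, and it reads the string requirement as applying to each instance's id value.

```python
import csv
import json
import os
import tempfile

def read_instances(path):
    """Load instances from a .json (JSON Lines), .csv, or .tsv file."""
    ext = os.path.splitext(path)[1].lower()
    with open(path, newline="", encoding="utf-8") as f:
        if ext == ".json":
            return [json.loads(line) for line in f if line.strip()]
        delimiter = "\t" if ext == ".tsv" else ","
        return list(csv.DictReader(f, delimiter=delimiter))

def check_data_files(paths, id_key, text_key):
    """Return a list of problems; an empty list means the files look usable."""
    problems = []
    for path in paths:
        for n, instance in enumerate(read_instances(path), start=1):
            for key in (id_key, text_key):
                if key not in instance:
                    problems.append(f"{path}, instance {n}: missing {key!r}")
            if id_key in instance and not isinstance(instance[id_key], str):
                problems.append(f"{path}, instance {n}: id is not a string")
    return problems

# Hypothetical demo files, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, "tweets.json")
    tsv_path = os.path.join(tmp, "tweets.tsv")
    with open(json_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"id_str": "101", "text": "hello"}) + "\n")
        f.write(json.dumps({"id_str": 102, "text": "integer id"}) + "\n")
    with open(tsv_path, "w", encoding="utf-8", newline="") as f:
        f.write("id_str\ttext\n103\tfrom a tsv file\n")
    problems = check_data_files([json_path, tsv_path], "id_str", "text")

print(problems)
```

In the demo, only the second JSON instance is reported, because its id is an integer. Values read from csv and tsv files are always strings, so they pass the check as long as the column exists.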

(Optional) Choose a custom HTML template

If you're just aiming to annotate a piece of text with a single scale (or perhaps two), you don't need to edit anything here or change the config file from its default:

    "html_layout": "templates/examples/plain_layout.html",

The default setting uses plain_layout.html, which places the item's text at the top and the annotation task below. This will look fine for most simple tasks.

If you need to get fancier, see the HTML Layout guide for how to customize the visual appearance of your task. This will let you move elements around, add custom non-annotation content to the display, or build more complex designs such as annotating pairs of items.

Set up your annotation task

Potato supports annotating text with one or more annotation types, e.g., Likert scales, multiple choice, or best-worst scaling. In your config file, you'll specify which kind of annotation task(s) you want to use. Specifically, you'll set the annotation_schemes field in myconfig.yaml, which holds the list of all the ways in which you're annotating data. Each annotation scheme is a yaml object that specifies a type of annotation. For example, let's say we wanted to annotate our tweets for offensiveness using a Likert scale. The entry in annotation_schemes in our myconfig.yaml file might look like this:

        {
            "annotation_type": "likert",

            # This name gets used in reporting the annotation results
            "name": "offensiveness",

            # This text is shown to the user and can be a longer statement
            "description": "How offensive is this message?",

            # The min and max labels are text shown at each end of the scale
            "min_label": "Inoffensive",
            "max_label": "Very Offensive",

            # How many scale points to show
            "size": 5,

            # This will bind keys [1:size] to our scale responses.
            "sequential_key_binding": True,
        }       

There are many supported annotation types and the Annotation Schema wiki describes the different types and options for each.
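
To make the sequential_key_binding comment in the example concrete, here is one way to picture the mapping it describes: with a scale of size 5, keys 1 through 5 each select the matching scale point. This is only an illustration of the idea, not Potato's actual code; the function name `sequential_key_bindings` is hypothetical, and the sketch does not say what Potato does for scales with more than nine points.

```python
def sequential_key_bindings(size):
    """Map keyboard keys '1'..str(size) to scale points 1..size."""
    return {str(point): point for point in range(1, size + 1)}

bindings = sequential_key_bindings(5)
print(bindings)  # {'1': 1, '2': 2, '3': 3, '4': 4, '5': 5}
```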

Choose an Output directory and format

Potato will aggregate annotations into a single file, and it will also keep track of per-annotator annotations and state (e.g., the order in which they saw items) in separate directories. You should specify the output directory using the output_annotation_dir field. If you copied over the example, this will have a filled-in value, which you'll want to overwrite.

Potato supports multiple output formats. You can choose one using the output_annotation_format field. If we wanted to store our tweet offensiveness annotations in csv format, our myconfig.yaml file would have the following:

    "output_annotation_format": "csv", 

This output file will have one row for each annotation, including the annotator's username (i.e., their email), the instance id, and the values for each annotation scheme used in the task. In delimited formats, each scheme's name will be a column; in json formats, the name and value will be a key/value pair in the json object.
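
As a sketch of how a csv output with this shape could be consumed, the example below averages a Likert score per instance. Only the scheme-name column (offensiveness) follows from the description above. The column names `user` and `instance_id`, the helper `mean_score_per_instance`, and the sample rows are assumptions made up for illustration, so check the header of your own output file before reusing this.

```python
import csv
import io
from collections import defaultdict

# Made-up rows standing in for an output file; real column names may differ.
sample_csv = """user,instance_id,offensiveness
ann@example.com,101,2
bob@example.com,101,4
ann@example.com,102,1
"""

def mean_score_per_instance(csv_file, id_column, scheme_name):
    """Average the numeric values of one annotation scheme per instance id."""
    scores = defaultdict(list)
    for row in csv.DictReader(csv_file):
        scores[row[id_column]].append(float(row[scheme_name]))
    return {instance_id: sum(v) / len(v) for instance_id, v in scores.items()}

means = mean_score_per_instance(io.StringIO(sample_csv), "instance_id", "offensiveness")
print(means)  # {'101': 3.0, '102': 1.0}
```

To read a real file, pass an open file object (opened with newline="") in place of the io.StringIO wrapper.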

Step 3: Launch Potato

  • If you want to run potato on the Blablablab servers, connect to UM’s VPN and log in to the server you’re using. See Getting Started in the Blablablab handbook for details.
    • If you start potato on the servers, all of your annotators will need to be connected to UM’s VPN in order to access your annotation task.
    • However, the benefit of running the task from the server is that anyone on the UM VPN should be able to access it via the same URL. This is useful if you have multiple annotators.
    • Note: You can also do this on your own personal computer, in which case you don’t need the VPN; however, you’ll only be able to access the annotations from that computer.
  • Start a new screen session from a terminal:
    • Open a terminal, log in to the server if needed, and type screen
    • It may have you read through some terms; press space a few times until you land on an empty page
    • Note: running potato inside a screen ensures that you will be able to continue to access the annotation URL for as long as you want
  • Kick off a session for your annotations, using terminal:
    • Change directory so you’re in the potato-master folder (wherever you put it when downloading): cd path_to_folder/potato-master/
    • Launch potato: python3 potato/flask_server.py config/config_single.yaml
      • Soon changing to: python -m potato config/config_single.yaml
    • Possible errors and what to do:
      • Error: ModuleNotFoundError: No module named 'lib_name'
        • Cause: generally these are python libraries that haven’t been installed on your machine yet
        • Solution: if you have sufficient permissions on your laptop, install the library by typing in terminal pip install lib_name or pip3 install lib_name
      • Error: OSError: [Errno 98] Address already in use
        • Cause: this is generally because the port potato is trying to attach to (8000 by default) is already in use
        • Solution: specify which port to use in your command to launch potato: python3 potato/flask_server.py config/config_single.yaml --port 8000
          • You can change from 8000 to something else
          • You may need to try a few port numbers before you find one that’s free; try larger numbers first, since the system usually allocates ports from smallest to largest
  • Towards the bottom of its output, potato reports the port it is running on (8000 in this case).
  • On any laptop, open your preferred browser (e.g., Chrome) and connect to this port by navigating to the appropriate URL.
    • On your laptop: http://localhost:8000
    • On the servers: http://server_name.si.umich.edu:8000
      • Note: You’ll need to be connected to UM’s VPN in order to access potato on the servers. However, the benefit of running the task from the server is that anyone on the UM VPN should be able to access this URL and, if you’ve added their name to the users list, contribute to the annotations. This is useful if you have multiple annotators.
  • Keep your annotation task alive by detaching from the screen: press Ctrl + a, then d. The annotation tool will remain alive (meaning you can return to it via the URL in the prior bullet) as long as the server is running and you haven’t killed the screen.
    • Note: when you detach, screen shows the name of the session, which you’ll use to reconnect in the future.
    • You shouldn’t ever need to return to the screen to keep using potato. However, if you want to reopen the screen, log in to the server and type screen -r name_of_screen
    • If you’ve forgotten the screen name from the prior step, you can type screen -ls to see a list of screens you have running.
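
For the "Address already in use" error described above, one way to avoid guessing port numbers is to test candidates before launching. The sketch below is a standalone helper, not part of Potato: the function names `port_is_free` and `first_free_port` are hypothetical, and the check only reflects the moment it runs, so another process could still claim the port before potato starts.

```python
import socket

def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use" or a permissions error.
            return False
    return True

def first_free_port(candidates, host="0.0.0.0"):
    """Return the first candidate port that can be bound, or None."""
    for port in candidates:
        if port_is_free(port, host):
            return port
    return None

# Try the default port and a few alternatives above it.
print(first_free_port(range(8000, 8010)))
```

The port this prints can then be passed to potato when you launch it.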

Step 4: Annotate Data

  • On any laptop, open your preferred browser (e.g., Chrome) and navigate to the URL from the prior step.
    • On your laptop: http://localhost:8000
    • On the servers: http://curry.si.umich.edu:8000
  • Enter your first and last name on the first page, when prompted. Remember, only the names you specified in the config_single.yaml file will take you to the next page.
    • Coming soon: you will be able to enter annotators during task setup instead of in the configuration file.
  • XX Coming Soon: list of features, output format

Step 5: Looking at your annotated data

TBD

(Optional) Step 6: Shutting down Potato

  • Once you’re done with the annotations, you can kill the screen, which will kill potato. This is necessary to release the port you were using.
    • Open a terminal and log in to the server if you are running potato there
    • Type screen -X -S name_of_screen kill
      • If you’ve forgotten the screen name from launching potato, you can type screen -ls to see a list of screens you have running.