Commit ea59ef9: Update README.md
principle105 authored Oct 28, 2022 (1 parent: e5e1e3f)
Showing 1 changed file with 8 additions and 77 deletions (README.md).
This package can be installed from [PyPI](https://pypi.org/project/thomasthechatbot/):

```
pip install thomasthechatbot
```

## CLI

Type `ttc` to begin talking to Thomas.


# Usage

## Basic Usage
```
chatbot = Chatbot(
    ...
)
```


# How does Thomas work?

Thomas has no hard-coded responses and is designed to “learn” as he is spoken to. I wrote a [medium article](https://medium.com/@principle105/creating-a-python-chatbot-that-learns-as-you-speak-to-it-60b305d8f68f) that explains how Thomas works in more depth.

Note: I created this approach based on my intuition rather than a proven method.

## Data Storage

Thomas does not come up with his own responses; he reiterates responses that he has seen before.

### Responses

Previous responses are stored in `resps.json` as a dictionary where the key is a generated [UUID](https://docs.python.org/3/library/uuid.html) and the value is the tokenized response.
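As a minimal sketch (the names here are illustrative, not the package's actual internals), the response store can be modeled as:

```python
import json
import uuid

# Illustrative model of resps.json: generated UUID -> tokenized response.
resps: dict[str, list[str]] = {}

def store_response(tokens: list[str]) -> str:
    """Store a tokenized response under a freshly generated UUID."""
    resp_id = str(uuid.uuid4())
    resps[resp_id] = tokens
    return resp_id

resp_id = store_response(["hello", "there"])
print(json.dumps(resps, indent=2))
```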

### Mesh

Prompts are associated with responses through a "mesh", which is stored in `mesh.json`. The mesh is a dictionary where the key is the UUID of the prompt and the value is a "link". Links associate responses with patterns of words and have the following attributes:

`stop_words: set`
Stop words separated from the tokenized prompt.

`keywords: set`
The remaining words, lemmatized by their part of speech.

`resps: dict[str, set]`
Responses to the prompt where the key is the response UUID and the value is a set of mesh ids from the previous prompt.
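A hypothetical Python sketch of a link, with field names mirroring the attributes above (this is not the package's actual class):

```python
from dataclasses import dataclass, field

@dataclass
class Link:
    # Stop words separated from the tokenized prompt.
    stop_words: set
    # Remaining words, lemmatized by their part of speech.
    keywords: set
    # Response UUID -> set of mesh ids from the previous prompt.
    resps: dict = field(default_factory=dict)

# mesh.json then maps a prompt UUID to its link (ids shortened for illustration):
mesh = {
    "3f2b": Link(stop_words={"how", "are"}, keywords={"doing"}, resps={"a1c9": {"77de"}}),
}
print(mesh["3f2b"].keywords)
```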

## Querying Responses

### Tokenizing Prompts

Before tokenization, prompts are lowercased, contractions are expanded, and punctuation is removed. This improves the consistency and accuracy of queries. Prompts are then tokenized by word and split into key words and stop words.
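A minimal sketch of this preprocessing, using a tiny illustrative contraction map and stop word list (the real package's word lists will differ):

```python
import string

# Tiny illustrative subsets; assumptions, not the package's actual resources.
CONTRACTIONS = {"don't": "do not", "i'm": "i am"}
STOP_WORDS = {"do", "not", "i", "am", "how", "are", "you", "the", "a"}

def tokenize(prompt: str) -> tuple[list, list]:
    prompt = prompt.lower()
    # Expand contractions before stripping punctuation, so apostrophes survive
    # long enough to be matched.
    for contraction, expansion in CONTRACTIONS.items():
        prompt = prompt.replace(contraction, expansion)
    prompt = prompt.translate(str.maketrans("", "", string.punctuation))
    tokens = prompt.split()
    keywords = [t for t in tokens if t not in STOP_WORDS]
    stop_words = [t for t in tokens if t in STOP_WORDS]
    return keywords, stop_words

print(tokenize("I'm not sure, how are you?"))  # (['sure'], ['i', 'am', 'not', 'how', 'are', 'you'])
```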

### Ignoring Responses

The user's current prompt and the chatbot's previous response are excluded as candidate responses to prevent the chatbot from appearing repetitive.

### Initial Query

Meshes are initially queried by their score which can be calculated with:

`(ss / 2 + sk) / (ts / 2 + tk - ss / 2 - sk + 1)`

where:

- `ss` = shared stop words
- `sk` = shared key words
- `ts` = total stop words
- `tk` = total key words

This formula weighs shared key words twice as heavily as shared stop words by halving the stop word counts (`ss` and `ts`). It also takes the total number of words into account, so more precise meshes are favoured.
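As a direct translation of the formula:

```python
def mesh_score(ss: int, sk: int, ts: int, tk: int) -> float:
    """Score a mesh: shared words over non-shared words, with stop word
    counts halved so key words weigh twice as much."""
    return (ss / 2 + sk) / (ts / 2 + tk - ss / 2 - sk + 1)

# A mesh that shares all of its words with the prompt scores highest:
print(mesh_score(ss=2, sk=2, ts=2, tk=2))  # 3.0
print(mesh_score(ss=0, sk=1, ts=0, tk=1))  # 1.0
```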

### First Discard

Meshes with scores below a threshold (`min_score`) are discarded.

### No Results Queried

If no results remain, meshes are queried by the number of shared stop words.

### Second Discard

The remaining meshes are sorted and meshes that fall below a percentage threshold (`score_threshold`) of the best score are discarded. Considering multiple meshes increases the variety of responses.
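The two discard steps together might look like this sketch (`min_score` and `score_threshold` are the configurable thresholds named above; the values are illustrative):

```python
def query_meshes(scored: dict, min_score: float = 0.1, score_threshold: float = 0.5) -> list:
    """scored: mesh id -> score. Returns surviving mesh ids, best first."""
    # First discard: drop meshes below the absolute threshold.
    scored = {m: s for m, s in scored.items() if s >= min_score}
    if not scored:
        return []  # caller falls back to querying by shared stop words
    # Second discard: keep meshes within score_threshold of the best score.
    best = max(scored.values())
    ranked = sorted(scored.items(), key=lambda item: -item[1])
    return [m for m, s in ranked if s >= best * score_threshold]

print(query_meshes({"a": 0.9, "b": 0.5, "c": 0.05}))  # ['a', 'b']
```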

### Mesh Association

Meshes are associated with each other by the percentage of shared responses (`mesh_association`). Associated meshes for each queried mesh are found and added to the list. This process prevents less trained prompts from having a small response pool.
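The text does not pin down the exact ratio, so this sketch assumes the association percentage is measured relative to the queried mesh's own response set:

```python
def association(resps_a: dict, resps_b: dict) -> float:
    """Fraction of mesh A's responses that mesh B also contains (an assumption)."""
    shared = resps_a.keys() & resps_b.keys()
    return len(shared) / len(resps_a)

a = {"r1": set(), "r2": set()}
b = {"r2": set(), "r3": set()}
# Meshes would be treated as associated when this exceeds mesh_association.
print(association(a, b))  # 0.5
```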

### Choosing a Response

If any responses share the same previous-message UUID as the prompt, all non-sharing responses are removed. A response is then chosen at random from those remaining; random selection prevents the chatbot from being predictable.
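A sketch of this selection step, assuming `resps` has the `dict[str, set]` shape described under "Mesh" above:

```python
import random

def choose_response(resps: dict, prev_mesh_id: str) -> str:
    """Prefer responses whose history contains the prompt's previous-message
    mesh id; pick at random among whatever remains."""
    sharing = [r for r, prev_ids in resps.items() if prev_mesh_id in prev_ids]
    return random.choice(sharing or list(resps))

candidates = {"r1": {"m1"}, "r2": {"m2"}, "r3": {"m1"}}
print(choose_response(candidates, "m1"))  # "r1" or "r3", chosen at random
```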

# Contributing

Contributions are welcome! Please create an issue if you would like to contribute.