Homework 1 - text generation #3

Open
sevinjyolchuyeva opened this issue Oct 8, 2017 · 4 comments

Comments

@sevinjyolchuyeva

Could you please explain how to continue with Exercise 3.2 (Define a text generator function)?

word = 'abcabcda'
toy_freqs = count_ngram_freqs("abcabcda", 3)  # {'abc': 2, 'bca': 1, 'cab': 1, 'bcd': 1, 'cda': 1}

How should we use the probabilities? A probability is a number in [0, 1], but under what condition should it be used?

@DanielLaszlo

Based on your example, let's assume that you have the following 2-grams:
{'ab': 2, 'bc': 2, 'ca': 1, 'cd': 1, 'da': 1}

So if, for example, you have the sequence 'bc' and want to generate the next character, you just look at the 3-grams that start with 'bc'. These are:
{'bca': 1, 'bcd': 1}
and of course you also have, among the 2-grams:
{'bc': 2}
so the probability of generating the character 'a' given the sequence 'bc' is defined as:
P(a|bc) = freq(bca) / freq(bc) = 1 / 2
Similarly the probability of generating the character 'd' given 'bc' is:
P(d|bc) = freq(bcd) / freq(bc) = 1 / 2

So whenever the sequence generated so far ends with 'bc', you should generate the character 'a' half the time and the character 'd' the other half.
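
The calculation above can be sketched in Python as follows. This is only an illustration: `count_ngram_freqs` is reimplemented here under the assumption that it counts overlapping character n-grams (which matches the output quoted in the question), and `next_char_probs` is a hypothetical helper name, not part of the homework.

```python
from collections import Counter

def count_ngram_freqs(text, n):
    """Count overlapping character n-grams (assumed behaviour of the homework function)."""
    return dict(Counter(text[i:i + n] for i in range(len(text) - n + 1)))

def next_char_probs(context, text, n):
    """Hypothetical helper: P(c | context) = freq(context + c) / freq(context)."""
    ngram_freqs = count_ngram_freqs(text, n)
    context_freq = count_ngram_freqs(text, n - 1)[context]
    return {
        gram[-1]: freq / context_freq
        for gram, freq in ngram_freqs.items()
        if gram.startswith(context)
    }

probs = next_char_probs("bc", "abcabcda", 3)
print(probs)  # {'a': 0.5, 'd': 0.5}
```

One caveat with this formula: if the context also occurs at the very end of the text (as 'da' does here), freq(context) counts an occurrence that has no following character, so the probabilities for that context sum to less than 1.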

@sevinjyolchuyeva
Author

Thanks for answering. In that situation, should the function output be 'bcd' or 'bca'?
Does that mean we had 2-grams and generated 3-grams?
For the exercise, does gen = generate_text("abc", 5, toy_freqs, 3) mean that we should generate 5-grams given 3-grams (or 2-grams)?

@juditacs
Collaborator

Not exactly.

5 is the length of the desired output. It could be much longer than 5, and you should test your solution with larger values such as 200 or 300.

N is the base of the generation. If N=3 and the string ends with bc, then the distribution used for sampling the next character is the distribution of all trigrams (3-grams) that start with bc. Changing @DanLszl's example a little, let's assume that we find the following trigrams starting with bc:

{'bca': 2, 'bcb': 1, 'bcc': 1}

You should generate a with probability 0.5, and b and c with probability 0.25 each (so on average a is generated 1 out of 2 times, and b and c 1 out of 4 times each). I changed the example to demonstrate that uniform sampling is NOT correct: not all trigrams are equally probable.

You don't need to, and shouldn't, generate n-grams longer than N.
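
A minimal sketch of such a generator is shown below. It is not the reference solution. The argument order follows the call `generate_text("abc", 5, toy_freqs, 3)` from the question, but the parameter names, the optional `rng` argument, the choice to count the starting text towards the output length, and the decision to stop early when no n-gram matches are all assumptions made for illustration.

```python
import random

def generate_text(starting_text, target_length, ngram_freqs, n, rng=random):
    """Extend starting_text one character at a time until it has target_length characters."""
    text = starting_text
    while len(text) < target_length:
        # The last n-1 characters are the context for the next character.
        context = text[-(n - 1):] if n > 1 else ""
        candidates = {g: f for g, f in ngram_freqs.items() if g.startswith(context)}
        if not candidates:
            break  # assumption: stop early if no n-gram continues the context
        grams = list(candidates)
        weights = [candidates[g] for g in grams]
        # Frequency-weighted (not uniform) sampling of the next n-gram.
        chosen = rng.choices(grams, weights=weights, k=1)[0]
        text += chosen[-1]
    return text

toy_freqs = {'abc': 2, 'bca': 1, 'cab': 1, 'bcd': 1, 'cda': 1}
gen = generate_text("abc", 5, toy_freqs, 3)
print(gen)  # either 'abcab' or 'abcda'
```

With the toy frequencies, longer outputs may stop before the requested length: once the text ends in 'da', no trigram starts with that context. How the homework expects this case to be handled is not stated in this thread.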

@juditacs juditacs changed the title Homework 1 Homework 1 - text generation Oct 10, 2017
@sevinjyolchuyeva
Author

Thanks so much.
