Homework 1 - text generation #3

sevinjyolchuyeva · 2017-10-08T21:47:37Z

Could you please explain how to continue with Exercise 3.2 (Define a text generator function)?

word='abcabcda'
toy_freqs = count_ngram_freqs("abcabcda", 3) : {'abc': 2, 'bca': 1, 'cab': 1, 'bcd': 1, 'cda': 1}

How should we use the probabilities? A probability is a number in [0, 1], but under what condition do we use it?

DanielLaszlo · 2017-10-08T22:21:22Z

Based on your example, let's assume that you have the following 2-grams:
{'ab': 2, 'bc': 2, 'ca': 1, 'cd': 1, 'da': 1}

So if, e.g., you have the sequence 'bc' and you want to generate the next character for it, you look at the 3-grams that start with 'bc'. These are:
{'bca': 1, 'bcd': 1}
and of course you also have among the 2-grams:
{'bc': 2}
so the probability of generating the character 'a' given the sequence 'bc' is defined as:
P(a|bc) = freq(bca) / freq(bc) = 1 / 2
Similarly the probability of generating the character 'd' given 'bc' is:
P(d|bc) = freq(bcd) / freq(bc) = 1 / 2

So whenever the already generated sequence ends with 'bc', you should generate the character 'a' half the time and the character 'd' the other half.
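The ratio above can be checked with a short sketch. The body of `count_ngram_freqs` here is only a guess at what the homework helper does (the thread shows its call and output, not its code), and `cond_prob` is a hypothetical name introduced for illustration.

```python
from collections import Counter


def count_ngram_freqs(text, n):
    # Assumed implementation: count every character n-gram in `text`.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cond_prob(char, prefix, text):
    # Hypothetical helper: P(char | prefix) = freq(prefix + char) / freq(prefix).
    longer = count_ngram_freqs(text, len(prefix) + 1)
    shorter = count_ngram_freqs(text, len(prefix))
    return longer[prefix + char] / shorter[prefix]


word = "abcabcda"
print(cond_prob("a", "bc", word))  # freq(bca) / freq(bc)
print(cond_prob("d", "bc", word))  # freq(bcd) / freq(bc)
```

Note that this sketch raises `ZeroDivisionError` if the prefix never occurs in the text, so a real solution would need to handle that case.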

sevinjyolchuyeva · 2017-10-10T14:44:02Z

Thanks for answering. In that situation, should the function output be 'bcd' or 'bca'?
Does it mean that we had 2-grams and we generated 3-grams?
For the exercise, does gen = generate_text("abc", 5, toy_freqs, 3) mean that we should generate 5-grams given 3-grams (or 2-grams)?

juditacs · 2017-10-10T14:56:28Z

Not exactly.

5 is the length of the desired output. It could be much longer than 5 and you should test your solution for larger values such as 200 or 300.

N is the base of the generation. If N=3 and the string ends with 'bc', then the distribution used for sampling the next character is the distribution of all trigrams (3-grams) that start with 'bc'. Changing @DanLszl's example a little bit, let's assume that we find the following trigrams starting with 'bc':

{'bca': 2, 'bcb': 1, 'bcc': 1}

Here you should generate 'a' with probability 0.5 and 'b' and 'c' with probability 0.25 each (so 'a' is generated 1 out of 2 times, and 'b' and 'c' 1 out of 4 times each, on average). I changed the example to demonstrate that uniform sampling is NOT correct: not all trigrams are equally probable.

You don't need to and shouldn't generate longer n-grams than N.
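A minimal sketch of the sampling step described above, not the reference solution. The signature mirrors the call `generate_text("abc", 5, toy_freqs, 3)` from the question. Three things here are assumptions: the output length includes the starting text, generation stops early when no n-gram continues the current context, and the optional `rng` argument is added only so the example can be seeded.

```python
import random


def generate_text(start, length, freqs, n, rng=random):
    text = start
    while len(text) < length:
        # The context is the last n-1 characters generated so far.
        context = text[-(n - 1):] if n > 1 else ""
        # Keep only the n-grams that start with the context.
        candidates = {g: c for g, c in freqs.items() if g.startswith(context)}
        if not candidates:
            break  # dead end: nothing in freqs continues this context
        grams = list(candidates)
        weights = [candidates[g] for g in grams]
        # Weighted (not uniform) sampling, proportional to the counts.
        chosen = rng.choices(grams, weights=weights)[0]
        text += chosen[-1]
    return text


toy_freqs = {'abc': 2, 'bca': 1, 'cab': 1, 'bcd': 1, 'cda': 1}
print(generate_text("abc", 5, toy_freqs, 3))
```

With the toy counts, 'da' has no continuation, so asking for a much longer output can stop early. A larger training text makes such dead ends rarer.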

sevinjyolchuyeva · 2017-10-13T12:55:14Z

Thanks so much.

DavidNemeskey added question help wanted and removed question labels Oct 9, 2017

juditacs added the homework label Oct 10, 2017

juditacs changed the title ~~Homework 1~~ Homework 1 - text generation Oct 10, 2017
