Parsing the xP3 dataset #23

hmubarak · 2023-11-19T12:38:28Z

Q1: I am trying to extract the Arabic instructions from the xP3 dataset, and I want to put them in the format: “Instruction”, “Input”, and “Output”. Currently, the data is in this format: “inputs” and “targets”.

I found that the instructions sometimes are in the last part of the “inputs” and preceded by \n, and sometimes without any delimiter. In other cases, the instructions are at the beginning of the “inputs”, etc.

Here is an example where the instruction is at the end of the input, but without any delimiter to recognize it.
File: xp3_GEM_xlsum_arabic_train_xp3longrest.jsonl
{"inputs":"...\nووسط هذه القلة يقف أيضا شقيقيها الفنان فيصل لعيبي، الذي أثر كثيرا في تطورها الفني وباتت تشكل معه ثنائيا فنيا مميزا، يجعل من أعمالهما في حوار دائم، فتحمل كثيرا من الوشائج والتشابهات الأسلوبية والشكلية لكنها تفترق في التوجه. ففي الوقت الذي يسعى فيصل إلى تأصيل فنه في قلب منجز الرسم العراقي بالتركيز على الخصوصية العراقية واللمسة المحلية والنهل من التراث الفني الرافديني في مراحله المختلفة وعكسه بلغة فنية حداثوية معاصرة، تسعى عفيفة إلى تمييز نفسها عنه بالتحليق في فضاء إنساني عام، مبعدة لوحاتها عن أي ... Write the rest of the article:","targets":"حوارا مع نظرتها ويركز عليها ليكتشف أنها تنظر في مكان آخر أو ربما في ماضٍ بعيد. \n\nوتعنى عفيفة باختيار

Q2: I found many incomplete inputs and outputs, ex: having this string:
"... Continue the article for another 4000 characters max:","targets":"."}
What should we do in such cases?
Thanks
Hamdy Mubarak

Muennighoff · 2023-11-19T15:58:53Z

Q1: You can either rely on heuristics - With arabic I guess you can just look for the first latin letters or you reprocess the dataset and separate the instructions

Q2: You can remove them

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parsing the xP3 dataset #23

Parsing the xP3 dataset #23

hmubarak commented Nov 19, 2023

Muennighoff commented Nov 19, 2023

Parsing the xP3 dataset #23

Parsing the xP3 dataset #23

Comments

hmubarak commented Nov 19, 2023

Muennighoff commented Nov 19, 2023