CSV option is greyed out #56

Open
abdulrehmanmian opened this issue Aug 9, 2020 · 3 comments
Comments

@abdulrehmanmian

I just got done with a 100 GB jsonl file, but the CSV option is greyed out. How can I solve this?

@mihirp161

mihirp161 commented Oct 5, 2020

That's huge for this app, I believe, so you may have to do the conversion yourself. If you don't have enough RAM, your best bet is to read the file in chunks in a Python or R environment: read a chunk, write it to csv, clear the memory, and repeat until the final line (you can search online; there are a lot of ways to do that). Or, if memory is limited, you can split the file from the Linux terminal first (I'm not sure about Windows, but there could be a similar method there too):

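The following command at a Linux prompt takes the jsonl file and splits it into chunks of 50,000 lines each. Note that split accepts a single input file, so the *.jsonl pattern must match exactly one file.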
split -l 50000 --additional-suffix=.jsonl *.jsonl ./FOLDER_WHERE_JSONL_FILE_IS/GIVE_OUTPUT_FILE_PREFIX_
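
For anyone without the Linux split tool (on Windows, for example), the same line-based splitting can be sketched in Python. This is only a minimal sketch, not something from the Hydrator itself: the function name split_jsonl, the chunk_ prefix, and the zero-padded numbering are made up for illustration.

```python
from itertools import islice
from pathlib import Path


def split_jsonl(src, out_dir, lines_per_chunk=50000, prefix="chunk_"):
    """Split a JSONL file into numbered files of at most lines_per_chunk lines.

    Hypothetical helper: the names and the numbering scheme are illustrative.
    Only one chunk is held in memory at a time.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(src, mode="r", encoding="utf-8") as fin:
        index = 0
        while True:
            # islice pulls the next lines_per_chunk lines without reading the rest
            chunk = list(islice(fin, lines_per_chunk))
            if not chunk:
                break
            out_path = out_dir / f"{prefix}{index:05d}.jsonl"
            with open(out_path, mode="w", encoding="utf-8") as fout:
                fout.writelines(chunk)
            written.append(out_path)
            index += 1
    return written
```

Calling split_jsonl("tweets.jsonl", "chunks") would write chunks/chunk_00000.jsonl, chunks/chunk_00001.jsonl, and so on; the input filename here is just a placeholder.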

I hope this helps. Good luck :-)

@rtrad89

rtrad89 commented Oct 12, 2020

Is #51 related?

PS. You may have closed the Hydrator too soon. You need to give it time until the CSV option becomes available, and then wait even longer for it to finish converting the file after you click it. If you close it in the middle of the conversion, the option stays greyed out no matter what.

@rtrad89

rtrad89 commented Oct 12, 2020

Here's a basic snippet of code in Python 3. Just replace 'tweets_20200501-V2.jsonl' with your jsonl filename, and replace 'tweets_20200501.csv' with the name you want for the output csv.

# -*- coding: utf-8 -*-
"""
Adapted from https://stackoverflow.com/a/46653313/3429115
"""

import json
import csv
import io
from datetime import datetime

'''
creates a .csv file using a Twitter .json file
the fields have to be set manually
'''

def extract_json(fileobj):
    """
    Iterates over an open JSONL file and yields
    decoded lines.  Closes the file once it has been
    read completely.
    """
    with fileobj:
        for line in fileobj:
            if line.strip():  # skip blank lines, e.g. an empty line at the end of the file
                yield json.loads(line)


data_json = io.open('tweets_20200501-V2.jsonl', mode='r', encoding='utf-8') #opens the JSONL file
data_python = extract_json(data_json)

csv_out = io.open('tweets_20200501.csv', mode='w', encoding='utf-8') #opens csv file


fields = u'id,created_at,retweet_id,user_screen_name,user_followers_count,user_friends_count,retweet_count,favourite_count,text' #field names
csv_out.write(fields)
csv_out.write(u'\n')

print(f"{datetime.utcnow()}: Output file created. Starting conversion..")

for i, line in enumerate(data_python):

    #writes a row and gets the fields from the json object
    #screen_name and followers/friends are found on the second level hence two get methods
    row = [line.get('id_str'),
           line.get('created_at'),
           line.get('retweeted_status').get('id_str') if line.get('retweeted_status') is not None else "",
           line.get('user').get('screen_name'),  
           str(line.get('user').get('followers_count')),
           str(line.get('user').get('friends_count')),
           str(line.get('retweet_count')),
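           str(line.get('favorite_count')),
           '"' + (line.get('full_text') or line.get('text') or '').replace('"','""') + '"', #falls back to 'text' if 'full_text' is missing; wraps in double quotes and escapes embedded ones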
           ]
    
    if i%100000 == 0 and i > 0:
        print(f"{datetime.utcnow()}: {i} tweets done...")

    row_joined = u','.join(row)
    csv_out.write(row_joined)
    csv_out.write(u'\n')

print("All tweets done. Saving the csv...")
csv_out.close()
print("Done.")
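
As a variation on the snippet above, the quoting can be left to Python's standard csv module instead of being done by hand. This is only a sketch under a few assumptions: the helper names tweet_to_row and jsonl_to_csv and the FIELDS list are made up for illustration, missing fields are written as empty cells, and 'text' is used when 'full_text' is absent.

```python
import csv
import json

# Column names chosen for this sketch; adjust them to the fields you need.
FIELDS = ["id", "created_at", "retweet_id", "user_screen_name",
          "user_followers_count", "user_friends_count",
          "retweet_count", "favorite_count", "text"]


def tweet_to_row(tweet):
    """Flatten one decoded tweet into a list matching FIELDS (hypothetical helper)."""
    user = tweet.get("user") or {}
    retweeted = tweet.get("retweeted_status") or {}
    return [
        tweet.get("id_str"),
        tweet.get("created_at"),
        retweeted.get("id_str"),
        user.get("screen_name"),
        user.get("followers_count"),
        user.get("friends_count"),
        tweet.get("retweet_count"),
        tweet.get("favorite_count"),
        tweet.get("full_text") or tweet.get("text") or "",
    ]


def jsonl_to_csv(src, dst):
    """Stream src line by line into dst and return the number of tweets written."""
    count = 0
    with open(src, mode="r", encoding="utf-8") as fin, \
         open(dst, mode="w", encoding="utf-8", newline="") as fout:
        writer = csv.writer(fout)  # handles commas, quotes and newlines in fields
        writer.writerow(FIELDS)
        for line in fin:
            if not line.strip():
                continue  # ignore blank lines
            writer.writerow(tweet_to_row(json.loads(line)))
            count += 1
    return count
```

Because the file is read one line at a time, memory use stays roughly constant regardless of file size. csv.writer writes None as an empty cell, which is why the missing values need no special handling here.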
