
Characters encoding in SQL server on AWS server #18

Open
nirbarazida opened this issue Jul 29, 2020 · 5 comments · Fixed by #22

Comments

@nirbarazida
Owner

When running the code on a local server, the default character set is utf8mb4 (up to 4 bytes per character).
The default character set on the AWS server is utf8 (up to 3 bytes per character), as shown in the picture.
In this case, special characters (Chinese, Russian, symbols, etc.) cannot be encoded, so the program crashes.
There are two ways to solve the issue:

  1. Set the character set on every connection made through pymysql (shown in the picture).
  2. Set an init-connection value on the AWS MySQL server.
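The two options above could look roughly like the sketch below. The helper names `mysql_connection_kwargs` and `open_connection` are hypothetical and are not part of the project. `charset` is a real keyword argument of `pymysql.connect`, and `init_connect` is a real MySQL server variable, but how it is set on AWS (for example through a parameter group) is an assumption that is not confirmed in this thread.

```python
def mysql_connection_kwargs(host, user, password, database):
    """Keyword arguments for pymysql.connect() that request utf8mb4 (option 1)."""
    return {
        "host": host,
        "user": user,
        "password": password,
        "database": database,
        "charset": "utf8mb4",  # applied to this connection only
    }


def open_connection(**kwargs):
    """Open a connection; pymysql is imported lazily so the helper above has no dependency."""
    import pymysql

    return pymysql.connect(**kwargs)


# Option 2 (assumption): the statement the server would run for each new client
# if it were assigned to the MySQL `init_connect` variable.
INIT_CONNECT_SQL = "SET NAMES utf8mb4"
```

Option 1 keeps the fix inside the application, while option 2 depends on the server configuration and does not apply to accounts that bypass `init_connect`.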

image
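To make the 3-byte versus 4-byte distinction concrete, here is a small sketch that measures how many bytes a character takes in UTF-8. In standard UTF-8, most Russian and Chinese characters fit within three bytes, while emoji and other characters beyond U+FFFF need four. Which characters actually failed on the server is not shown in this thread, so the samples below are illustrative only, and the function names are hypothetical.

```python
def utf8_length(ch):
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text):
    """True if any character in text needs 4 bytes, which MySQL's 3-byte utf8 cannot store."""
    return any(utf8_length(ch) == 4 for ch in text)


# Illustrative samples: Latin "a", Cyrillic U+0436, CJK U+4E2D, emoji U+1F600.
samples = {
    "Latin": "a",
    "Russian": "\u0436",
    "Chinese": "\u4e2d",
    "Emoji": "\U0001F600",
}

for label, ch in samples.items():
    print(label, utf8_length(ch), needs_utf8mb4(ch))
```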

@nirbarazida nirbarazida added the bug Something isn't working label Jul 29, 2020
@nirbarazida nirbarazida self-assigned this Jul 29, 2020
@nirbarazida
Owner Author

The solution is actually to set the character set when creating the engine, as shown in the picture below.

Code that was added: '?charset=utf8mb4'

Final code:

from sqlalchemy import create_engine

engine = create_engine(f"{config.SQL_EXTENSION}+{config.PYTHON_DBAPI}://{config.USER_NAME}:"
                       f"{config.PASSWORD}@localhost/{config.DB_NAME}?charset=utf8mb4")

image
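As a variation on the engine URL above, the sketch below builds the same kind of string in a helper and URL-escapes the user name and password, which matters if the password contains characters such as `@` or `/`. The function name `build_mysql_url` and the example values (`mysql`, `pymysql`, `scraper`, `stack_db`) are assumptions for illustration. The actual values of `config.SQL_EXTENSION` and `config.PYTHON_DBAPI` are not shown in this thread.

```python
from urllib.parse import quote_plus


def build_mysql_url(dialect, driver, user, password, host, db_name, charset="utf8mb4"):
    """Build a SQLAlchemy-style database URL that requests the given charset."""
    return (
        f"{dialect}+{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/{db_name}?charset={charset}"
    )


# Hypothetical values, for illustration only.
url = build_mysql_url("mysql", "pymysql", "scraper", "p@ss/word", "localhost", "stack_db")
print(url)
```

The resulting string can be passed to `create_engine` in place of the inline f-string.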

@nirbarazida
Owner Author

Although I managed once to insert a user with a special string that requires utf8mb4 into the database, it did not work again.

good:
image
bad:
image

I have tried several approaches to solve the problem, but none have worked.
From what I've found, this is a known problem.
I've decided to work around it by encoding to UTF-8 every string that Stack Exchange users can write freely.

All changes have been made in the user class.
All new variables have been created in the json file and the config class.

image
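A minimal sketch of that workaround, assuming each scraped user is held as a dictionary and that "encoding to UTF-8" means calling `str.encode("utf-8")` on the free-text fields. The field list and the function name are hypothetical. The real change lives in the user class and is only shown in the picture above.

```python
# Hypothetical list of fields that users can fill in freely.
FREE_TEXT_FIELDS = ("name", "location")


def encode_free_text(record):
    """Return a copy of record with its free-text fields encoded to UTF-8 bytes."""
    encoded = dict(record)
    for field in FREE_TEXT_FIELDS:
        value = encoded.get(field)
        if value is not None:  # leave missing fields untouched
            encoded[field] = value.encode("utf-8")
    return encoded


print(encode_free_text({"name": "Zo\u00eb", "reputation": 10}))
```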

@nirbarazida
Owner Author

For some users, all characters require encoding, which raises an AttributeError.
To handle this, the encoding is wrapped in a try/except block.
Because the user name isn't needed for the data analysis, it is replaced with the website name + user rank.

After a test run scraping 500 users, the code works fine.

image
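The fallback described above might be sketched as follows. This assumes the AttributeError comes from the scraped name being missing (`None` has no `encode` method). The thread does not confirm the cause, and the function and parameter names are hypothetical.

```python
def encode_user_name(raw_name, website_name, rank):
    """Encode the scraped name, or fall back to website name + user rank."""
    try:
        return raw_name.encode("utf-8")
    except AttributeError:
        # Assumption: raw_name is not a string (for example None), so build a
        # placeholder name from values that are always available.
        return f"{website_name}{rank}".encode("utf-8")


print(encode_user_name(None, "stackoverflow", 42))
print(encode_user_name("Nir", "stackoverflow", 42))
```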

@nirbarazida
Owner Author

The location caused an issue when querying the database to check whether the location string exists in the Stack_Exchange_Location table.
The location string therefore has to be encoded right after scraping it.

image

@nirbarazida nirbarazida mentioned this issue Jul 30, 2020
Merged
@nirbarazida nirbarazida reopened this Aug 1, 2020
@nirbarazida
Owner Author

This time the program crashed because of strings that use different types of letters.
Added another 'bandage', just to keep the server running, by checking whether the string is printable:

image
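One way such a check could look is sketched below. It assumes "printable" means ASCII printable, tested against `string.printable`; the actual check is only shown in the picture above. Note that the built-in `str.isprintable()` also returns True for non-ASCII letters, so it would not filter them out. The function name is hypothetical.

```python
import string

# ASCII letters, digits, punctuation and whitespace.
_ASCII_PRINTABLE = set(string.printable)


def is_ascii_printable(text):
    """True if every character in text is an ASCII printable character."""
    return all(ch in _ASCII_PRINTABLE for ch in text)


print(is_ascii_printable("Tel Aviv, Israel"))
print(is_ascii_printable("\u041c\u043e\u0441\u043a\u0432\u0430"))
```

A check like this keeps the scraper running, but it also rejects valid non-English names and locations, which is why it is only a stopgap.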

We must figure out how to use utf8mb4 on the AWS server.
I'm sure that the solution below will work:
image

green - the solution - setting the character set of the database when creating it.
yellow - insurance - setting the character set of specific tables and columns.

I think you've changed the MySQL config to latin1, which is why it doesn't work.
image
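The database-level and table-level settings described above might translate into SQL like the statements generated below. This is a sketch only: the database name `stack_db`, the collation, and the function name are assumptions, and the exact statements in the picture are not reproduced here. `Stack_Exchange_Location` is the table mentioned earlier in this issue.

```python
def utf8mb4_statements(db_name, tables, collation="utf8mb4_unicode_ci"):
    """Build SQL that sets utf8mb4 on a database and converts existing tables."""
    statements = [
        # "green": set the character set when creating the database.
        f"CREATE DATABASE IF NOT EXISTS `{db_name}` "
        f"CHARACTER SET utf8mb4 COLLATE {collation};"
    ]
    for table in tables:
        # "yellow": convert specific tables (and their text columns) as insurance.
        statements.append(
            f"ALTER TABLE `{db_name}`.`{table}` "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


# Query to inspect the server's current character set settings (e.g. to spot latin1).
CHECK_SERVER_CHARSET = "SHOW VARIABLES LIKE 'character_set%';"

for statement in utf8mb4_statements("stack_db", ["Stack_Exchange_Location"]):
    print(statement)
```

Running the `SHOW VARIABLES` query on the AWS server would show whether the server or database defaults are still set to latin1.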
