Add obfuscator - Second Draft #202

Open
Tony911029 opened this issue Dec 3, 2024 · 0 comments

Tony911029 commented Dec 3, 2024

State of first draft data obfuscation:

  • We have a logging obfuscation function that simulates how patients log their meals:
  1. All meals - keep all meals for now

  2. Multiple meals per day (1-2 largest meals) - Find a threshold so that we have an average of 1.8 meals logged per day

  3. Once per day (largest meal) - Find the largest one in a day

  4. A few times per week - Find a threshold so that we have an average of 3 meals logged per week

  5. Never - Wipe all data
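
The threshold-based rules above could be sketched roughly as follows. This is only an illustration under assumptions: meals are represented as `(day, carbs)` tuples, "largest" is taken to mean highest carbs, and the function names `threshold_for_target_rate` and `largest_meal_per_day` are hypothetical, not the ones in the repo.

```python
def threshold_for_target_rate(meals, target_per_day):
    """Return a size threshold so that keeping meals with size >= threshold
    leaves roughly `target_per_day` logged meals per day on average.

    `meals` is a list of (day, carbs) tuples (an assumed layout).
    """
    n_days = len({day for day, _ in meals})
    k = round(target_per_day * n_days)  # total number of meals to keep
    if k <= 0:
        return float("inf")  # keep nothing
    sizes = sorted((carbs for _, carbs in meals), reverse=True)
    k = min(k, len(sizes))
    # The k-th largest meal size becomes the cut-off (ties may keep a few more).
    return sizes[k - 1]


def largest_meal_per_day(meals):
    """Keep only the largest meal of each day, returned as sorted (day, carbs)."""
    best = {}
    for day, carbs in meals:
        if day not in best or carbs > best[day]:
            best[day] = carbs
    return sorted(best.items())


# Toy data: 5 days with 3 meals each.
meals = [(d, c + d) for d in range(5) for c in (10, 40, 70)]
threshold = threshold_for_target_rate(meals, 1.8)
kept = [m for m in meals if m[1] >= threshold]
print(len(kept) / 5)  # average meals logged per day on the toy data
```

The "a few times per week" case can reuse the same helper with a target of `3 / 7` meals per day.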

  • We have a logging timing habit function that simulates when patients actually log their meals.
  1. Temporally right skewed -> forgetful loggers - Right-skewed gamma distribution. Fixed-value distribution with minor randomness.

  2. Temporally left skewed -> hasty loggers - Left-skewed gamma distribution (less skewed, because a patient probably won't log a meal too early most of the time). Fixed-value distribution with minor randomness.

  3. Normal distribution - Gaussian distribution with a fixed-value spread

  4. Unchanged
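
A minimal sketch of the four timing habits, assuming shifts are expressed in minutes relative to the true meal time and that the left-skewed case is modelled as a negated gamma draw. The shape/scale values and the names `sample_shift_minutes` and `shift_log_time` are placeholders for illustration, not the current implementation or tuned parameters.

```python
import random
from datetime import datetime, timedelta


def sample_shift_minutes(habit, rng):
    """Draw a logging-time shift in minutes (positive = logged late).

    All parameters below are hypothetical and would need tuning.
    """
    if habit == "forgetful":
        # Gamma is right-skewed: most delays are short, a few are long.
        return rng.gammavariate(2.0, 10.0)  # mean = shape * scale = 20 min
    if habit == "hasty":
        # Negating a gamma draw gives a left-skewed distribution.
        # A larger shape means less skew (skewness = 2 / sqrt(shape)).
        return -rng.gammavariate(4.0, 1.5)  # mean = 6 min early
    if habit == "normal":
        return rng.gauss(0.0, 5.0)  # symmetric, fixed spread
    if habit == "unchanged":
        return 0.0
    raise ValueError(f"unknown habit: {habit!r}")


def shift_log_time(meal_time, habit, rng):
    """Return the simulated time at which the meal gets logged."""
    return meal_time + timedelta(minutes=sample_shift_minutes(habit, rng))


rng = random.Random(42)  # seeded for reproducibility
meal_time = datetime(2024, 12, 3, 12, 30)
print(shift_log_time(meal_time, "forgetful", rng))
```

Passing the `random.Random` instance explicitly keeps the obfuscated output reproducible for a given seed.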

Data flow:

`data/raw/sim` -> logging obfuscation function to create `msg_type_log` -> logging timing habit function to create `msg_type_log_shifted` from `msg_type_log` -> `data/raw/obfuscated`

Improvement:

  1. Find the right proportion of each user type for both functions. For example, patients who log all of their meals might make up 25% of users rather than 30%.

  2. Fine-tune the default distribution (we need better parameters for the gamma distribution to reflect the true behaviour of patients) or find a better distribution.

  3. The left- and right-skewed distributions should differ. Hasty loggers might log their meals about 10 mins early on average and probably not much earlier than that, whereas forgetful loggers may log more than 40 mins late.

  4. Remove the original csv file when generating a new file name (bug)

  5. Investigate new line characters at the end of some files (bug?)

  6. Clean up the columns produced by the simulation_data_generation script. We have an Unnamed: 0 column that we should probably drop.
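
For improvement 1, one option is to make the user-type proportions an explicit, validated config instead of hard-coded values. The sketch below is only an assumption about how that could look: the type names, the `assign_logging_types` helper, and every weight except the 25% example mentioned above are made up for illustration.

```python
import math
import random

# Hypothetical proportions; only the 25% for "all_meals" echoes the example above.
LOGGING_TYPE_WEIGHTS = {
    "all_meals": 0.25,
    "multiple_per_day": 0.25,
    "once_per_day": 0.20,
    "few_per_week": 0.20,
    "never": 0.10,
}


def assign_logging_types(patient_ids, weights, seed=None):
    """Randomly assign each patient a logging type according to `weights`."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("type proportions must sum to 1")
    rng = random.Random(seed)
    types = list(weights)
    picks = rng.choices(types, weights=list(weights.values()), k=len(patient_ids))
    return dict(zip(patient_ids, picks))


assignment = assign_logging_types(["p01", "p02", "p03"], LOGGING_TYPE_WEIGHTS, seed=0)
print(assignment)
```

The same pattern would apply to the timing habit types with a second weights dict.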

@Tony911029 Tony911029 self-assigned this Dec 3, 2024
@RobotPsychologist RobotPsychologist added the enhancement New feature or request label Dec 3, 2024