Commit

Merge pull request #13 from logmanager-oss/implement-custom-anonymization-mappings

implement custom anonymization mappings
tender-barbarian authored Nov 22, 2024
2 parents 24cf4e5 + 41bd69a commit d0c6170
Showing 12 changed files with 264 additions and 127 deletions.
110 changes: 71 additions & 39 deletions README.md
@@ -26,15 +26,17 @@ Usage of ./logveil:
-d value
Path to directory with anonymizing data
-i value
Path to input file containing logs to be anonymized
Path to input file containing logs to be anonymized (mandatory - if you don't specify input, code will fail)
-o value
Path to output file (default: Stdout)
-c value
Path to input file with custom anonymization mapping
-v
Enable verbose logging
-e
Change input file type to LM export (default: LM Backup)
-p
Disable proof wrtier (default: Enabled)
Disable proof writer (default: Enabled)
-h
Help for logveil
```
@@ -61,72 +63,102 @@ Usage of ./logveil:

`./logveil -d example_anon_data/ -e -i lm_export.csv -p -v`

### How it works
6. Read log data from LM Export file (CSV), output anonymization result to standard output (STDOUT) and load custom mapping from custom_mapping.txt

**This is only a simplified example and does not match 1:1 with how anonymization is actually implemented**
`./logveil -d example_anon_data/ -e -i lm_export.csv -c custom_mapping.txt`

Consider below log line. It is formatted in a common `key:value` format.

## Anonymization functionality

There are three ways LogVeil anonymizes data:

### Custom anonymization mappings

You can provide custom anonymization mappings for LogVeil to use. They will take precedence over any other anonymization functionality.

Custom mappings can be enabled by using flag `-c <file_path>` and must have the following format:

`<original_value>:<new_value>`

Each custom mapping must be separated by new line. For example:

`test_custom_replacement:test_custom_replacement123`\
`replace_this:with_that`\
`test123:test1234`
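The mapping format above can be parsed in a few lines. This is a minimal sketch, not LogVeil's actual loader: `parseCustomMappings` is a hypothetical helper, and it assumes each line is split on the first colon only, so an original value that itself contains a colon (a MAC address, for example) would need a different convention.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseCustomMappings reads "<original_value>:<new_value>" pairs, one per line.
// Hypothetical helper for illustration; not LogVeil's actual loader.
func parseCustomMappings(content string) (map[string]string, error) {
	mappings := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(content))
	lineNo := 0
	for scanner.Scan() {
		lineNo++
		line := strings.TrimSpace(scanner.Text())
		if line == "" {
			continue // skip blank lines
		}
		// Assumption: only the first colon separates original from new value.
		original, replacement, found := strings.Cut(line, ":")
		if !found || original == "" {
			return nil, fmt.Errorf("line %d: expected <original_value>:<new_value>, got %q", lineNo, line)
		}
		mappings[original] = replacement
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	return mappings, nil
}

func main() {
	m, err := parseCustomMappings("replace_this:with_that\ntest123:test1234\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(m["replace_this"], m["test123"])
}
```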

### Anonymization data

You can also provide sets of fake data to use when anonymizing.

Consider below log line:

```
{"@timestamp": "2024-06-05T14:59:27.000+00:00", "src_ip":"89.239.31.49", "username":"[email protected]", "organization":"TESTuser.test.com", "mac": "71:e5:41:18:cb:3e"}
{"@timestamp": "2024-06-05T14:59:27.000+00:00", "src_ip":"89.239.31.49", "username":"[email protected]", "organization":"TESTuser.test.com", "mac": "71:e5:41:18:cb:3e", "replacement_test":"replace_this"}
```

First, LogVeil will load anonymization data from supplied directory (`-d example_anon_data/`). Each file in that folder should be named according to the values it will be masking. For example, lets assume we have following directory structure:
If you want to anonymize values in the `organization` and `username` keys, you need two files named after those keys in the anonymization data folder, and you need to enable them with the `-d <path_to_fake_data_folder>` flag.

1. `username.txt`
2. `organization.txt`

Next, LogVeil will go over each log line in supplied input and extract `key:value` pairs from it. When applied to above log line it would look like this:
Both files should contain appropriate fake data for the values they will be masking.

### Regexp scanning and dynamic fake data generation

1. `"@timestamp": "2024-06-05T14:59:27.000+00:00"`
2. `"src_ip":"89.239.31.49"`
3. `"username":"[email protected]"`
4. `"organization":"TESTuser.test.com"`
5. `"mac": "71:e5:41:18:cb:3e"`
LogVeil uses regular expressions to look for common patterns: IP addresses (v4 and v6), emails, MAC addresses and URLs. Whenever such a pattern is found, it is replaced with fake data generated on the fly.
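As a rough illustration of pattern scanning, the sketch below replaces IPv4 and MAC addresses in a log line. The patterns are deliberately simplified and the names `ipv4Pattern`, `macPattern` and `scanAndReplace` are hypothetical; LogVeil's own patterns and generators may differ. Fixed generator functions are used so the example is deterministic.

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified patterns for illustration only (assumptions, not LogVeil's own).
var (
	ipv4Pattern = regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`)
	macPattern  = regexp.MustCompile(`\b(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}\b`)
)

// scanAndReplace replaces every match of pattern in line with a value
// produced by generate.
func scanAndReplace(line string, pattern *regexp.Regexp, generate func() string) string {
	return pattern.ReplaceAllStringFunc(line, func(string) string {
		return generate()
	})
}

func main() {
	line := `"src_ip":"89.239.31.49", "mac": "71:e5:41:18:cb:3e"`
	// Fixed generators keep the example deterministic; a real generator
	// would return random values.
	line = scanAndReplace(line, ipv4Pattern, func() string { return "10.20.0.53" })
	line = scanAndReplace(line, macPattern, func() string { return "0f:da:68:92:7f:2b" })
	fmt.Println(line)
}
```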

Then, LogVeil will try to match extracted pairs to anonymization data it loaded in previous step. Two paris should be matched:
## Output

1. `"src_ip":"89.239.31.49"` with `src_ip.txt`
2. `"username":"[email protected]"` with `username.txt`
3. `"organization":"TESTuser.test.com"` with `organization.txt`
Anonymized data will be written to the provided file path in txt format. Alternatively, if you don't provide an output file path, it will be written to the console (stdout).

And one pair should be matched by regular expression scanning:
Additionally, LogVeil will write an anonymization proof to `proof.json` to show which values were anonymized. The proof has the following format:

1. `"mac": "71:e5:41:18:cb:3e"`
```
{"original":"<original_value>", "new":"<new_value>"}
```
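A record in this format can be produced with the standard `encoding/json` package. The sketch below is an assumption about how such a record might be encoded; `proofRecord` and `encodeProof` are hypothetical names, not LogVeil's proof writer.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// proofRecord mirrors the proof format shown in the README.
// The type name is hypothetical.
type proofRecord struct {
	Original string `json:"original"`
	New      string `json:"new"`
}

// encodeProof serializes one original/new pair as a single JSON object.
func encodeProof(original, replacement string) (string, error) {
	b, err := json.Marshal(proofRecord{Original: original, New: replacement})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	line, err := encodeProof("replace_this", "with_that")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(line)
}
```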

Now LogVeil will grab values (randomly) from files which filenames matched with keys, generate new value for `mac` key and create a replacement map in format `"original_value":"new_value"`:
## How it works

1. `"89.239.31.49":"10.20.0.53"`
1. `"[email protected]":"ladislav.dosek"`
2. `"TESTuser.test.com":"Apple"`
3. `"71:e5:41:18:cb:3e": "0f:da:68:92:7f:2b"`
**This is only a simplified example and does not match 1:1 with how anonymization is actually implemented**

Now each element from the above list will be iterated over and compared against log line. Whenever `original_value` is found it will be replaced with `new_value`. Outcome should look like this:
Consider below log line. It is formatted in a common `key:value` format.

```
{"@timestamp": "2024-06-05T14:59:27.000+00:00", "src_ip":"10.20.0.53", "username":"ladislav.dosek", "organization":"Apple", "mac": "0f:da:68:92:7f:2b"}
{"@timestamp": "2024-06-05T14:59:27.000+00:00", "src_ip":"89.239.31.49", "username":"[email protected]", "organization":"TESTuser.test.com", "mac": "71:e5:41:18:cb:3e", "replacement_test":"replace_this"}
```

```
{"original": "27.221.126.209", "new": "10.20.0.53"},
"{"original":"[email protected]","new":"ladislav.dosek"}"
"{"original":"TESTuser.test.com","new":"Apple"}"
{"original": "71:e5:41:18:cb:3e", "new": "0f:da:68:92:7f:2b"},
```
First, LogVeil will load anonymization data from the supplied directory (`-d example_anon_data/`). Each file in that folder should be named after the key whose values it will mask. For example, let's assume we have the following directory structure:

### Anonymization data
1. `username.txt`
2. `organization.txt`

Second, if available, LogVeil will load the custom anonymization mapping from the user-supplied path. For example, assume we have a file `custom_mapping.txt` with the following content:

1. `test_custom_replacement:test_custom_replacement123`
2. `replace_this:with_that`
3. `test123:test1234`

Each `key:value` pair which you want to anonymize data must have its equivalent in anonymization data folder.
Now the anonymization process can start. LogVeil will take log lines from the supplied input, one by one, and anonymize each in three steps:

If anonymization data does not exist for any given `key:value` pair then LogVeil will attempt to use regular expressions to match and replace common values such as: IPv4, IPv6, MAC, Emails and URLs.
1. Replace values based on custom anonymization mapping
2. Replace values based on loaded anonymization data
3. Replace values based on regular expression matching and fake data generation
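The order of the three steps can be sketched as a simple precedence check. This is a simplified illustration under stated assumptions, not the actual implementation: `pickReplacement` is a hypothetical function, it always takes the first fake value instead of a random one, and it only recognizes MAC addresses in the regular expression step.

```go
package main

import (
	"fmt"
	"regexp"
)

// Simplified MAC pattern, used here only to stand in for step 3.
var macPattern = regexp.MustCompile(`^(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}$`)

// pickReplacement applies the three steps in order and reports which one matched.
func pickReplacement(field, value string, custom map[string]string, anonData map[string][]string) (string, string) {
	// Step 1: custom mappings take precedence over everything else.
	if v, ok := custom[value]; ok {
		return v, "custom mapping"
	}
	// Step 2: fake data loaded for this field name.
	// Assumption: take the first entry; the real tool selects randomly.
	if values, ok := anonData[field]; ok && len(values) > 0 {
		return values[0], "anonymization data"
	}
	// Step 3: regular expression match with a generated value (fixed here).
	if macPattern.MatchString(value) {
		return "0f:da:68:92:7f:2b", "regexp scan"
	}
	return value, "unchanged"
}

func main() {
	custom := map[string]string{"replace_this": "with_that"}
	anonData := map[string][]string{"organization": {"Apple"}}
	fields := [][2]string{
		{"replacement_test", "replace_this"},
		{"organization", "TESTuser.test.com"},
		{"mac", "71:e5:41:18:cb:3e"},
	}
	for _, f := range fields {
		newValue, step := pickReplacement(f[0], f[1], custom, anonData)
		fmt.Printf("%s: %s -> %s (%s)\n", f[0], f[1], newValue, step)
	}
}
```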

For example, if you want to anonymize values in `organization` and `username` keys, you need to have two files of the same name in anonymization folder containing some random data.
Final output should look like this:

### Output
```
{"@timestamp": "2024-06-05T14:59:27.000+00:00", "src_ip":"10.20.0.53", "username":"ladislav.dosek", "organization":"Apple", "mac": "0f:da:68:92:7f:2b", "replacement_test":"with_that"}
```

Anonymized data will be outputted to provided file path in txt format.
And anonymization proof:

Alternatively, if you don't provide file path, output will be written to the console.
```
{"original":"replace_this", "new":"with_that"}
{"original":"89.239.31.49", "new":"10.20.0.53"}
{"original":"[email protected]", "new":"ladislav.dosek"}
{"original":"TESTuser.test.com", "new":"Apple"}
{"original":"71:e5:41:18:cb:3e", "new":"0f:da:68:92:7f:2b"}
```

## Release

52 changes: 34 additions & 18 deletions internal/anonymizer/anonymizer.go
@@ -4,7 +4,6 @@ import (
"fmt"
"log/slog"
"regexp"
"strings"

"github.com/logmanager-oss/logveil/internal/config"
"github.com/logmanager-oss/logveil/internal/generator"
@@ -16,31 +15,38 @@ import (

// Anonymizer represents an object responsible for anonymizing individual log lines fed to it. It contains anonymization data which will be used to anonymize input and a random number generator function used to select values from anonymization data.
type Anonymizer struct {
anonData map[string][]string
randFunc func(int) int
proofWriter *proof.ProofWriter
lookup *lookup.Lookup
generator *generator.Generator
replacementMap map[string]string
anonymizationData map[string][]string
customAnonymizationMapping map[string]string
randFunc func(int) int
proofWriter *proof.ProofWriter
lookup *lookup.Lookup
generator *generator.Generator
replacementMap map[string]string
}

func CreateAnonymizer(config *config.Config, proofWriter *proof.ProofWriter) (*Anonymizer, error) {
anonymizingData, err := loader.Load(config.AnonymizationDataPath)
customAnonymizationMapping, err := loader.LoadCustomAnonymizationMapping(config.CustomAnonymizationMappingPath)
if err != nil {
return nil, fmt.Errorf("loading custom anonymization mappings from path %s: %v", config.CustomAnonymizationMappingPath, err)
}

anonymizationData, err := loader.LoadAnonymizationData(config.AnonymizationDataPath)
if err != nil {
return nil, fmt.Errorf("loading anonymizing data from dir %s: %v", config.AnonymizationDataPath, err)
}

return &Anonymizer{
anonData: anonymizingData,
randFunc: rand.Intn,
proofWriter: proofWriter,
lookup: lookup.New(),
generator: &generator.Generator{},
anonymizationData: anonymizationData,
customAnonymizationMapping: customAnonymizationMapping,
randFunc: rand.Intn,
proofWriter: proofWriter,
lookup: lookup.New(),
generator: &generator.Generator{},
}, nil
}

func (an *Anonymizer) Anonymize(logLine map[string]string) string {
an.replacementMap = make(map[string]string)
an.replacementMap = an.customAnonymizationMapping

an.loadAndReplace(logLine)

@@ -51,7 +57,6 @@ func (an *Anonymizer) Anonymize(logLine map[string]string) string {
an.generateAndReplace(logLineRaw, an.lookup.ValidEmail, an.generator.GenerateRandomEmail())
an.generateAndReplace(logLineRaw, an.lookup.ValidUrl, an.generator.GenerateRandomUrl())

an.proofWriter.Write(an.replacementMap)
an.proofWriter.Flush()

return an.replace(logLineRaw)
@@ -76,7 +81,7 @@ func (an *Anonymizer) loadAndReplace(logLine map[string]string) {
continue
}

if anonValues, exists := an.anonData[field]; exists {
if anonValues, exists := an.anonymizationData[field]; exists {
newAnonValue := anonValues[an.randFunc(len(anonValues))]
an.replacementMap[value] = newAnonValue

@@ -98,8 +103,19 @@ func (an *Anonymizer) generateAndReplace(rawLog string, regexp *regexp.Regexp, g
}

func (an *Anonymizer) replace(rawLog string) string {
for oldValue, newValue := range an.replacementMap {
rawLog = strings.ReplaceAll(rawLog, oldValue, newValue)
for originalValue, newValue := range an.replacementMap {
// Added word boundaries to avoid matching a word within another word. For example "test" in "testing".
r := regexp.MustCompile(fmt.Sprintf(`\b%s\b`, originalValue))

var found bool
rawLog = r.ReplaceAllStringFunc(rawLog, func(originalValue string) string {
found = true
return newValue
})

if found {
an.proofWriter.Write(originalValue, newValue)
}
}

return rawLog
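
The word-boundary replacement used in `replace` can be tried in isolation. The sketch below is a standalone variant, not the committed code: `replaceWholeWord` is a hypothetical helper, and it additionally passes the value through `regexp.QuoteMeta`, which is an assumption about how values containing regexp metacharacters (such as the dots in an IP address) could be matched literally.

```go
package main

import (
	"fmt"
	"regexp"
)

// replaceWholeWord replaces original with replacement only where it appears
// between word boundaries, and reports whether anything was replaced.
func replaceWholeWord(text, original, replacement string) (string, bool) {
	// QuoteMeta escapes regexp metacharacters so that, for example,
	// the dots in "10.10.10.1" match only literal dots.
	pattern := regexp.MustCompile(`\b` + regexp.QuoteMeta(original) + `\b`)

	found := false
	result := pattern.ReplaceAllStringFunc(text, func(string) string {
		found = true
		return replacement
	})
	return result, found
}

func main() {
	out, found := replaceWholeWord("test testing test123", "test", "demo")
	fmt.Println(out, found)

	out, found = replaceWholeWord("10.10.10.1 and 10x10y10z1", "10.10.10.1", "10.20.0.53")
	fmt.Println(out, found)
}
```

Note that `\b` only matches next to a word character, so a value that starts or ends with punctuation may need different handling.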
23 changes: 13 additions & 10 deletions internal/anonymizer/anonymizer_test.go
@@ -12,32 +12,35 @@ import (

func TestAnonimizer_AnonymizeData(t *testing.T) {
tests := []struct {
name string
anonymizingDataDir string
input map[string]string
expectedOutput string
name string
anonymizationDataDir string
customAnonymizationMappingPath string
input map[string]string
expectedOutput string
}{
{
name: "Test AnonymizeData",
anonymizingDataDir: "../../tests/data/anonymization_data",
name: "Test AnonymizeData",
anonymizationDataDir: "../../tests/data/anonymization_data",
customAnonymizationMappingPath: "../../tests/data/custom_mappings.txt",
input: map[string]string{
"@timestamp": "2024-06-05T14:59:27.000+00:00",
"src_ip": "10.10.10.1",
"src_ipv6": "7f1d:64ed:536a:1fd7:fe8e:cc29:9df4:7911",
"mac": "71:e5:41:18:cb:3e",
"email": "test@test.com",
"email": "atest@atest.com",
"url": "https://www.testurl.com",
"username": "miloslav.illes",
"organization": "Microsoft",
"raw": "2024-06-05T14:59:27.000+00:00, 10.10.10.1, 7f1d:64ed:536a:1fd7:fe8e:cc29:9df4:7911, miloslav.illes, Microsoft, 71:e5:41:18:cb:3e, [email protected], https://www.testurl.com",
"custom:": "replacement_test",
"raw": "2024-06-05T14:59:27.000+00:00, 10.10.10.1, 7f1d:64ed:536a:1fd7:fe8e:cc29:9df4:7911, miloslav.illes, Microsoft, 71:e5:41:18:cb:3e, [email protected], https://www.testurl.com, replace_this",
},
expectedOutput: "2024-06-05T14:59:27.000+00:00, 10.20.0.53, 8186:39ac:48a4:c6af:a2f1:581a:8b95:25e2, ladislav.dosek, Apple, 0f:da:68:92:7f:2b, [email protected], http://soqovkq.com/NfkcUjG.php",
expectedOutput: "2024-06-05T14:59:27.000+00:00, 10.20.0.53, 8186:39ac:48a4:c6af:a2f1:581a:8b95:25e2, ladislav.dosek, Apple, 0f:da:68:92:7f:2b, [email protected], http://soqovkq.com/NfkcUjG.php, with_that",
},
}

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
anonymizer, err := CreateAnonymizer(&config.Config{AnonymizationDataPath: tt.anonymizingDataDir}, &proof.ProofWriter{IsEnabled: false})
anonymizer, err := CreateAnonymizer(&config.Config{AnonymizationDataPath: tt.anonymizationDataDir, CustomAnonymizationMappingPath: tt.customAnonymizationMappingPath}, &proof.ProofWriter{IsEnabled: false})
if err != nil {
t.Fatal(err)
}
24 changes: 18 additions & 6 deletions internal/config/config.go
@@ -2,16 +2,19 @@ package config

import (
"flag"
"fmt"
"os"
)

// Config represents user supplied program input
type Config struct {
AnonymizationDataPath string
InputPath string
OutputPath string
IsVerbose bool
IsLmExport bool
IsProofWriter bool
AnonymizationDataPath string
InputPath string
OutputPath string
CustomAnonymizationMappingPath string
IsVerbose bool
IsLmExport bool
IsProofWriter bool
}

// LoadAndValidate loads values from user supplied input into Config struct and validates them
@@ -20,11 +23,20 @@ func (c *Config) LoadAndValidate() {

flag.Func("i", "Path to input file containing logs to be anonymized", validateInput(c.InputPath))

flag.Func("c", "Path to input file containing custom anonymization mappings", validateInput(c.CustomAnonymizationMappingPath))

flag.Func("o", "Path to output file (default: Stdout)", validateOutput(c.OutputPath))

flag.BoolVar(&c.IsVerbose, "v", false, "Enable verbose logging (default: Disabled)")
flag.BoolVar(&c.IsLmExport, "e", false, "Change input file type to LM export (default: LM Backup)")
flag.BoolVar(&c.IsProofWriter, "p", true, "Disable proof writer (default: Enabled)")

flag.Parse()

// Check if mandatory flags are set
if c.InputPath == "" {
fmt.Println("Error: -i flag is mandatory")
flag.Usage()
os.Exit(1)
}
}