
Swarm switching not clearing obsolete messages #470

Open
jagerman opened this issue Feb 4, 2023 · 12 comments
Labels
bug Something isn't working

Comments

@jagerman
Member

jagerman commented Feb 4, 2023

(Unconfirmed) Reportedly when a node switches swarms the number of stored messages roughly doubles (then slowly goes back to normal after 14 days). The new incoming messages make sense (we sync on swarm change), but the doubling suggests that we aren't clearing messages that belong to the old swarm, and we should be; we can't retrieve such messages because we'll give a "wrong swarm" response if the owning user tries to fetch from us, so this is just dead data.

@jagerman jagerman added the bug Something isn't working label Feb 4, 2023
@jagerman jagerman changed the title Swarm switching Swarm switching not clearing obsolete messages Feb 4, 2023
@ghost

ghost commented Feb 4, 2023

I hate writing regex, so I'm posting a copy of my quick-and-dirty regex script in case it helps anyone extract parameters from the storage stats log into a CSV file:

(To be honest, most of the regexes were written or drafted by ChatGPT.)

import os
import re

def get_date(l):
    date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
    match = re.search(date_pattern, l)
    if not match:
        return None
    return match.group(0)

def get_sw_id(l):
    sw_pattern = re.compile(r'sw=....…fff')  # This will miss the edge case of sw=NONE
    match = re.search(sw_pattern, l)
    if not match:
        return None
    return match.group(0)

def get_msgs_size_users(l):
    # e.g. "108712 msgs (118MB) for 1011 users"
    extraction_pattern = re.compile(r'(\d+)\s+msgs\s+\(\s*(\d+(?:\.\d+)?[kM]B)\)\s+for\s+(\d+)\s+users')
    match = re.search(extraction_pattern, l)
    return match.groups()

def get_requests(l):
    # e.g. "reqs(S/R/O/P): 112/2416/7446/53"
    extraction_pattern = re.compile(r'reqs\(S/R/O/P\): (\d+)/(\d+)/(\d+)/(\d+)')
    match = re.search(extraction_pattern, l)
    return match.groups()

def get_window_size(l):
    # e.g. "(last 70.0min)" or "(last 145ms)"
    extraction_pattern = re.compile(r'\(last (\d+(?:\.\d+)?[a-z]{1,3})\)')
    match = re.search(extraction_pattern, l)
    return match.groups()

def extract_csv_from_logs(pubkey_prefix, logs):
    for l in logs:
        date = get_date(l)
        sw_id = get_sw_id(l)
        if not sw_id:
            continue
        msgs, size, users = get_msgs_size_users(l)
        s, r, o, p = get_requests(l)
        window_size, = get_window_size(l)
        print(pubkey_prefix, date, sw_id, msgs, size, users, s, r, o, p, window_size, sep=',')


directory = "logs_per_day"
# 'logs_per_day/oxen_storage_server_stats_****************************************************************.txt'

print("public_key", "date", "swarm_id", "msgs", "size", "users", "s", "r", "o", "p", "window_size", sep=',')
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        filepath = os.path.join(directory, filename)
        with open(filepath, 'r') as f:
            logs = f.readlines()
            pubkey_prefix = filename[26:34]
            extract_csv_from_logs(pubkey_prefix, logs)

@ghost

ghost commented Feb 4, 2023

Updated: examples of the hourly log showing the duplication; each row represents one hour.

Example 1: nearly triple the user count

sw=XXXXXXXX(n=6); 108712 msgs (118MB) for 1011 users; reqs(S/R/O/P): 1830/2589/15787/566 (last 70.0min)
sw=XXXXXXXX(n=6); 108947 msgs (118MB) for 1016 users; reqs(S/R/O/P): 1114/2952/26752/454 (last 70.0min)
sw=YYYYYYYY(n=4); 323872 msgs (242MB) for 2934 users; reqs(S/R/O/P): 787/3047/22710/163 (last 70.0min)
sw=YYYYYYYY(n=5); 323742 msgs (242MB) for 2936 users; reqs(S/R/O/P): 1046/2521/17319/142 (last 70.0min)

Example 2: nearly double the user count

sw=XXXXXXXX(n=7); 194875 msgs (262MB) for 1269 users; reqs(S/R/O/P): 0/0/0/0 (last 145ms)
sw=NONE; 194821 msgs (262MB) for 1269 users; reqs(S/R/O/P): 0/0/0/0 (last 98.1ms)
sw=NONE; 194633 msgs (262MB) for 1269 users; reqs(S/R/O/P): 0/0/0/0 (last 75.5ms)
sw=YYYYYYYY(n=7); 307953 msgs (262MB) for 2203 users; reqs(S/R/O/P): 0/0/0/0 (last 90.1ms)

@ghost

ghost commented Feb 4, 2023

Side note: I also have anecdotal evidence of a few freshly registered nodes having roughly double the usual user count at the beginning, then gradually falling back to normal after 14 days.

@ghost

ghost commented Apr 2, 2023

(Unconfirmed) Reportedly when a node switches swarms the number of stored messages roughly doubles (then slowly goes back to normal after 14 days). The new incoming messages make sense (we sync on swarm change), but the doubling suggests that we aren't clearing messages that belong to the old swarm, and we should be; we can't retrieve such messages because we'll give a "wrong swarm" response if the owning user tries to fetch from us, so this is just dead data.

None of the tables (messages, owned_messages, owners, revoked_subkeys) has a column recording the old swarm ID, and there is no table mapping owners to swarm IDs in the database, so I'm fairly sure messages belonging to the old swarm aren't cleared when the storage server switches to a new swarm.

In theory the storage server could clean up messages belonging to the old swarm before downloading messages belonging to the new swarm, so the argument above isn't conclusive on its own. I briefly checked the code, though, and couldn't find any evidence that the storage server does any such cleanup.
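For illustration, here's a rough sketch of what such a cleanup pass could look like, reusing the XOR-fold swarm-position function from the analysis scripts in this thread. Note the table and column names (`owners.id`, `owners.pubkey`, `messages.owner`) are my guesses based on the table names mentioned above, not the real storage server schema, and `purge_obsolete_owners` is a hypothetical helper:

```python
import sqlite3
import struct

def pubkey_to_swarm_space_position(pk: bytes) -> int:
    """XOR-fold the four 64-bit words of the pubkey, as in the
    analysis scripts elsewhere in this thread."""
    res = 0
    for i in range(4):
        word, = struct.unpack('Q', pk[i*8:(i+1)*8])
        res ^= word
    return struct.unpack('!Q', struct.pack('Q', res))[0]

def purge_obsolete_owners(db_path: str, belongs_to_current_swarm) -> int:
    """Delete owners (and their messages) whose swarm-space position fails
    the belongs_to_current_swarm predicate; returns how many were removed.
    Schema names here are assumptions, not the actual storage.db layout."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT id, pubkey FROM owners").fetchall()
        stale = [rid for rid, pk in rows
                 if not belongs_to_current_swarm(pubkey_to_swarm_space_position(pk))]
        for rid in stale:
            conn.execute("DELETE FROM messages WHERE owner = ?", (rid,))
            conn.execute("DELETE FROM owners WHERE id = ?", (rid,))
        conn.commit()
    finally:
        conn.close()
    return len(stale)
```

A pass like this could run once at swarm-switch time, before syncing the new swarm's messages.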

@ghost

ghost commented Jun 26, 2023

"""
This script connects to a SQLite database, reads data into a DataFrame, converts a column of public keys
into their corresponding swarm positions, and creates a scatter plot with custom hexadecimal labels and hover text.
"""

import logging
import sqlite3
import struct

import numpy as np
import pandas as pd
import plotly.graph_objects as go

STORAGE_PATH = 'storage.db'
CSV_PATH = "swarm_position_hex.csv"

def pubkey_to_swarm_space_position(pk: str) -> int:
    """
    Convert public key to position in swarm space.

    Parameters:
    pk (str): Public key in hex format

    Returns:
    int: The position in swarm space
    """
    if len(pk) != 64:
        logging.error('Incorrect public key length')
        return
    res = 0
    bytes_pk = bytes.fromhex(pk)
    for i in range(4):
        buf, = struct.unpack('Q', bytes_pk[i*8:(i+1)*8])
        res ^= buf
    return struct.unpack('!Q', struct.pack('Q', res))[0]

# Connect to the SQLite database
conn = sqlite3.connect(STORAGE_PATH)

# Read the "owners" table into a DataFrame
df = pd.read_sql_query("SELECT hex(pubkey) as pubkey FROM owners", conn)

# Don't forget to close the connection
conn.close()

df['swarm_position'] = df['pubkey'].apply(pubkey_to_swarm_space_position)
df['dummy_label'] = 1

# Convert 'swarm_position' to hexadecimal and save it as a new column
df['swarm_position_hex'] = df['swarm_position'].apply(hex)

df['swarm_position_hex'].sort_values().reset_index(drop=True).to_csv(CSV_PATH, index=False)

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=df['swarm_position'],
    y=df['dummy_label'],
    mode='markers',
    text=df['swarm_position_hex'],
    hovertemplate='Swarm Position (Hex): %{text}<br>Label: %{y}',
    name='Session account swarm positions'
))

# Calculate step based on the data range for x-axis labels
total_swarm_space = 2 ** 64
step = total_swarm_space // 32

# Create custom x-axis labels
x_tickvals = np.arange(0, total_swarm_space, step=step)
x_ticktext = [hex(int(i)) for i in x_tickvals]

fig.update_layout(
    xaxis=dict(
        tickmode='array',
        tickvals=x_tickvals,
        ticktext=x_ticktext
    ),
    yaxis_title='Label',
    xaxis_title='Swarm Position'
)

# Show the plot
fig.show()

@ghost

ghost commented Jun 26, 2023

I've created a script to illustrate my comment at #470 (comment). (Written by ChatGPT; ignore some silly comments in the Python code.)

@jagerman, the next time you start a fresh OPTF node, you could try running this script against storage.db during the first 14 days. It converts all Session IDs in the database into their corresponding swarm partition space IDs (64-bit integers), saves them to a CSV file, and plots them on a chart. In my case, I observed that these swarm partition space IDs are clearly distributed in two clusters. I manually verified that one cluster matches the current swarm ID, while the other does not.

[chart: extra-swarm-space, showing swarm positions distributed in two clusters]

Upon examining the CSV file, I found that the "correct cluster" had 1859 IDs on the first day, while the "incorrect cluster" had 1471. Over the following 14 days, the number of IDs in the "correct cluster" remained within a stable range, whereas the number in the "incorrect cluster" decreased to zero.

Note that this case is not a swarm switch: it's a brand-new node that starts with zero messages and suddenly owns messages associated with two swarm IDs. I wonder if there is a race condition causing two competing swarms to store messages to the current swarm at the same time.

@ghost

ghost commented Jun 29, 2023

Another hypothesis:

Those unexpected messages, presumed to belong to another swarm but surprisingly observed on our new node, might themselves be a consequence of the "swarm switching not clearing obsolete messages" issue.

This could happen if a neighboring node in the same swarm happened to undergo a swarm switch within the last 14 days and hit the "swarm switching not clearing obsolete messages" bug. That neighbor might then unintentionally transmit the obsolete messages to our new node during its initialization, leaving our node contaminated with messages from a different swarm even though the initialization process itself was legitimate.

I don't have any evidence to either prove or disprove this hypothesis.

@jagerman
Member Author

Swarm membership is (overly) complex, which unfortunately means it's also hard to track which nodes were in, and moved between, swarms without essentially replaying the entire blockchain.

When a node joins a swarm there should only be a one-way transfer of messages from the existing swarm members to the new member, not the other way around.

It is, perhaps, possible that a new node joined a swarm and received stale messages from another swarm member who transferred into that swarm recently (and so has a hangover of messages it didn't clean up). It should probably be rejecting those (since they are for the wrong swarm), but quite possibly isn't.

Is there an observable pattern to the expiry times of messages here? E.g. if the invalid swarm ids stopped a few days prior to your node joining this swarm that would suggest this is what is happening.
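The rejection check described above could look something like this sketch, which borrows the swarm-position function and the `MAX_SWARM_WIDTH` distance heuristic from the charting script later in this thread; `accept_synced_message` and `filter_sync_batch` are hypothetical names for illustration, not storage server APIs:

```python
import struct

# Half-width threshold taken from the is_illegal_immigrant heuristic in the
# charting script later in this thread; the real swarm boundaries differ.
MAX_SWARM_WIDTH = 0x100000000000000

def pubkey_to_swarm_space_position(pk_hex: str) -> int:
    """Same XOR-fold used in the analysis scripts in this thread."""
    b = bytes.fromhex(pk_hex)
    res = 0
    for i in range(4):
        word, = struct.unpack('Q', b[i*8:(i+1)*8])
        res ^= word
    return struct.unpack('!Q', struct.pack('Q', res))[0]

def accept_synced_message(owner_pk_hex: str, my_swarm_id: int) -> bool:
    """Hypothetical guard: accept a synced message only if its owner maps
    near our swarm's region of swarm space."""
    pos = pubkey_to_swarm_space_position(owner_pk_hex)
    return abs(pos - my_swarm_id) <= MAX_SWARM_WIDTH // 2

def filter_sync_batch(messages, my_swarm_id):
    """Drop wrong-swarm messages from an incoming sync batch.
    `messages` is an iterable of (owner_pk_hex, payload) pairs."""
    return [(pk, payload) for pk, payload in messages
            if accept_synced_message(pk, my_swarm_id)]
```

With a filter like this on the receiving side, a peer's leftover messages from its old swarm would be dropped instead of inherited.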

@ghost

ghost commented Jun 29, 2023

It is, perhaps, possible that a new node joined a swarm and received stale messages from another swarm member who transferred into that swarm recently (and so has a hangover of messages it didn't clean up). It should probably be rejecting those (since they are for the wrong swarm), but quite possibly isn't.

This is what I meant in my previous comment; maybe I didn't explain it clearly, but yes, I feel this is very likely.

Is there an observable pattern to the expiry times of messages here? E.g. if the invalid swarm ids stopped a few days prior to your node joining this swarm that would suggest this is what is happening.

I'll take a look and report back.

@ghost

ghost commented Jul 1, 2023

[histogram: "illegal immigrant" vs "legal resident" messages per hour]

Is there an observable pattern to the expiry times of messages here? E.g. if the invalid swarm ids stopped a few days prior to your node joining this swarm that would suggest this is what is happening.

Yes, you are right, there is a clear pattern.

I generated a histogram using the data from the first day of a brand-new node.

In this new node there are both invalid IDs, labeled as "illegal immigrants," and valid IDs, or "legal residents."

Each column in the histogram denotes an hour, for a total of 14 * 24 = 336 hours. The height of each column is the number of messages within that hour, revealing a daily pattern.

What's interesting is the absence of "illegal immigrant" messages for the first few days.
This suggests a gap of several days between:
(a) when a member of a different swarm migrated into the current swarm without cleaning up its obsolete messages, and
(b) when our node joined the same swarm and inherited both the valid and the invalid IDs.

This also explains why the number of invalid IDs (1471) and valid IDs (1859) do not match.

@ghost

ghost commented Jul 1, 2023

This is the code to draw the chart:

import sqlite3
import pandas as pd
import plotly.graph_objects as go
import struct

MY_SWARM_ID = <redacted>
MAX_SWARM_WIDTH = 0x100000000000000

def pubkey_to_swarm_space_position(pk: str) -> int:
    """Convert pubkey to position in swarm space."""
    if len(pk) != 64:
        print('Incorrect pubkey length')
        return
    res = 0
    bytes_pk = bytes.fromhex(pk)
    for i in range(4):
        buf, = struct.unpack('Q', bytes_pk[i*8:(i+1)*8])
        res ^= buf
    return struct.unpack('!Q', struct.pack('Q', res))[0]

def is_illegal_immigrant(swarm_position):
    return abs(swarm_position - MY_SWARM_ID) > MAX_SWARM_WIDTH / 2

# Connect to the SQLite database
conn = sqlite3.connect('storage.db')

# Read the "owned_messages" table into a DataFrame
df = pd.read_sql_query("SELECT hex(pubkey) as pubkey, timestamp FROM owned_messages", conn)

df['swarm_position'] = df['pubkey'].apply(pubkey_to_swarm_space_position)
df['illegal_immigrant'] = df['swarm_position'].apply(is_illegal_immigrant)

# Convert timestamp to hour duration since min(timestamp)
df['timestamp_hour'] = (df['timestamp'] - df['timestamp'].min()) / (1000 * 60 * 60)

# Create histograms for the two types of immigrants
fig = go.Figure()

# Add traces for legal residents and illegal immigrants
fig.add_trace(go.Histogram(x=df[df['illegal_immigrant'] == 0]['timestamp_hour'], name='Legal Resident'))
fig.add_trace(go.Histogram(x=df[df['illegal_immigrant'] == 1]['timestamp_hour'], name='Illegal Immigrant'))

# Update layout for better visualization
fig.update_layout(barmode='overlay')

# Reduce opacity to see both histograms
fig.update_traces(opacity=0.75)

fig.show()

@ghost

ghost commented Aug 8, 2023

When a node joins a swarm there should only be a one-way transfer of messages from the existing swarm members to the new member, not the other way around.

Actually, my long-term node recently onboarded 1344 illegal immigrants (I manually verified their Session IDs with the partitioning algorithm), increasing from 1965 to 3309 users in a single hour, despite my swarm ID not having changed.
