Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows support WIP #3448

Merged
merged 5 commits into from
Dec 28, 2024
Merged

Windows support WIP #3448

merged 5 commits into from
Dec 28, 2024

Conversation

Aexylus
Copy link
Contributor

@Aexylus Aexylus commented Dec 22, 2024

WSL turned out to be a bit annoying so I'm trying to fix the Windows issues. Very WIP, more like exploration.

Fixing relativePath slashes and command spawns fixes the zero-build-schema and zero-cache-dev not found errors.

Both these changes seem to work fine in WSL.

Now it gets stuck here:
I'm guessing it might be another path issue somewhere, but the workers code is a bit more complex.

worker=main pid=33812 waiting for zero-cache to be ready ...
worker=main pid=33812 component=life-cycle user-facing worker 17140 exited with code (1)
worker=main pid=33812 component=life-cycle exiting with code 1
worker=main pid=33812 component=life-cycle all user-facing workers drained (1734910723494 ms)
worker=main pid=33812 component=life-cycle exiting with code 0

But I'm a bit stuck what to look for next. Will try more when I have some more time. I'd appreciate any insight.

Copy link

vercel bot commented Dec 22, 2024

@Aexylus is attempting to deploy a commit to the Rocicorp Team on Vercel.

A member of the Team first needs to authorize it.

@tantaman
Copy link
Contributor

tantaman commented Dec 23, 2024

thanks for taking a stab at this. Those errors are likely coming from inside the syncer workers we spawn. Their entrypoint is here:

export default async function runWorker(

There is a reference to a file path for opening the replica db:

const replicaFile = replicaFileName(config.replicaFile, fileMode);

but presumably that is correct since it is coming from your config.

Do your worker processes die before hitting any of the log lines in syncer?

@tantaman
Copy link
Contributor

If you run under git-bash on windows you get a useful error message about not being able to find server/main.js.

Looks like there is a bug in converting the worker path to an absolute path as the conversion results in a double C:\C:\ on windows.

Here's the offending code:

export function childWorker(

@tantaman
Copy link
Contributor

tantaman commented Dec 23, 2024

I think just swapping it to import via url rather than absolute path will fix it.

So:

const absModulePath = new URL(`../${modulePath}`, import.meta.url).pathname;

needs to be

const moduleUrl = new URL(`../${modulePath}`, import.meta.url);

@Aexylus
Copy link
Contributor Author

Aexylus commented Dec 23, 2024

I'm starting to lose my mind a lil bit :D

If I'm not mistaken the zero-cache-dev builds the schema and spawns

zeroCacheProcess = spawn(zeroCacheScript, zeroCacheArgs || [], {

the file below and waits for workers to be ready

worker: childWorker('./server/main.ts', {

In here it spawns the change-streamer and gets to around this line before I think the server/multi/main.ts ProcessManager receives the close event.

await changeStreamerReady;

So before await changeStreamerReady; finishes something somewhere dies or gets killed. But I can't find any place where any error happens. And if I add sleep() it dies in different places. So something else has to be running. The change-streamer can't be it because it doesn't even start. So am I missing any other process running other than server/multi/main.ts and server/main.ts?

Basically I can't find anything failing except the server/main.ts user-facing worker keeps exiting.

PS: I already fixed the path issue in childWorker this might work in both places but I haven't tested on unix yet

let absModulePath = fileURLToPath(new URL(`../${modulePath}`, import.meta.url));

This issue I think nodejs/node#37845

@tantaman
Copy link
Contributor

tantaman commented Dec 23, 2024

This works on windows for me: #3450

can you try it on your end?


"works" as in it starts but I haven't tested replication. Looks like you're further than me reading your latest message. Adding postgres to the mix now to see if it replicates too.

@Aexylus
Copy link
Contributor Author

Aexylus commented Dec 23, 2024

Yeah, that works as well.

Yep, I'm trying to make this works because turns out WSL kinda sucks. Twice in couple hours either vscode suddenly couldn't access newly created file. Or TS just died for a single file. And you have restart WSL to fix it.

@tantaman
Copy link
Contributor

tantaman commented Dec 23, 2024

The changes I added in #3450 worked for me for getting replication working end to end.

I did have a few problems with ZERO_REPLICA_FILE. I ended up having to make it a bare ZERO_REPLICA_FILE = "replica.db" for it to work, no slashes anywhere. So there's something going wrong there.

I get websocket errors when trying to connect from the browser though. Still need to figure that out.

image

@Aexylus
Copy link
Contributor Author

Aexylus commented Dec 23, 2024

The issue with the replica file is AFAIK that it doesn't create the path if the folder doesn't exist. Also /tmp/ isn't valid on Windows probably.

It's interesting that works for you because I still have the same issue. I'm editing the node_modules files in the hello example to make it easier. So I cloned the repo again and made only the changes in the PR and I still get the same error. That is no error just exit.

When I was trying to make it work on WSL yesterday every time I forgot db path or the replica file was wrong it failed with a relevant error. But here nothing.


It's the same even in clean Windows 11 VM except the exit code is different

worker=main pid=12324 starting task 1vl42dR35BByVIXdf-44m
worker=main pid=12324 waiting for zero-cache to be ready ...
worker=main pid=12324 component=life-cycle user-facing worker 12460 exited with code (4294967295)
worker=main pid=12324 component=life-cycle exiting with code 4294967295

path.relative(dirname, path.dirname(absoluteConfigPath)),
path.basename(absoluteConfigPath),
);

relativePath = relativePath.replace(/\\/g, "/");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why'd you need this change? I would have thought that the slashes were correct due to the use of path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what gets passed into the tsImport

..\..\..\..\..\..\..\..\..\apps\app\schema
file:///C:/zero/node_modules/.pnpm/@[email protected]/node_modules/@rocicorp/zero/out/zero-schema/src/build-schema.js

I can't find it now in the tsx codebase. But I'm guessing it can't deal with the mismatched slashes. I kinda gave up some time ago trying to understand how node handles paths and the differences between Windows and Linux/OSX

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment on the line. Something like:

// tsImport doesn't expect to receive slashes in the Windows format when running
// on windows. They need to be converted to *nix format.

@cbnsndwch
Copy link
Contributor

cbnsndwch commented Dec 24, 2024

Stopping by to say that @aboodman and I figured out where the crash is coming from on Discord

The ZERO_UPSTREAM_MAX_CONNS var defaults to 20 if not specified, while ZERO_NUM_SYNC_WORKERS defaults to $CPUS_AVAILABLE - 1.

On a system with more than 20 cores this leads to an insufficient upstream connections error.

Because the fork calls in zero-cache/src/types/processes don't propagate stdout and stderr from workers this error is then lost and the server fails silently.

I added optional piping of stdout and stderr (left it commented out) for when something like this happens again in the future.

PR #3452

@Aexylus
Copy link
Contributor Author

Aexylus commented Dec 24, 2024

The dying out of nowhere is because as you said the child errors don't propagate. Which is weird because in WSL they do. Simple throw shows the error message but nothing on Windows.

Another weird thing is that uncommenting

silent: true,

in the childWorker fork fixed it for me. Literally works, comment again, doesn't work, uncomment works.

But the issue with that is that the child processes don't get killed upon exit. So after stopping the zero-cache in terminal you have to kill all the node processes manually.

@cbnsndwch
Copy link
Contributor

The process might be dying because with the default setting (silent: false) child errors need to be handled explicitly if they're caught in the parent

E.g.: during initialization there's a Promise.all on the promises of all workers, and a timeout around that, so that if one of them throws and the exception is unhandled the entire thing dies

Are you able to replicate the issue reliably?

@Aexylus
Copy link
Contributor Author

Aexylus commented Dec 24, 2024

I guess so. I made clean repo, made the changes and enabled the child log. And without silent it fails the same way as before. But with it it works fine. (Or at least gets further and doesn't die, didn't fully test it's okay on the hello repo. But it works on my other repo)

worker=main pid=45156 starting task MJiyAtS01OsCC-lYd9U0K
worker=main pid=45156 waiting for zero-cache to be ready ...
worker=main pid=45156 component=process-manager user-facing worker 45384 exited with code (1)
worker=main pid=45156 component=process-manager all user-facing workers exited
worker=main pid=45156 component=process-manager exiting with code 0
worker=main pid=45156 component=process-manager stopping process-manager
> zero-cache exited. Exiting.

path.relative(dirname, path.dirname(absoluteConfigPath)),
path.basename(absoluteConfigPath),
);

relativePath = relativePath.replace(/\\/g, "/");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment on the line. Something like:

// tsImport doesn't expect to receive slashes in the Windows format when running
// on windows. They need to be converted to *nix format.

@Aexylus
Copy link
Contributor Author

Aexylus commented Dec 27, 2024

I think I traced the issue to a ... console.log in your logger.

Error: EBADF: bad file descriptor, write
    at writeSync (node:fs:925:3)
    at SyncWriteStream._write (node:internal/fs/sync_write_stream:27:5)
    at writeOrBuffer (node:internal/streams/writable:570:12)
    at _write (node:internal/streams/writable:499:10)
    at Writable.write (node:internal/streams/writable:508:10)
    at console.value (node:internal/console/constructor:291:16)
    at console.log (node:internal/console/constructor:374:26)
    at Object.log (file:///F:/zero/node_modules/@rocicorp/logger/out/logger.js:56:12)
    at LogContext.info (file:///F:/zero/node_modules/@rocicorp/logger/out/logger.js:23:54)
    at runSchemaMigrations (file:///F:/zero/node_modules/@rocicorp/zero/out/zero-cache/src/db/migration.js:49:19)

With the silent:true everything seems to work. If I remove it then it crashes. And if I then comment the console.log here, it all works again.

export const consoleLogSink = {
    log(level, context, ...args) {
        console[level](...stringified(context), ...args.map(normalizeArgument));
    },
};

Also apparently it's not this specific console.log but a console.log at some point. As I see at least two more different locations where it happens.

@tantaman tantaman enabled auto-merge (rebase) December 28, 2024 02:27
Copy link

vercel bot commented Dec 28, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
replicache-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Dec 28, 2024 2:29am
zbugs ✅ Ready (Inspect) Visit Preview Dec 28, 2024 2:29am

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants