Add hashing for verifying correct input of code #72

Open
Chillee opened this issue Apr 25, 2019 · 9 comments

Comments

@Chillee
Collaborator

Chillee commented Apr 25, 2019

See #63 (comment)

I don't think that hashing sections is worth it. MIT hashes sections in 8 snippets: LCT, LinearRecurrence, Simplex.h, Polynomial, CycleCounting, GraphDominator, and both suffix arrays.

I would split that into
"Should be split into different sections": Polynomial, CycleCounting
"trying to avoid hashing the typedef": LinearRecurrence
"Has parts that you don't always want": Both suffix arrays (i.e. you don't always need LCP)
"Not sure": LCT, Simplex, and GraphDominator (I don't know enough about the algorithms to understand whether you pretty much always want all functions)

That's a maximum of 4 snippets where it might be advantageous to have section-wise hashing.

The other argument for hashing sections is that if the hash fails, then you need to look at less of your code. I haven't done many offline contests with a TCR, but from my experience, knowing that you have a mistype in 50 lines of code is only marginally better than knowing you have a mistype in 100 lines of code. Both of these are massively better than not knowing whether you have a mistype or a logic error.

If we were to hash by section, I would propose having some kind of lightweight syntax (like a //<-- ) to demarcate sections, and then putting the hashes (truncated to 5 characters) in the header.

Like so:
image
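A minimal sketch of what section-wise hashing could look like, assuming the `//<--` marker proposed above. The names `kMarker`, `shortHash`, and `sectionHashes` are hypothetical, and 64-bit FNV-1a is used only as a stand-in so the example stays self-contained; it is not the md5 that the actual hash script uses.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical section marker, taken from the "//<--" suggestion.
const std::string kMarker = "//<--";

// Stand-in hash: 64-bit FNV-1a over the non-whitespace characters,
// rendered as hex and truncated to `len` characters. NOT md5.
std::string shortHash(const std::string& text, int len = 5) {
    assert(0 < len && len <= 16);
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : text) {
        if (std::isspace(c)) continue;  // whitespace never affects the hash
        h ^= c;
        h *= 1099511628211ULL;
    }
    static const char* digits = "0123456789abcdef";
    std::string hex(16, '0');
    for (int i = 15; i >= 0; --i) { hex[i] = digits[h & 15]; h >>= 4; }
    return hex.substr(0, len);
}

// Splits the source at every marker and returns one short hash per section.
std::vector<std::string> sectionHashes(const std::string& source) {
    std::vector<std::string> hashes;
    std::size_t start = 0;
    while (true) {
        std::size_t pos = source.find(kMarker, start);
        std::size_t count = (pos == std::string::npos) ? pos : pos - start;
        hashes.push_back(shortHash(source.substr(start, count)));
        if (pos == std::string::npos) break;
        start = pos + kMarker.size();
    }
    return hashes;
}
```

A file with k markers yields k+1 hashes, which could then be printed in the header in section order.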

Another question with hashing is how we deal with things like typedefs, especially typedefs that are likely to be typed multiple times (for example, typedef vector<ll> Poly). I don't think it's a big deal; I would suggest just getting used to typing them in for the purpose of hashing.

My biggest problem with avoiding them automatically is ambiguity with what hashes represent. "We hash everything that's printed" is obvious. "We hash everything after the typedefs" is less obvious.

@simonlindholm
Member

There are actually a fair number of cases where you might/will type in only parts of the code: Treap, FastSubsetTransform (that one's weird), euclid, chinese, 2sat, TreePower, HLD (on the chopping block), sideOf, Angle, KMP, SuffixTree, Hashing, AhoCorasick, IntervalContainer. And in several more I can imagine that the 100->50 line reduction is handy. So if we could come up with some slick UI for indicating sections I'd be all for it. I agree with your comment about ambiguity, though, and I think we can start simple.

@ecnerwala

Just a note: I updated the hash script in our book to include the -dD flag, which preserves macro definitions. It's now cpp -dD -P -fpreprocessed | tr -d '[:space:]' | md5sum -
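To illustrate roughly what that pipeline feeds into md5sum, here is a small sketch that approximates the normalization step: comments are dropped (as `cpp -fpreprocessed` does) and all whitespace is removed (as `tr -d '[:space:]'` does). The function name `normalize` is hypothetical, and as a simplifying assumption the sketch does not treat comment markers inside string literals specially, which the real preprocessor does.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Approximates `cpp -dD -P -fpreprocessed | tr -d '[:space:]'`:
// strips // and /* */ comments, then every whitespace character.
// Assumption: no "//" or "/*" appears inside a string literal.
std::string normalize(const std::string& src) {
    std::string out;
    std::size_t i = 0, n = src.size();
    while (i < n) {
        if (src.compare(i, 2, "//") == 0) {
            // Line comment: skip up to (not including) the newline.
            while (i < n && src[i] != '\n') ++i;
        } else if (src.compare(i, 2, "/*") == 0) {
            // Block comment: skip past the closing "*/", or to the end.
            std::size_t end = src.find("*/", i + 2);
            i = (end == std::string::npos) ? n : end + 2;
        } else {
            if (!std::isspace(static_cast<unsigned char>(src[i]))) out += src[i];
            ++i;
        }
    }
    assert(out.size() <= src.size());  // normalization only removes characters
    return out;
}
```

Because only the normalized text is hashed, differences in indentation, line breaks, or comments do not change the hash, while `#define` lines (kept by `-dD`) do.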

@ecnerwala

> The other argument for hashing sections is that if the hash fails, then you need to look at less of your code. I haven't done many offline contests with a TCR, but from my experience, knowing that you have a mistype in 50 lines of code is only marginally better than knowing you have a mistype in 100 lines of code. Both of these are massively better than not knowing whether you have a mistype or a logic error.

I think knowing you have a mistype in 50 vs 100 lines of code is actually linearly (~2x) better for finding the bug, which amounts to maybe 5 minutes of time (and feeling a lot happier).

@ecnerwala

ecnerwala commented Apr 26, 2019

Also, I'll note that we would've hashed sections in more files if we used them more/weren't too lazy to add the annotations. Honestly, we mostly used kactl for the stuff we added (which we broke into sections) and the geometry (which is short to begin with).

@simonlindholm
Member

Thanks for the note, I've made that change: dcdc34a (note also the golfed vimrc: ca Hash w !cpp -dD -P -fpreprocessed \| tr -d '[:space:]' \| md5sum \| cut -c-6)

@lrvideckis
Contributor

Hi, I want to propose an idea for "partial hashes", which was communicated to me by https://codeforces.com/profile/camc

Let's say you want a struct:

struct LCA {
...
	LCA(vector<vi>& C) : time(sz(C)), rmq((dfs(C,0,-1), ret)) {}
	void dfs(vector<vi>& C, int v, int par) {
...
	}

	int lca(int a, int b) {
		if (a == b) return a;
		tie(a, b) = minmax(time[a], time[b]);
		return path[rmq.query(a, b)];
	}
	int dist(int a, int b) {return depth[a] + depth[b] - 2*depth[lca(a,b)];}
	bool inSubtree(int a, int b) {return time[a] <= time[b] && time[b] < timeOut[a];}
	int nodeOnPath(int u, int v, int w) {...}
...
};

You can split it up like this:
LCA.h:

struct LCA {
...
	LCA(vector<vi>& C) : time(sz(C)), rmq((dfs(C,0,-1), ret)) {}
	void dfs(vector<vi>& C, int v, int par) {
...
	}
#include "lcaFunc.h"
#include "dist.h"
#include "inSubtree.h"
#include "nodeOnPath.h"
};

lcaFunc.h:

#pragma once
	int lca(int a, int b) {
		if (a == b) return a;
		tie(a, b) = minmax(time[a], time[b]);
		return path[rmq.query(a, b)];
	}

dist.h:

#pragma once
	int dist(int a, int b) {return depth[a] + depth[b] - 2*depth[lca(a,b)];}

inSubtree.h:

#pragma once
	bool inSubtree(int a, int b) {return time[a] <= time[b] && time[b] < timeOut[a];}

... etc


Now each member function is in its own file, and thus has its own hash. Furthermore, you type exactly what you need: if you only need the lca function, you type only that, verify its hash, then copy it into the struct.

If you need lca, dist, and inSubtree, you type all three, verify all their hashes, then copy them into the struct.

Furthermore, the include statements tell you exactly where to put the member functions.

@lrvideckis
Contributor

Now, you don't want to force the user to type those include statements, so when I generate the .pdf, I have a script strip them before hashing:

contest/hash.sh:

tr -d '[:space:]' | md5sum | cut -c-6

generate_pdf.sh:

shopt -s globstar
for header in ../content/**/*.h; do
	hash=$(sed '/^#include/d' "$header" | cpp -dD -P -fpreprocessed | ./../contest/hash.sh)
	sed --in-place "1i //hash: $hash" "$header"
done
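As a rough illustration of the `sed '/^#include/d'` step, the sketch below removes every line that starts with `#include` before the text would be hashed. The function name `dropIncludeLines` is hypothetical. Like the sed pattern, it only matches `#include` at the very start of a line, so an indented include would be kept.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Mirrors `sed '/^#include/d'`: deletes lines beginning with "#include".
// Every kept line is re-emitted with a trailing newline.
std::string dropIncludeLines(const std::string& header) {
    std::istringstream in(header);
    std::string line, out;
    while (std::getline(in, line)) {
        if (line.rfind("#include", 0) == 0) continue;  // line starts with #include
        out += line;
        out += '\n';
    }
    // At most one newline is added (when the input lacks a final one).
    assert(out.size() <= header.size() + 1);
    return out;
}
```

With this step, the hash printed for LCA.h covers only the code the user actually types into the struct, not the include lines that mark where the optional member functions go.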

@lrvideckis
Contributor

Furthermore, if you use something like the expander script for Codeforces rounds, where you can copy-paste, this method should still work.

@lrvideckis
Contributor

lrvideckis commented Mar 7, 2024

For example, you can split off the Fenwick tree lower bound https://github.com/kth-competitive-programming/kactl/blob/main/content/data-structures/FenwickTree.h#L24 as you rarely need that function.

Another example is this:

https://github.com/kth-competitive-programming/kactl/blob/main/content/graph/CompressTree.h#L18

There you pass in LCA& lca as a parameter. Instead, you could make compressTree a member function of LCA and split up the files using this trick. Then there is no need to pass lca as a parameter, and the call syntax changes from lca.lca(a, b) to lca(a, b).
