Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question about checking linearizability of a history with a custom model #26

Closed
metanet opened this issue Aug 16, 2018 · 10 comments
Closed

Comments

@metanet
Copy link

metanet commented Aug 16, 2018

Hi there,

I am not sure if I should open this issue to this repo or the jepsen repo. I am giving a shot here and can port the issue to the jepsen repo if requested.

I have a very simple jepsen test taking 20 seconds and injecting no faults in the environment. It is a mutex test for now, but I will use it for testing re-entrant mutex in future once I make this version work.

In my test, i have 5 clients. Each client is acquiring and releasing the same distributed lock object repeatedly.

Here is my generator:

     :generator (->> [{:type :invoke, :f :acquire}
                                      ;(gen/sleep 1)
                                      {:type :invoke, :f :release}
                                      (gen/sleep 1)]
                                     cycle
                                     gen/seq
                                     gen/each
                                     (gen/stagger 1/10))

Here is my custom model:

(defrecord ReentrantMutex [owner lockCount]
  Model
  (step [this op]
    (if (nil? (:value op))
      (knossos.model/inconsistent "no owner!")
      (condp = (:f op)
        :acquire (if (or (nil? owner) (= owner (:value op)))
                   (createReentrantMutex this (:value op) (inc lockCount) op)
                   (do
                     ;(info (str "cannot acquire owner: " owner " lock count: " lockCount " op: " op " this model hash: " (System/identityHashCode this)))
                     (knossos.model/inconsistent "cannot acquire")))
        :release (if (or (nil? owner) (not= owner (:value op)))
                   (do
                     ;(info (str "cannot release owner: " owner " lock count: " lockCount " op: " op " this model hash: " (System/identityHashCode this)))
                     (knossos.model/inconsistent "cannot release"))
                   (if (= lockCount 1)
                     (createReentrantMutex this nil 0 op)
                     (createReentrantMutex this owner (dec lockCount) op))))))

  Object
  (toString [this] (str "owner: " owner ", lockCount: " lockCount)))

(defn createEmptyReentrantMutex []
  (let [
        initial (ReentrantMutex. nil 0)
        ;_ (info (str "initial model: " (System/identityHashCode initial)))
        ]
    initial)
  )

(defn createReentrantMutex [prev owner lockCount op]
  (
    let [
         newModel (ReentrantMutex. owner lockCount)
         ;_ (info (str "owner: " owner ", count: " lockCount ", op: " op " new hash: " (System/identityHashCode newModel) " prev owner: " (:owner prev) " prev lock count: " (:lockCount prev) " prev hash:" (System/identityHashCode prev)))
         ]
    newModel)
  )

My model is the ReentrantMutex defrecord. Initially, it is ReentrantMutex(nil, 0), means nobody holding the lock. For instance, if n1 acquires the lock, it will be ReentrantMutex("n1", 1).

When I run this test for 20 seconds without introducing any failure in the system, I get a history of operations.

I am sharing my history files and the timeline file here.
history.txt
history.edn.txt
timeline.html.txt

Sorry for the file extensions. Please rename "history.edn.txt" to "history.edn" and "timeline.html.txt" to "timeline.html.

In short, the history of operations is this one: N3_ACQ => N3_REL => N2_ACQ => N2_REL => N4_ACQ => N4_REL => N1_ACQ => N1_REL => N5_ACQ => N5_REL => N3_ACQ => N3_REL => N2_ACQ => N2_REL => N4_ACQ => N4_REL => N1_ACQ => N1_REL => N5_ACQ => N5_REL => N3_ACQ => N3_REL => N2_ACQ => N2_REL => N4_ACQ

So the lock is held by N3, N2, N4, N1, N5, N3, N2, N4, N1, N5, N3, N2, N4 in order. You can verify this by checking the history files and the timeline file.

The thing is, when the analyzer starts running with my model, it never terminates (when info logs above are commented out). I cannot say that I fully understand how it works internally but I see that it is walking through possible histories based on concurrency of operations, but each one failing with one of the knossos.model/inconsistent returns in my model. Moreover, as I run it with logs printed, I see states like ReentrantMutex("n2", 555). Probably, I am misunderstanding something, but I don't expect to see such states because acquire and release operations of the same client are sequential. Since I have a short history, I expect the analyzer to complete the verification step in short time. Therefore, I suspect I am doing something wrong in my model, or using the tool in some wrong way.

I have tried many things, but couldn't manage to make this model work. Could you help me with this issue? Any hint or a pointer would be very helpful.

Thanks in advance.

@aphyr
Copy link
Collaborator

aphyr commented Aug 16, 2018

Having two definitions for createReentrantMutex looks a little unusual to me--I think normally this would be (defn createReentrantMutex ([] ...) ([arg1 arg2 arg3] ...)), but if it compiles and you're not getting arity exceptions, maybe it works! This miiiight be due to an infinite state space interacting poorly with model memoization code that runs before the analysis begins; we normally test with finite, compact spaces, and this one is countably infinite, so that miiiight be what's going on. Can you get a thread dump of the process using jstack or similar?

@metanet
Copy link
Author

metanet commented Aug 16, 2018

Thanks for the response.

I know it is not idiomatic clojure :) I am a clojure newbie. The code above works fine. The very first model objects are created with the createEmptyReentrantMutex function and then the rest goes on ReentrantMutex and createReentrantMutex.

Btw, I also run this test for 10 secs, only with a few acquire and releases, but still no luck.

I will get back to you with a thread dump.

@metanet
Copy link
Author

metanet commented Aug 17, 2018

Here is a thread dump for another run of the same test.
dump2.txt

@aphyr
Copy link
Collaborator

aphyr commented Aug 17, 2018 via email

@metanet
Copy link
Author

metanet commented Aug 17, 2018

that would be very helpful. thanks a lot.

looking forward to get the fix.

@metanet
Copy link
Author

metanet commented Aug 17, 2018

I have been cleaning up my code to give you a reproducer. You can run the test via https://github.com/metanet/jepsen/tree/custom-model if you like.
Just cd into the hazelcast directory and hit lein run test --workload lock --time-limit 20.

@aphyr
Copy link
Collaborator

aphyr commented Aug 17, 2018

Right, I've released 0.3.3-SNAPSHOT which aborts memoization for large state spaces. #25, you may also find this of use! Would y'all like to give that a shot and see if it helps?

@metanet
Copy link
Author

metanet commented Aug 19, 2018

thanks a lot. I will re-run it and ping you back.

@metanet
Copy link
Author

metanet commented Aug 19, 2018

I updated the knossos dependency of jepsen to 0.3.4-SNAPSHOT, re-installed jepsen, and run the test, but still encountering the same issue :(

@metanet
Copy link
Author

metanet commented Aug 27, 2018

ok, I have some good news. your explanation about the search space helped me to simplify my model a little bit. My goal is to test reentrancy, such that a client acquires the lock 2 times, and releases it. However, my model turned out to be too loose for this test. So I made the following simple change to reduce the search space and now the analysis step terminates successfully.

Thanks for your help.

(defrecord ReentrantMutex [owner lockCount]
  Model
  (step [this op]
    (if (nil? (:value op))
      (knossos.model/inconsistent "no owner!")
      (condp = (:f op)
        :acquire (if (and (< lockCount 3) (or (nil? owner) (= owner (:value op)))) ;;; <<<=== CHANGED LINE
                   (createReentrantMutex this (:value op) (inc lockCount) op)
                   (do
                     ;(info (str "cannot acquire owner: " owner " lock count: " lockCount " op: " op " this model hash: " (System/identityHashCode this)))
                     (knossos.model/inconsistent "cannot acquire")))
        :release (if (or (nil? owner) (not= owner (:value op)))
                   (do
                     ;(info (str "cannot release owner: " owner " lock count: " lockCount " op: " op " this model hash: " (System/identityHashCode this)))
                     (knossos.model/inconsistent "cannot release"))
                   (if (= lockCount 1)
                     (createReentrantMutex this nil 0 op)
                     (createReentrantMutex this owner (dec lockCount) op))))))

@metanet metanet closed this as completed Aug 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants