Dyadic attributes and graph manipulation #567

Open
krivit opened this issue Jul 2, 2024 · 6 comments

@krivit
Member

krivit commented Jul 2, 2024

The problem:

library(ergm)
data(florentine)
## Business as dyad covariate for marriage
flomarriage %n% "business" <- as.matrix(flobusiness)
summary(flomarriage~edgecov("business"))
#> edgecov.business 
#>                8
## Consider a subgraph excluding nodes 1 and 2.  Because the node IDs
## have shifted, the covariate matrix is now shifted relative to the
## sociomatrix:
summary(flomarriage~S(~edgecov("business"), ~3:16))
#> S(3:16)~edgecov.business 
#>                        1
## Correct value: the sum below counts each undirected tie twice, so
## 16 corresponds to a statistic of 8.
sum((as.matrix(flomarriage) & as.matrix(flobusiness))[3:16,3:16])
#> [1] 16
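
The mismatch can be illustrated without ergm. The base-R sketch below uses a made-up 4-node undirected example (not the Florentine data) and assumes, purely for illustration, that the misaligned computation pairs the subgraph's renumbered vertices 1..k with the first k rows and columns of the full covariate matrix; the actual indexing inside S() may differ.

```r
# Toy undirected network on 4 nodes (hypothetical data, not flomarriage).
y <- matrix(0, 4, 4)
y[1, 2] <- y[2, 1] <- 1          # tie 1-2
y[3, 4] <- y[4, 3] <- 1          # tie 3-4

# Dyadic covariate on the same 4 nodes, nonzero only on dyad 3-4.
x <- matrix(0, 4, 4)
x[3, 4] <- x[4, 3] <- 1

keep <- 3:4
y_sub <- y[keep, keep]           # induced subgraph: nodes 3, 4 become 1, 2

# Each undirected dyad appears twice in a symmetric matrix, hence the / 2.
aligned    <- sum(y_sub * x[keep, keep]) / 2
# ASSUMPTION: misalignment modeled as using the leading block of x.
misaligned <- sum(y_sub * x[seq_along(keep), seq_along(keep)]) / 2
```

Here `aligned` is 1 and `misaligned` is 0: the tie 3-4 is matched against the covariate for dyad 1-2 once the vertex IDs shift.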

Approaches

network API

Introduce the concept of a dyad attribute (perhaps set with %d%) that functions such as add.nodes(), permute.vertexIDs(), and get.inducedSubgraph() recognize and update accordingly. This is probably the most seamless and elegant way. @CarterButts, what do you think?

%ergmlhs% API

It's up to the user to provide ergm() with a list of network attributes subject to adjustment. Then, something along the lines of

flomarriage %n% "business" <- as.matrix(flobusiness)
flomarriage %ergmlhs% "dyadattr" <- c("business")

could inform the S() operator that the listed attributes should be subsetted along with the network. This could be automated by providing a %d%<-.network() method that sets the network attribute and updates the %ergmlhs% metadata.
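
A minimal sketch of how such a registry might behave, using a plain list as a stand-in for a network object. Everything here is hypothetical: `new_toy_net()`, `induce()`, and this `%d%<-` are illustrative names, not functions from network or ergm, and the real %ergmlhs% storage is not modeled.

```r
# Toy stand-in for a network object: a plain list, NOT the network class.
new_toy_net <- function(adj) {
  list(adj = adj, attrs = list(), dyadattr = character(0))
}

# Hypothetical `%d%<-`: store the attribute and record that it is dyadic.
`%d%<-` <- function(x, attrname, value) {
  x$attrs[[attrname]] <- value
  x$dyadattr <- union(x$dyadattr, attrname)
  x
}

# Subsetting consults the registry and leaves unregistered attributes alone.
induce <- function(net, keep) {
  net$adj <- net$adj[keep, keep, drop = FALSE]
  for (a in net$dyadattr) {
    net$attrs[[a]] <- net$attrs[[a]][keep, keep, drop = FALSE]
  }
  net
}

net <- new_toy_net(matrix(0, 4, 4))
net %d% "business" <- diag(4)          # registered: follows the subgraph
net$attrs$original <- matrix(0, 4, 4)  # unregistered: kept as-is
sub <- induce(net, 3:4)
```

After `induce()`, `sub$attrs$business` is 2x2 while `sub$attrs$original` is still 4x4, so only attributes the user explicitly marked as dyadic are adjusted.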

Vertex name matching

edgecov() and dyadcov() could be more clever. Firstly, if the dyadic covariate dimension does not match the current network size, they should at the very least detect that. They could then see if, e.g., the dyadic covariate matrix has row and column names, which could then be used to map onto the current vertex names.
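
The name-matching idea could look roughly like the base-R sketch below. `align_dyadcov()` is a hypothetical helper, not an existing ergm function; it assumes the covariate is a square matrix and that the current vertex names are available as a character vector.

```r
# Hypothetical helper (not part of ergm): align a dyadic covariate matrix
# with the vertex names of the current network.
align_dyadcov <- function(x, vnames) {
  n <- length(vnames)
  if (!is.null(rownames(x)) && !is.null(colnames(x))) {
    absent <- setdiff(vnames, intersect(rownames(x), colnames(x)))
    if (length(absent) > 0) {
      stop("covariate has no row/column for: ", paste(absent, collapse = ", "))
    }
    # Subset and reorder by name so the covariate follows the vertices.
    return(x[vnames, vnames, drop = FALSE])
  }
  # No names to match on: only safe when the dimensions already agree.
  if (!all(dim(x) == n)) {
    stop("covariate is ", nrow(x), "x", ncol(x), " but the network has ",
         n, " vertices and there are no names to match on")
  }
  x
}

x <- matrix(1:16, 4, 4, dimnames = list(letters[1:4], letters[1:4]))
res <- align_dyadcov(x, c("c", "d"))   # picks the c/d block by name
```

An unnamed 4x4 matrix passed with two vertex names raises an error instead of being silently misaligned, which is the "at the very least detect that" behavior.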

@mbojan
Member

mbojan commented Jul 5, 2024

Thumbs up for a network feature.

@krivit
Member Author

krivit commented Jul 10, 2024

@CarterButts, what do you think? In principle, we can test the %d% attribute system out in ergm before porting it to network.

@CarterButts

Putting that in network is a definite no-go. The reason is this: a network attribute can hold anything, and it can have any meaning. In the specific case you cite, it is understood that the network attribute contains an adjacency matrix whose nodes correspond to the vertices in the underlying graph - so in that case, it is natural to want a node permutation or subgraph operation to also adjust the matrix accordingly. However, in general we don't know that. A network attribute need not be a matrix. If it is a matrix, it need not be an adjacency matrix. If it is an adjacency matrix, it need not correspond to the nodes of the network object that holds it. And if it is such a matrix, it still doesn't follow that the user wants it to be modified. (For instance, what if I specifically stored the structure of the original graph so that, for subgraphs, I could refer back to the original network structure? We can't know what the user has in mind.) So the problem is that this is a special case, and can't be used to drive general behavior.

There actually is a standard way to ensure that the behavior in question is followed: one can store both the graph and the dyadic values as edge attributes, which will then be handled automagically if the network changes. We usually wouldn't do this, because of the overhead from a sparse representation of a non-sparse covariate (you would be adding edges for every non-zero dyad), and the overhead usually isn't worth it. But this does tell network that "this information is dyadic, it should follow the dyads, and if you remove or swap dyads, you have to swap the information." Putting an object in the network with %n% tells network, "this is some arbitrary information that I need to stash, can you keep it for me?" Which is very handy. But network can't know what you are storing there or why, so it can't know how to change it for you (nor even if you want it changed, nor when you want it changed). The price of power is responsibility.
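
The trade-off can be seen in a toy base-R edge list, used here as a stand-in for a sparse network object (the data and the `induce()` helper are hypothetical; this is not the network package API).

```r
# One row per dyad that carries either a tie or a nonzero covariate value.
edges <- data.frame(
  tail     = c(1, 1, 3, 2),
  head     = c(2, 3, 4, 4),
  tie      = c(TRUE, FALSE, TRUE, FALSE),  # is this dyad an actual tie?
  business = c(0, 1, 1, 1)                 # dyadic covariate value
)

# Inducing a subgraph keeps rows whose endpoints both survive and renumbers
# the vertices; the attribute columns travel with their rows.
induce <- function(edges, keep) {
  sub <- edges[edges$tail %in% keep & edges$head %in% keep, ]
  sub$tail <- match(sub$tail, keep)
  sub$head <- match(sub$head, keep)
  sub
}

sub  <- induce(edges, 2:4)
stat <- sum(sub$business[sub$tie])   # edgecov-style statistic on real ties only
```

The covariate stays aligned after subsetting, but two of the four stored rows exist only because the covariate is nonzero there, which is the overhead described above.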

By contrast, it makes a lot of sense to think about building more intelligence into dyadcov and edgecov, since they are able to make much more nuanced guesses about user intent. If I stash something in a network with %n%, I could be doing anything. But if I send a network attribute to edgecov, I'm attesting that I'm passing something with much more restrictive structure, for a much more restrictive purpose. I like the idea of name matching, because that seems very flexible. Indeed, it could allow you to pass your dyadic data in all sorts of forms.

I can see two issues. First, there's the question of how much overhead one would have (maybe not much, especially since this is only paid at model setup and models can be reused, but it's a thought). Second, this obviously won't work with bare matrices. A possible (not very well-developed) thought is to allow *cov terms to take an "inclusion" argument, which says which elements of the parent object get used for the actual covariate in the model. Hypothetically, S() and friends could then work by passing that inclusion structure to the *cov terms, rather than by subsetting the covariate data. The "inclusion" argument is also nice in that, if you knew you were fitting a model to a subset of a larger object manually, you could just pass the larger object and subset the covariates that way. But this gets into ergm formula parsing, and I do not pretend to have mastery of the new API (much less how it is working behind the scenes)!

Anyway, name matching might be enough, because one can use the default assumption that "no names = 1:1 ordering" (which would be needed anyway, for backwards compatibility). So if you are going to use S() or whatnot, then it's on you to set your vertex names correctly. However, if you aren't, then you can just ignore the whole thing. Sounds fair and balanced.
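
One way to picture the "inclusion" idea is the base-R sketch below. `edgecov_stat()` and its `include` argument are hypothetical and do not reflect how ergm terms are actually implemented; the function assumes an undirected network given as a symmetric adjacency matrix.

```r
# Hypothetical statistic with an "inclusion" argument (not an ergm API):
#   y       adjacency matrix of the modeled (sub)network
#   x       covariate matrix on the parent vertex set
#   include parent indices that the rows/columns of y correspond to
edgecov_stat <- function(y, x, include = seq_len(nrow(x))) {
  stopifnot(nrow(y) == length(include), ncol(y) == length(include))
  # Undirected: each dyad appears twice in the symmetric sum.
  sum(y * x[include, include, drop = FALSE]) / 2
}

# Toy parent network and covariate on 4 nodes (made-up data).
y <- matrix(0, 4, 4); y[1, 2] <- y[2, 1] <- 1; y[3, 4] <- y[4, 3] <- 1
x <- matrix(0, 4, 4); x[3, 4] <- x[4, 3] <- 1

full <- edgecov_stat(y, x)                        # whole network
part <- edgecov_stat(y[3:4, 3:4], x, include = 3:4)  # subgraph, full covariate
```

The covariate is never subsetted by the caller; the operator only passes along which parent vertices are in play, and the default reproduces the 1:1 ordering.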

@krivit
Member Author

krivit commented Jul 14, 2024

> The reason is this: a network attribute can hold anything, and it can have any meaning. In the specific case you cite, it is understood that the network attribute contains an adjacency matrix whose nodes correspond to the vertices in the underlying graph - so in that case, it is natural to want a node permutation or subgraph operation to also adjust the matrix accordingly. However, in general we don't know that. A network attribute need not be a matrix. If it is a matrix, it need not be an adjacency matrix. If it is an adjacency matrix, it need not correspond to the nodes of the network object that holds it. And if it is such a matrix, it still doesn't follow that the user wants it to be modified. (For instance, what if I specifically stored the structure of the original graph so that, for subgraphs, I could refer back to the original network structure? We can't know what the user has in mind.) So the problem is that this is a special case, and can't be used to drive general behavior.

Hence the proposal to add the %d% operation and corresponding functions. This makes the user's intent clear.

> We usually wouldn't do this, because of the overhead from a sparse representation of a non-sparse covariate (you would be adding edges for every non-zero dyad), and the overhead usually isn't worth it.

It's not just the overhead; it's also what you do with the network afterwards, whether in sna, in ergm, or elsewhere, since you would then need to filter out the edges that are present only because they have a nonzero value of a dyadic attribute.

@CarterButts

Ah, I'd misunderstood what you intended with %d%. However, that also raises issues. Restricting it to adjacency matrices defeats the whole point of the network object: what you are then doing is going to all the trouble to represent your network as a sparse data object, only to stick an adjacency matrix of equal size on it. If you are doing that, it would make as much sense to just use adjacency matrices in the first place, and avoid the trouble. If you allow the attribute to be something else, you now have to worry about how many and which internal formats you will support. And further, you'd need it to have well-defined behavior for hypergraphs and other such objects. There is a way to cover all cases, but that involves... storing the data as edges of a network object. Which is what we already have.

I think that's part of the issue, in the end: there's already a way to do this in the general case, and so anything else one implements is just going to be recreating something equivalent to a network object all over again. I'm also unconvinced that there's a problem to be solved in the first place, since it has always been up to the user to prepare their data before passing it to ergm: if they change their response and don't change their covariates, that's a PBCAK. If the issue is that we want ergm to be able to work on subgraphs of a network object, the natural thing to do is to have ergm (which works with a much more restricted context) decide how it wants to pull that data from the network. That is also likely to be more efficient than modifying the whole storage object.

Given that this is not something for which there has been an evident need in the past, what is the argument for why this is now an important feature?

@krivit
Member Author

krivit commented Jul 14, 2024

> Given that this is not something for which there has been an evident need in the past, what is the argument for why this is now an important feature?

Terms that operate on a subgraph, mainly. That includes the S() operator, but also support for multimode models in general.
