find publications that cite your data #13

Open
twhiteaker opened this issue Apr 5, 2023 · 15 comments
Comments

@twhiteaker
Contributor

On the publications page, we could add a section on how to discover publications that cite an LTER site's data. I don't know what the best way to do this is. Here's what BLE has documented, though we don't actively follow these ideas:

https://github.com/BLE-LTER/ble-handbook/blob/main/handbook.md#data-citations

@marty-downs
Member

I might condense the first couple of approaches -- which it seems like you've rejected as impractical anyway -- but, yes, I think this would be useful.

@mbjones
Member

mbjones commented Apr 5, 2023

@twhiteaker we've focused on this issue a lot and have several complementary approaches to gathering data citations into a common resource. We compile them at the DataONE citation and metrics service and send them to the DataCite EventData service. The DataONE metrics service can be used to get the information we have compiled and scraped from these varied sources. For example, here's a screenshot of the Arctic Data Center portal showing, for each publication, which datasets it cites (sometimes more than one):

[Image: screenshot of the Arctic Data Center portal listing publications and the datasets they cite]

We also provide mechanisms for researchers to notify us that they cited a dataset, and we query a number of sources regularly. Althea Marks worked last year for us on evaluating completeness (or lack thereof) of these citation compilation efforts, and I can share her AGU poster on the topic if you have interest. Let me know if you'd like to chat further about this -- I'd love to see cross-community approaches that make this easier for everyone.

@mbjones
Member

mbjones commented Apr 5, 2023

Oh, and if you want to see an LTER-related example of this, check out the Toolik Lake portal: https://search.dataone.org/portals/toolik/Metrics

@twhiteaker
Contributor Author

@mbjones It seems like leveraging the DataONE metrics service would be the way to go for LTER sites. Can we access it directly or do we need some magic DataONE member mojo?

Is version 0.0.2 the latest documentation? Access to the service looks more complicated than just formulating a URL. But if you can enable/teach me to use the service for this LTER use case, then I'm willing to write it up for this IM manual.

I think what we'd want for a given LTER site is a table of publications that cite their data. There would be columns for the publication and columns for the cited dataset. The columns could simply be publication DOI and data DOI, but author, title, and year for each would also probably be helpful. I think in many cases you could filter by package identifier with the site acronym, e.g., knb-lter-ble, but that may not always be the case.
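As a rough illustration of the table described above, here is a minimal Python sketch. The column names, the `CitationRow` class, and the helper functions are hypothetical, and the sample rows use made-up placeholder DOIs rather than real citations. It assumes package identifiers follow the `scope.identifier.revision` pattern, which, as noted, may not hold for every site.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class CitationRow:
    # Columns describing the citing publication
    pub_doi: str
    pub_author: str
    pub_title: str
    pub_year: str
    # Columns describing the cited dataset
    data_doi: str
    data_package_id: str  # e.g. "knb-lter-ble.3.1"


def filter_by_scope(rows, scope):
    """Keep rows whose package identifier starts with '<scope>.'."""
    prefix = scope + "."
    return [r for r in rows if r.data_package_id.startswith(prefix)]


def to_csv(rows):
    """Serialize rows to CSV text, header line first."""
    buf = io.StringIO()
    names = [f.name for f in fields(CitationRow)]
    writer = csv.DictWriter(buf, fieldnames=names, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buf.getvalue()


# Placeholder rows for illustration only; these DOIs are not real.
rows = [
    CitationRow("10.0000/pub-a", "Example, A.", "Made-up title A", "2023",
                "10.0000/data-a", "knb-lter-ble.3.1"),
    CitationRow("10.0000/pub-b", "Example, B.", "Made-up title B", "2022",
                "10.0000/data-b", "knb-lter-xyz.1.1"),
]
ble_rows = filter_by_scope(rows, "knb-lter-ble")
print(to_csv(ble_rows))
```

Filtering on the trailing dot (`knb-lter-ble.`) avoids accidentally matching a different scope that merely shares the same leading characters.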

BLE uses EDI's dataset landing pages to add publications that cite data packages. When we do that, does that information make it into the DataONE metrics?

@mbjones
Member

mbjones commented Apr 5, 2023

Hey @twhiteaker -- I think the most recent version of our docs is here: https://app.swaggerhub.com/apis/nenuji/data-metrics/1.0.0.5, but we haven't documented the metrics service as fully as our other services, so things could be incomplete. I will cc Rushiraj @nenuji, who is the main author of the docs, to see if he has any comments.

DataONE draws from a number of sources, but we both pull from the DataCite EventData service and push reliable citations back to EventData. So, if EDI is publishing their citations to EventData as well, I think they should show up in the DataONE service, probably with a lag.

You can see an example metrics query by inspecting the network calls made by the DataONE portal service in your browser. Here's an example request made for the Toolik portal, which is what is used to construct the screenshot I included earlier (and other stuff on that page):

https://logproc-stage-ucsb-1.test.dataone.org/metrics?metricsRequest={"metricsPage":{"total":0,"start":0,"count":0},"metrics":["citations","downloads","views"],"filterBy":[{"filterType":"portal","values":["toolik"],"interpretAs":"list"},{"filterType":"month","values":["01/01/2012","04/05/2023"],"interpretAs":"range"},{"filterType":"query","values":["((northBoundCoord:[*  TO 69.1] AND southBoundCoord:[68.2 TO *] AND eastBoundCoord:[* TO  \\-148.5] AND westBoundCoord:[\\-150.7 TO *])) AND (-obsoletedBy:* AND  formatType:METADATA)"],"interpretAs":"list"}],"groupBy":["month"]}

That gets all of the datasets associated with a portal, but you can also use "filterType":"catalog" to do a broader query against any datasets in DataONE.
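For anyone who would rather build such a request in code than copy it from the browser, here is a minimal Python sketch. The endpoint host is the staging host copied from the example request above and may not be the right one for production use. The function names are hypothetical, and whether the service accepts a request without the `query` filter shown in the example is an assumption.

```python
import json
from urllib.parse import quote

# Staging host taken from the example request in this thread (may differ in production).
METRICS_ENDPOINT = "https://logproc-stage-ucsb-1.test.dataone.org/metrics"


def build_metrics_request(portal, start, end, metrics=("citations",)):
    """Build the metricsRequest JSON body for one portal and a date range.

    Dates use the MM/DD/YYYY form seen in the example request.
    """
    return {
        "metricsPage": {"total": 0, "start": 0, "count": 0},
        "metrics": list(metrics),
        "filterBy": [
            {"filterType": "portal", "values": [portal], "interpretAs": "list"},
            {"filterType": "month", "values": [start, end], "interpretAs": "range"},
        ],
        "groupBy": ["month"],
    }


def build_metrics_url(request, endpoint=METRICS_ENDPOINT):
    """Percent-encode the JSON and attach it as the metricsRequest parameter."""
    payload = json.dumps(request, separators=(",", ":"))
    return f"{endpoint}?metricsRequest={quote(payload, safe='')}"


request = build_metrics_request("toolik", "01/01/2012", "04/05/2023")
url = build_metrics_url(request)
print(url)
```

The sketch only constructs the URL. Fetching it and interpreting the response is left out because the response format is not described in this thread.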

The request is specified in JSON, as outlined in the SWAGGER docs linked above. We're happy to chat about the details over in the DataONE slack (https://slack.dataone.org) in the #dev-general channel if you'd like more of a conversation on it.

@cgries
Contributor

cgries commented Apr 5, 2023

EDI has recently started to report citations to DataCite. So, if DataONE reports there too, it would probably be easier to query their API.

But what is the goal for the IM manual? Aren't you trying to find more citations, i.e., in addition to the ones that are already linked through efforts by EDI and DataONE? At least that was my impression when reading your BLE document.

@twhiteaker
Contributor Author

@cgries The goal is to find all data use citations for a given LTER site. For sites archiving solely with DataONE member nodes, utilizing the metrics service is the most practical solution. I don't think IMs have time to go searching for citations themselves. If they do, and if they find a scriptable solution that gets citations not in the metrics service, then they can share their script with DataONE which helps everyone.

If a site archives data outside of a DataONE member node, then I guess they're on their own, though the manual could still provide some guidance to at least get started.

@twhiteaker
Contributor Author

It sounds like DataONE pushes results to the DataCite EventData service. To support cases when a dataset isn't in DataONE, maybe it makes more sense for me to query DataCite. Sound right?

@mbjones
Member

mbjones commented Apr 10, 2023

Yes, you could query EventData directly (and scythe provides a nice R package for doing that if you would like).

Note that, in theory, DataCite only supports citations to DOI-bearing objects, whereas DataONE can store citations to objects with any identifier type. We had a TODO item in Make Data Count to support other identifier types as well, but that has not yet materialized as far as I know. I do think it is still on DataCite's radar, though, for the new open citation service they recently announced.

@twhiteaker
Contributor Author

I made an example Python project which gets citations using DataCite. The demo code produces a CSV file of all data citations for datasets under a given LTER site (via the scope).

It wound up being more complicated than I thought. I'm querying EDI's PASTA to get all of BLE LTER's dataset DOIs, then DataCite to get the citations, which just gives me a DOI, then Crossref to get metadata (e.g., author, title) on the citations.
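The three-step chain described above (PASTA for dataset DOIs, DataCite for citation events, Crossref for publication metadata) can be outlined roughly as follows. This is a sketch and not the linked project's code. The function names are hypothetical, and the PASTA and DataCite routes reflect my reading of their public APIs and should be checked against current documentation. The Crossref record in the example is made up.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

PASTA = "https://pasta.lternet.edu/package"
DATACITE_EVENTS = "https://api.datacite.org/events"
CROSSREF_WORKS = "https://api.crossref.org/works"


def pasta_identifiers_url(scope):
    """Step 1a: list dataset identifiers for a scope (assumed PASTA route)."""
    return f"{PASTA}/eml/{scope}"


def pasta_doi_url(scope, identifier, revision):
    """Step 1b: look up the DOI of one data package (assumed PASTA route)."""
    return f"{PASTA}/doi/eml/{scope}/{identifier}/{revision}"


def datacite_events_url(dataset_doi):
    """Step 2: events that mention a dataset DOI (assumed 'doi' parameter)."""
    return f"{DATACITE_EVENTS}?{urlencode({'doi': dataset_doi})}"


def crossref_work_url(pub_doi):
    """Step 3: Crossref metadata record for a citing publication."""
    return f"{CROSSREF_WORKS}/{quote(pub_doi, safe='/')}"


def summarize_crossref(message):
    """Reduce a Crossref 'message' dict to author, title, and year."""
    authors = "; ".join(a.get("family", "") for a in message.get("author", []))
    title = (message.get("title") or [""])[0]
    date_parts = message.get("issued", {}).get("date-parts", [[None]])
    return {"author": authors, "title": title, "year": date_parts[0][0]}


def fetch_json(url):
    """Network helper for JSON responses; not called in this sketch."""
    with urlopen(url) as resp:
        return json.load(resp)


# Made-up Crossref-style record, for illustration only.
sample = {
    "author": [{"given": "A.", "family": "Example"}],
    "title": ["A made-up title"],
    "issued": {"date-parts": [[2023, 4]]},
}
print(summarize_crossref(sample))
```

Keeping URL construction separate from fetching makes each step easy to check by hand before running it against the live services.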

What are the chances that (a) EDI will add a reporting feature where I can get citations for all datasets for a given LTER site (via scope, I presume), or (b) DataONE will add similar functionality, perhaps based on the LTER site's name or some other filter if we can't filter by scope? If chances are slim, then perhaps the code I wrote can be referenced in the new section of the IM manual about this.

@cgries
Contributor

cgries commented Apr 14, 2023

@twhiteaker are you envisioning more than the Journal citation services in the PASTA API? Specifically the List Data Package Citations, which can be run for a scope?

@twhiteaker
Contributor Author

Does it work for scope? This gave me no results.
https://pasta.lternet.edu/package/citations/eml/knb-lter-ble

If it works, that would be very convenient. We'd still have to worry about datasets that aren't in EDI, which would be an advantage of going straight to DataCite.

@twhiteaker
Contributor Author

twhiteaker commented Apr 14, 2023

@cgries Also this one includes three citations for preprints
https://pasta.lternet.edu/package/citations/eml/knb-lter-ble/3/1
which I removed a while ago via https://portal.edirepository.org/nis/journalCitations.jsp
but which still show up via the citations service and on the dataset landing page
https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-ble.3.1

[Image: screenshot showing three citation entries]

The screenshot above shows that there are only three entries. Ah, I'm recalling now that maybe you added the other citations, and so that's why I don't see them and can't edit them out.

@cgries
Contributor

cgries commented Apr 14, 2023

@twhiteaker

  1. No, you'd have to loop through dataset identifiers and versions. Making that more convenient would be a request for Mark.
  2. I am not sure I am following. Those citations look like real publications to me, not preprints. And yes, you can't delete what I entered in the interface. I tend to leave preprints, because they usually are open access while the real paper is frequently behind a paywall. And the preprints don't go away.
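The loop from item 1 could look something like the following Python sketch. It builds one citations URL per data package, using the URL pattern from the `knb-lter-ble/3/1` example earlier in the thread. The helper names are hypothetical, and the list of package identifiers would have to come from elsewhere (for instance, a PASTA listing call).

```python
PASTA_CITATIONS = "https://pasta.lternet.edu/package/citations/eml"


def parse_package_id(package_id):
    """Split 'scope.identifier.revision' into its three parts."""
    scope, identifier, revision = package_id.rsplit(".", 2)
    return scope, int(identifier), int(revision)


def citation_urls(package_ids):
    """Yield one PASTA citations URL per data package identifier."""
    for package_id in package_ids:
        scope, identifier, revision = parse_package_id(package_id)
        yield f"{PASTA_CITATIONS}/{scope}/{identifier}/{revision}"


# Only knb-lter-ble.3.1 appears in this thread; a real run would list every package.
urls = list(citation_urls(["knb-lter-ble.3.1"]))
print(urls)
```

Splitting from the right with `rsplit` keeps any dots or hyphens in the scope intact.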

@twhiteaker
Contributor Author

Ah, I now see the benefit of leaving in preprints. For counting citations for NSF reports, I think I'd still leave the preprints out.
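One possible way to keep preprints in the record but out of a report count is sketched below in Python. It assumes each citation record carries Crossref's `type` field and that preprints are typed `posted-content` there, which matches my understanding of Crossref metadata but should be verified. The function name and the sample records are made up for illustration.

```python
def split_preprints(records):
    """Separate records into (published, preprints) by Crossref work type."""
    published, preprints = [], []
    for record in records:
        # Assumption: Crossref marks preprints with type "posted-content".
        if record.get("type") == "posted-content":
            preprints.append(record)
        else:
            published.append(record)
    return published, preprints


# Made-up records; the DOIs are placeholders.
records = [
    {"DOI": "10.0000/pub-a", "type": "journal-article"},
    {"DOI": "10.0000/pre-a", "type": "posted-content"},
    {"DOI": "10.0000/pub-b"},  # no type given, so it is treated as published
]
published, preprints = split_preprints(records)
print(len(published), "counted;", len(preprints), "preprints set aside")
```

Records without a `type` are counted as published here. Whether that default is appropriate depends on how complete the metadata is.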

For the IM Manual, I'm thinking of suggesting folks use DataCite, and for an example see this Python implementation.
