Submission: phonfieldwork #385
Hello @agricolamz and many thanks for your submission. We are discussing whether the package is in scope and need a bit more information.
Hello @annakrystalli! Thank you for your response. In the README file I've mentioned other packages. My package has another perspective: its main goal is to provide different tools for the phonetic/phonological researcher. In order to do that I provided several functions that are not connected to
Summing up: my package is focused on making the research process a plain thing. All cutting and merging of sounds is based on It is also worth mentioning that my package is not the only one that creates spectrograms ((Sueur 2018) lists other options). A similar idea about the representation of gathered data (without sound) could be found in Regarding the ethics policy: honestly, I hadn't thought about it. That doesn't mean that I don't care. It does mean that I need help on this topic, and whatever path this submission takes, I will appreciate any help and ideas about it. I will try to list here what I know about the topic. Simple linguistic research or documentation looks like this: a researcher records some audio/video files of speech events and then extracts some linguistic features from them. Sometimes it is possible to conduct research in big cities with a mostly literate population; then consent is not a problem. But there is also a huge amount of work performed in the "field", where researchers go to small settlements with a mostly rural population. In most cases it is possible to receive consent, but it is hard to explain what it means. In some countries there is an ethics commission that decides whether it is possible to conduct a study, but in some countries there is no policy at all on this question (e.g. in my own country, Russia). Since my package makes it easier to publish multiple speech samples, that could be a problem, since people could be identified (especially if they are from a small community). What can I do? I can insert in my documentation and the I think that there are several things that a participant should agree on:
If a participant agrees only to the first one, it would be useful to create an access restriction (e.g. password protection) on the generated HTML viewer. This would limit users to academic circles. Another thing that could be done is modulating speech signals in order to reduce the possibility of speaker recognition. I hope I have explained my position and my ideas, and I hope that the rOpenSci community can help to elaborate on this problem.
I think this is an interesting question. Essentially this is a tool. Could the tool be used by someone who has not adequately protected their human participants from being identified, or who has not obtained voluntary informed consent? Yes; however, I don't think there is anything that can reasonably be done about that by the tool maker, because that would really require (from my understanding) changes to the data collection process. Complicating this is the issue that while US-based researchers at federally funded institutions are required to comply with the Common Rule as well as the Belmont Report principles it was derived from, researchers in other places are subject to their local regulations. And non-researchers have no such requirements at all. The Common Rule and the Belmont Report do not create bright lines; they talk about balancing competing principles: beneficence, justice, and respect for persons are sometimes not in agreement. The harm must be weighed against the benefit. In the case of this kind of linguistic data, as I understand it, the potential harm is very much going to be context-dependent. First there is the issue of participants being identifiable, but, second, there could also be larger political and social issues that we have not thought about (for example when a language or culture is being suppressed). This is why respect for persons requires that all of the potential risks be explained to potential participants before data collection. Overall, I do think that having some kind of notice about the need to obtain voluntary informed consent will at least foreground the issue for researchers. There could be a link to a larger document that describes the range of possible issues in more depth.
I totally agree, but it means that the same kind of note should be put on each package that works with sounds, pictures, and video. And for now our concern is only about humans, but we could think about it at a greater scale: imagine that I recorded and published audio of a rhinoceros' mating signal that would make it easier for poachers to lure those endangered creatures... This note would also need an anti-piracy note, in order to prevent publishing music pieces. Even further:
Thank you for your feedback @elinw. And thank you for your patience and response @agricolamz. As we encounter new applications involving human subject data, it might take a little longer, and input from domain experts like @elinw might be required to help ensure our policy is fit for purpose. So I'll summarise where I think we are at the moment: overall, I think just some more detailed documentation regarding data protection for privacy and the handling of data according to informed consent will be enough to start the review process. In particular, drawing attention to what @agricolamz highlighted:
I think you also made some good suggestions regarding encryption. Other, simpler options could be warnings, asking for confirmation before performing an action (e.g. publishing data), adding particular files to .gitignore, or even scanning data for potentially sensitive information (e.g. names). I think a full review seems a better way to assess which strategy would be most useful and where throughout a typical workflow, so I suggest we ask reviewers to consider these points during review. So, once some more documentation is added to the README on:
I am happy to pass it on to the handling editor.
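One of the simpler options mentioned above, asking for confirmation before performing an action, could be sketched in R like this (the function name `publish_viewer()` is hypothetical, not part of phonfieldwork's API):

```r
# Hypothetical sketch of a confirmation prompt before publishing data;
# publish_viewer() is an illustrative name, not a real phonfieldwork function.
publish_viewer <- function(path) {
  if (interactive()) {
    ok <- utils::askYesNo("This will make audio data publicly available. Continue?")
    if (!isTRUE(ok)) {
      message("Publishing cancelled.")
      return(invisible(FALSE))
    }
  }
  # ... render and publish `path` here ...
  invisible(TRUE)
}
```

In non-interactive sessions the prompt is skipped, so scripted pipelines would still run unattended.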
I've added a message to the Since my package is actually a bunch of small functions, I don't know how to make a small demo of it, but there is a link to the whole tutorial. I also added a small summary and links to other packages to the README.
Yes, I think that the vignette is very good.
Thanks @agricolamz. @melvidoni will be your handling editor. It would be good to point reviewers to the above discussion to consider while they are reviewing.
Editor checks:
Editor comments
Hello @agricolamz, thanks for submitting! I'm Dr Melina Vidoni, and I'll be your handling editor. Here are some basic checks for your package:
And the output for
Can you please fix these and advise when it is done? I will find reviewers after that. Reviewers: @jonkeane and @nikopartanen
Dear @melvidoni, I have added tests, so there is now 82% coverage. I decided to keep lines longer than 80 characters when they contain links (here is an example). There is also one false alarm about
Hello @agricolamz, thanks for the revision. It is good now. The first reviewer will be @jonkeane. I am searching for a second reviewer now. The review deadline will be updated once I have both reviewers confirmed.
Wow, a sign language specialist! @jonkeane, I'm going to extend my package to work with video for sign languages one day.
Hello 👋 I've already started (a very tiny bit) looking around in phonfieldwork, and I am excited to review it! I haven't done sign language research (or really linguistics in general) in a few years now, but when I saw this package in the issues it piqued my curiosity. If you are thinking about video processing, you might want to check out signglossR, which might already have some of that functionality — I'm sure there's space for two packages, and the technical needs for sign vs. spoken language phonetics/phonology are different enough that it might make sense to keep them separate as well. Anyway, I'm looking forward to what you've done already 😃
Hello! @jonkeane, I knew about this package, but it has developed a lot since I last saw it! @melvidoni, may I propose my own reviewers? I know it is weird since I know them personally, but in case it is not weird:
Hello @agricolamz, thanks for the suggestions. I have contacted another potential reviewer. I'll wait for their response, and if it is negative, I'll contact them. Thank you!
Hello @agricolamz. I am still searching for reviewers. I will provide a review deadline once the second reviewer is found.
Thank you @jonkeane and @nikopartanen for agreeing to review the package. The review due date is Friday, August 28th.
Package Review
Documentation
The package includes all the following forms of documentation:
Functionality
Final approval (post-review)
Estimated hours spent reviewing: 10-12
Review Comments
Overall this was really interesting to review, and I think {phonfieldwork} shows a lot of promise (and already goes a long way!) to help make some of the more repetitive tasks involved with phonetic and phonological work much easier and more reliable. I threw a few ELAN files that I had from my dissertation research at it and was impressed that it handled them easily. I even managed to read in a folder of ELAN files that ultimately had around 166k annotations in ~30 seconds with I have a few larger/more sweeping comments and ideas for changes that are more architectural or broad, and then at the bottom I went through the package function by function with specific comments and suggestions about the code. Small note: I did some of this reviewing in two batches somewhat far apart, and there were a few pushes to the GitHub repo between the two, so if the line numbers don't match up or make sense, let me know and I'll look at them again and make sure I have the most up to date ones.

Large(r) scale comments

Think about making classes+methods for some of the analysis types that get read in
There are a number of places where you detect if an object is a TextGrid (typically by checking that the second element has "TextGrid" in it), examples: You could even make an object type for each of the analysis objects that you read in, and if you did this, you could also consider doing things like making This isn't a requirement, and what you have here shows that you don't need this to have the functionality you want, but I think it would clean up some of the duplicated code in the package.

Function restructuring
In a few places you have a pattern of one function that creates some output (like an .Rmd file that is then rendered). I wonder if it would be better to split these functions apart, so that you have one function to make the source (.Rmd, .csv, etc.) and then another that lightly wraps + renders + cleans up the source. Something like:
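A minimal sketch of what such a split could look like (all function names here are illustrative, not phonfieldwork's actual API; the wrapper assumes the {rmarkdown} package):

```r
# Hypothetical sketch of the create/render split; names are illustrative,
# not the package's actual API.

# One function only generates the .Rmd source and returns its path...
create_viewer_rmd <- function(data, path = tempfile(fileext = ".Rmd")) {
  writeLines(
    c("---", "title: \"Annotation viewer\"", "---", "",
      "Content generated from `data` goes here."),
    path
  )
  path
}

# ...and a thin wrapper renders it and cleans up the source afterwards.
create_viewer <- function(data, output_file = "viewer.html") {
  src <- create_viewer_rmd(data)
  on.exit(unlink(src), add = TRUE)
  rmarkdown::render(src, output_file = output_file)
}
```

A user who wants to customize the generated source can call `create_viewer_rmd()` directly, while everyone else uses the one-call wrapper.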
This would give you the best of both worlds: being able to generate the source with a function call, and being able to render directly with one function call without needing to have an argument (

Documentation improvements
Some of the help documents could have larger descriptions of what they do, more details about how the arguments interact, what the function might be used for, etc. It also might be nice to have some organization on the pkgdown website: breaking help sections into topics would add a lot to discovery and to seeing how the different components of the package fit together.

String manipulation and {glue}
There's a decent amount of string manipulation that I think might benefit from using the {glue} package. I know adding a dependency might not be worth it, but I bet using {glue} would make some of the string manipulations that you do easier to read and operate with.

A few general style suggestions
In a few places you use right assignment with Though the package satisfies the lintrs specified in {goodpractice}, there are a number of linting rules from {lintr} that aren't followed consistently. This isn't a huge deal, but fixing them (possibly with {styler}) would make the code even more easily readable.

Comments on specific functions/files
with
The Tests
More minor comments

Formatting
I think you want an additional blank line above this list for it to format as a list. The tutorial link in the README links to the pkgdown website, which is circular when it is used as the index page for pkgdown.

Installation
I needed to install
Thank you, @jonkeane!
Thanks @jonkeane for the review! @agricolamz, you can start checking it, or you can wait for @nikopartanen to finish their review. There are still 4 days left.
Package Review
Documentation
The package includes all the following forms of documentation:
Functionality
Final approval (post-review)
Estimated hours spent reviewing: 10
Review Comments
As I understand it, the package has several larger areas where it can be used, which form the pipeline described in the documentation: 1) creating stimuli sets for data collection, 2) organising data collected with these stimuli, 3) processing and further annotating the annotation files, 4) reading the resulting datasets into R, and 5) visualizing this data. The package is designed so that the different parts of the pipeline work well together. Many of my recommendations relate to ways the package could be made more useful in somewhat different scenarios, where the data is collected differently or the user already has some materials in a different format and wants to take advantage of the package's functionality later in the pipeline.

Creating stimuli and organizing the files
This functionality is very useful and well thought out. There aren't that many tools around for this kind of task. The package documentation gives a good explanation of the intended workflow, with notes about recording practices. I recorded some test audio with Audacity following the documentation, and ended up with files called One problem I encountered with I understand from the documentation that the user creates the stimuli, shows them to a research participant, records the stimuli as individual clips, and then ends up with a list of recordings the length of the stimuli list, having thereby already removed accidental double takes. This is fine, and in many cases a really well functioning workflow, and the package does provide good help in doing this. It can be useful to take a look at SpeechRecorder by BAS, which provides similar functionality. The advantage of this software is that one can record directly within a template, and use a microphone through an audio interface connected to a laptop.
The main problem I see in using Maybe it would be useful to add a clear sequential id number to the stimuli list, so the user can easily listen, for example, to stimulus 75 and check that it is the same as the 75th file in their recording directory. I understand the page number in the stimuli file probably already fills this need, and maybe it is enough. Additionally, one could also specify in the

Processing data
Annotating TextGrid files works very well, and the functions are well thought out and form a clear pipeline. Examples in the documentation are also clear. It seems to me that there is no way to create a TextGrid file besides Separating the folder processing side of

Reading data into R
Reading datasets into R works really well. I tested extensively with both ELAN and Exmaralda corpora; the resulting data frame is well structured, and the package works without problems with different ELAN files. I must emphasize that these functions worked better with .eaf and .exb files than anything I have seen before, so I'm very happy with them. While testing I used Exmaralda files from the Nganasan Spoken Language Corpus and the Selkup Language Corpus. I also used an individual FlexText file I found here, which belongs to the corpus The Documentation of Eibela: An archive of Eibela language materials from the Bosavi region. For testing ELAN I used various files I had on my laptop; as this is a very common format, it was easier to find different examples. Using ELAN files was also entirely free from adventures. With some Exmaralda files I did encounter problems; this should reproduce my error:
Which results in:
Maybe there could be some generic error message that prints the path of the file, tells the user that the file cannot be parsed, and then shows the underlying error message. In a typical corpus creation workflow, the files we fail to parse are usually a very good indicator of the files that need to be checked. That some files fail to parse is just typical, and not necessarily a problem in itself. Especially if the file is broken, it is better to fail than to read it wrong. I assume in some corpora there are just very unusual file structures, possibly related to old conversions from format to format, so it is impossible to predict everything that is inside those files. But there may be some patterns that are so common that checking what is going on can improve the function. I also encountered one FlexText file that I couldn't parse, which is here. This file doesn't seem to contain the word-level annotations, so it is of course not surprising that it doesn't work. If this kind of file is more common (I have no idea; FlexText files are surprisingly hard to find!), it could help to have a specific error or warning message for it. I think most of the time after the user reads these files into R, they would need to filter them a bit and probably use
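For illustration, a hedged sketch of the kind of reshaping this points toward, assuming the annotations arrive in long format with one row per (token, tier) pair; the column names here are guesses, not the package's actual output:

```r
library(tidyr)

# Toy long-format annotation data; token_id / tier_name / content are
# illustrative column names, not phonfieldwork's actual output.
df <- data.frame(
  token_id  = c(1, 1, 1, 2, 2, 2),
  tier_name = rep(c("token", "pos", "morph"), 2),
  content   = c("dogs", "N", "dog-PL", "bark", "V", "bark")
)

# One row per token, one column per tier:
wide <- pivot_wider(df, names_from = tier_name, values_from = content)
```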
This way we end up with a nice data frame that has tokens, POS-tags, and morphological analysis organized so that we have one observation (a token here; it could also be the utterance) per row. It is impossible, of course, to predict which tiers the user wants to work with, which is why simple examples that give a direction to further possibilities could be useful. Or just mention that Note: these corpora are distributed under a CC BY-NC-SA license, so using some other examples in the package vignette is certainly recommendable, as the license is not compatible with that of the package. I assume using these files in my review and testing is unproblematic. I originally wanted to include a screenshot of the resulting dataframe, but I excluded it in order not to show actual corpus content. I agree with the other reviewer that reading individual files and reading all the files in a directory should somehow be separated, especially since when the user reads in whole directories there should be good control over things like whether files are found recursively, etc.

Visualizations
These work well, and as described. I'm also personally extremely happy to be able to create good-looking spectrograms in R. These are just small suggestions that could possibly improve usability. Adding the annotations from a data frame into a visualization made with As a technical note, using a tibble instead of a data.frame seems to give an error here. I think it would be important that the package not make a difference between the data.frame and tbl_df classes. Bonus points for Raven-style annotations; that's a great addition.

Further comments
I think When it comes to the HTML export, it could be worth checking out the Leipzig.js JavaScript library, as the glosses are much prettier when aligned correctly. However, Word export especially is something I've heard wished for a lot, so it's nice to see it exists and works well here.
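The tibble vs. data.frame point above could be addressed by coercing at function entry; a sketch, with a hypothetical function name and column:

```r
# Hypothetical sketch: making a function indifferent to tibble vs. data.frame
# input; draw_annotation() and the "content" column are illustrative names.
draw_annotation <- function(annotation) {
  # Tibbles never drop to a vector on single-column `[` indexing, which can
  # break code written for plain data.frames; coercing up front makes both
  # input classes behave identically downstream.
  annotation <- as.data.frame(annotation)
  annotation[, "content"]
}
```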
Thank you, @nikopartanen!
You are more than welcome, @agricolamz! It was a pleasure to dive deeper into this package, and I probably haven't even got that far with it yet, but today was the deadline and I wanted to respect that. Have a good weekend!
Thanks @nikopartanen for the review!
Hello @agricolamz, are there any updates on your reply to the reviewers?
Only small updates, but I will make more by the end of this week.
Dear @jonkeane and @nikopartanen, I've made a lot of changes. Thank you again for your comments. I'm mostly finished. The only big changes that are left are
I will finish it during this week. Here are my answers to some of your comments (for most of them, I just changed everything accordingly).

For @jonkeane
pandoc-citeproc
It looks like pandoc-citeproc is a dependency of

Think about making classes
I will keep it in mind, but for the time being I created a function

Function restructuring
I need more time to think about it.

Documentation improvements
I added sections to

String manipulation and {glue}
I don't want additional dependencies. But maybe I will change my opinion in the future.

srt_to_df()
I'm not sure what kind of names could be used here, since the common

Tests
For some reason it doesn't work for me (I checked both with

Comments on specific functions/files
So now, in order to separate the functions for reading individual files and reading all the files in a directory, I created a function
Hello @jonkeane and @nikopartanen. Please take a look at the submitted revisions, following our guide. Let me know in about 3 weeks at the most whether @agricolamz needs to complete another revision.
Hello again, I’ve reviewed the changes and the responses and I’m satisfied with both. For the encoding issues, I was using some of the files I found in http://www.ims.uni-stuttgart.de/phonetik/helps/praat-scripting/scripts.zip that gave me some of the encoding issues I mentioned. |
Thanks @jonkeane. @nikopartanen, did you have time to look at the changes?
Hello! Sorry for the delay! I tested the changes quite thoroughly last week when I used the package in a research task of my own, and I'm very satisfied with the way the package works now and how the ideas for changes were addressed. Thanks for your great work, @agricolamz!
Since both reviewers approved... Approved! Thanks @agricolamz for submitting and @jonkeane @nikopartanen for your reviews! To-dos:
Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them

Welcome aboard! We'd love to host a post about your package — either a short introduction to it with an example for a technical audience, or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide and tag @stefaniebutland in your reply. She will get in touch about timing and can answer any questions. We've put together an online book with our best practices and tips; this chapter starts the third section, which is about guidance for after onboarding. Please tell us what could be improved; the corresponding repo is here.
@melvidoni, could you give me admin access? I think I've done the rest (except a blog post).
You should have access now @agricolamz!
Dear @stefaniebutland, I would like to write a blog post about phonfieldwork. May I get a publication date?
Excellent @agricolamz. There are a couple of options for publication dates, depending on how urgent it is for you.
Blog post drafts are reviewed by rOpenSci Community Assistant @steffilazerte. Since your lingtypology post we have written https://blogguide.ropensci.org/ with detailed content and technical guidelines.
@stefaniebutland, thank you! I have already written my first draft, so I'd take the second option. Created a pull request.
Submitting Author: George Moroz (@agricolamz)
Repository: https://github.com/agricolamz/phonfieldwork
Version submitted: 0.0.8
Editor: @melvidoni
Reviewer 1: @jonkeane
Reviewer 2: @nikopartanen
Archive: TBD
Version accepted: 2020-10-20
Scope
Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):
Explain how and why the package falls under these categories (briefly, 1-2 sentences):
phonfieldwork is a tool that helps researchers create a sound annotation viewer from multiple sounds. I believe that a sound annotation viewer helps to make phonetics more reproducible, since it makes sound data widely available. It helps reduce the time that annotators spend during annotation in Praat, ELAN, and EXMARaLDA, so it partially replaces their functionality.
Who is the target audience and what are scientific applications of this package?
Linguists and phoneticians; probably, in the future, specialists in bioacoustics.
Are there other R packages that accomplish the same thing? If so, how does yours differ or meet our criteria for best-in-category?
There are packages rPraat and textgRid (https://CRAN.R-project.org/package=textgRid) that overlap in functionality with my package, but my package has wider coverage of tasks and is especially created for the specific workflow described in the docs (https://agricolamz.github.io/phonfieldwork/).
If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.
Technical checks
Confirm each of the following by checking the box.
This package:
Publication options
JOSS Options
paper.md matching JOSS's requirements with a high-level description in the package root or in inst/
MEE Options
Code of conduct