diff --git a/.Rbuildignore b/.Rbuildignore index e4075a19..da7110ba 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -18,3 +18,4 @@ R\.Rproj$ ^\.lintr$ .Rprofile$ .Rprofile\.R$ +^special-scripts$ diff --git a/.github/workflows/R-CMD-check.yaml b/.github/workflows/R-CMD-check.yaml index b085ed52..e03a0bb0 100644 --- a/.github/workflows/R-CMD-check.yaml +++ b/.github/workflows/R-CMD-check.yaml @@ -2,9 +2,9 @@ # Need help debugging build failures? Start at https://github.com/r-lib/actions#where-to-find-help on: push: - branches: [main, master, devel] + branches: [main, master, test] pull_request: - branches: [main, master, devel] + branches: [main, master, test] name: R-CMD-check diff --git a/.github/workflows/rhub.yaml b/.github/workflows/rhub.yaml new file mode 100644 index 00000000..74ec7b05 --- /dev/null +++ b/.github/workflows/rhub.yaml @@ -0,0 +1,95 @@ +# R-hub's generic GitHub Actions workflow file. It's canonical location is at +# https://github.com/r-hub/actions/blob/v1/workflows/rhub.yaml +# You can update this file to a newer version using the rhub2 package: +# +# rhub::rhub_setup() +# +# It is unlikely that you need to modify this file manually. + +name: R-hub +run-name: "${{ github.event.inputs.id }}: ${{ github.event.inputs.name || format('Manually run by {0}', github.triggering_actor) }}" + +on: + workflow_dispatch: + inputs: + config: + description: 'A comma separated list of R-hub platforms to use.' + type: string + default: 'linux,windows,macos' + name: + description: 'Run name. You can leave this empty now.' + type: string + id: + description: 'Unique ID. You can leave this empty now.' + type: string + +jobs: + + setup: + runs-on: ubuntu-latest + outputs: + containers: ${{ steps.rhub-setup.outputs.containers }} + platforms: ${{ steps.rhub-setup.outputs.platforms }} + + steps: + # NO NEED TO CHECKOUT HERE + - uses: r-hub/actions/setup@v1 + with: + config: ${{ github.event.inputs.config }} + id: rhub-setup + + linux-containers: + needs: setup + if: ${{ needs.setup.outputs.containers != '[]' }} + runs-on: ubuntu-latest + name: ${{ matrix.config.label }} + strategy: + fail-fast: false + matrix: + config: ${{ fromJson(needs.setup.outputs.containers) }} + container: + image: ${{ matrix.config.container }} + + steps: + - uses: r-hub/actions/checkout@v1 + - uses: r-hub/actions/platform-info@v1 + with: + token: ${{ secrets.RHUB_TOKEN }} + job-config: ${{ matrix.config.job-config }} + - uses: r-hub/actions/setup-deps@v1 + with: + token: ${{ secrets.RHUB_TOKEN }} + job-config: ${{ matrix.config.job-config }} + - uses: r-hub/actions/run-check@v1 + with: + token: ${{ secrets.RHUB_TOKEN }} + job-config: ${{ matrix.config.job-config }} + + other-platforms: + needs: setup + if: ${{ needs.setup.outputs.platforms != '[]' }} + runs-on: ${{ matrix.config.os }} + name: ${{ matrix.config.label }} + strategy: + fail-fast: false + matrix: + config: ${{ fromJson(needs.setup.outputs.platforms) }} + + steps: + - uses: r-hub/actions/checkout@v1 + - uses: r-hub/actions/setup-r@v1 + with: + job-config: ${{ matrix.config.job-config }} + token: ${{ secrets.RHUB_TOKEN }} + - uses: r-hub/actions/platform-info@v1 + with: + token: ${{ secrets.RHUB_TOKEN }} + job-config: ${{ matrix.config.job-config }} + - uses: r-hub/actions/setup-deps@v1 + with: + job-config: ${{ matrix.config.job-config }} + token: ${{ secrets.RHUB_TOKEN }} + - uses: r-hub/actions/run-check@v1 + with: + job-config: ${{ matrix.config.job-config }} + token: ${{ secrets.RHUB_TOKEN }} diff --git a/DESCRIPTION b/DESCRIPTION index 
357eaa29..63ec48dc 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -17,11 +17,11 @@ Description: Test published summary statistics for consistency Heathers and Brown, 2019, ). The package also provides infrastructure for implementing new error detection techniques. -License: GPL (>= 3) +License: MIT + file LICENSE Encoding: UTF-8 LazyData: true Roxygen: list(markdown = TRUE) -RoxygenNote: 7.3.1 +RoxygenNote: 7.3.2 Imports: cli, corrr, @@ -33,7 +33,7 @@ Imports: magrittr, methods, purrr, - rlang (>= 1.0.2), + rlang (>= 1.1.0), stats, stringr, tibble, @@ -80,12 +80,16 @@ Collate: 'duplicate-count.R' 'grim-granularity.R' 'grim-map-debug.R' + 'grim-map-seq-debug.R' 'grim-map-seq.R' 'grim-map-total-n.R' 'grim-plot.R' 'grim-stats.R' 'grimmer-map-seq.R' 'grimmer-map-total-n.R' + 'grimmer-rsprite2.R' + 'import-standalone-obj-type.R' + 'import-standalone-types-check.R' 'metadata.R' 'method-audit-seq.R' 'method-audit-total-n.R' diff --git a/LICENSE b/LICENSE new file mode 100644 index 00000000..797f1881 --- /dev/null +++ b/LICENSE @@ -0,0 +1,2 @@ +YEAR: 2024 +COPYRIGHT HOLDER: scrutiny authors diff --git a/LICENSE.md b/LICENSE.md index 175443ce..47604350 100644 --- a/LICENSE.md +++ b/LICENSE.md @@ -1,595 +1,21 @@ -GNU General Public License -========================== - -_Version 3, 29 June 2007_ -_Copyright © 2007 Free Software Foundation, Inc. <>_ - -Everyone is permitted to copy and distribute verbatim copies of this license -document, but changing it is not allowed. - -## Preamble - -The GNU General Public License is a free, copyleft license for software and other -kinds of works. - -The licenses for most software and other practical works are designed to take away -your freedom to share and change the works. By contrast, the GNU General Public -License is intended to guarantee your freedom to share and change all versions of a -program--to make sure it remains free software for all its users. We, the Free -Software Foundation, use the GNU General Public License for most of our software; it -applies also to any other work released this way by its authors. You can apply it to -your programs, too. - -When we speak of free software, we are referring to freedom, not price. Our General -Public Licenses are designed to make sure that you have the freedom to distribute -copies of free software (and charge for them if you wish), that you receive source -code or can get it if you want it, that you can change the software or use pieces of -it in new free programs, and that you know you can do these things. - -To protect your rights, we need to prevent others from denying you these rights or -asking you to surrender the rights. Therefore, you have certain responsibilities if -you distribute copies of the software, or if you modify it: responsibilities to -respect the freedom of others. - -For example, if you distribute copies of such a program, whether gratis or for a fee, -you must pass on to the recipients the same freedoms that you received. You must make -sure that they, too, receive or can get the source code. And you must show them these -terms so they know their rights. - -Developers that use the GNU GPL protect your rights with two steps: **(1)** assert -copyright on the software, and **(2)** offer you this License giving you legal permission -to copy, distribute and/or modify it. - -For the developers' and authors' protection, the GPL clearly explains that there is -no warranty for this free software. 
For both users' and authors' sake, the GPL -requires that modified versions be marked as changed, so that their problems will not -be attributed erroneously to authors of previous versions. - -Some devices are designed to deny users access to install or run modified versions of -the software inside them, although the manufacturer can do so. This is fundamentally -incompatible with the aim of protecting users' freedom to change the software. The -systematic pattern of such abuse occurs in the area of products for individuals to -use, which is precisely where it is most unacceptable. Therefore, we have designed -this version of the GPL to prohibit the practice for those products. If such problems -arise substantially in other domains, we stand ready to extend this provision to -those domains in future versions of the GPL, as needed to protect the freedom of -users. - -Finally, every program is threatened constantly by software patents. States should -not allow patents to restrict development and use of software on general-purpose -computers, but in those that do, we wish to avoid the special danger that patents -applied to a free program could make it effectively proprietary. To prevent this, the -GPL assures that patents cannot be used to render the program non-free. - -The precise terms and conditions for copying, distribution and modification follow. - -## TERMS AND CONDITIONS - -### 0. Definitions - -“This License” refers to version 3 of the GNU General Public License. - -“Copyright” also means copyright-like laws that apply to other kinds of -works, such as semiconductor masks. - -“The Program” refers to any copyrightable work licensed under this -License. Each licensee is addressed as “you”. “Licensees” and -“recipients” may be individuals or organizations. - -To “modify” a work means to copy from or adapt all or part of the work in -a fashion requiring copyright permission, other than the making of an exact copy. The -resulting work is called a “modified version” of the earlier work or a -work “based on” the earlier work. - -A “covered work” means either the unmodified Program or a work based on -the Program. - -To “propagate” a work means to do anything with it that, without -permission, would make you directly or secondarily liable for infringement under -applicable copyright law, except executing it on a computer or modifying a private -copy. Propagation includes copying, distribution (with or without modification), -making available to the public, and in some countries other activities as well. - -To “convey” a work means any kind of propagation that enables other -parties to make or receive copies. Mere interaction with a user through a computer -network, with no transfer of a copy, is not conveying. - -An interactive user interface displays “Appropriate Legal Notices” to the -extent that it includes a convenient and prominently visible feature that **(1)** -displays an appropriate copyright notice, and **(2)** tells the user that there is no -warranty for the work (except to the extent that warranties are provided), that -licensees may convey the work under this License, and how to view a copy of this -License. If the interface presents a list of user commands or options, such as a -menu, a prominent item in the list meets this criterion. - -### 1. Source Code - -The “source code” for a work means the preferred form of the work for -making modifications to it. “Object code” means any non-source form of a -work. 
- -A “Standard Interface” means an interface that either is an official -standard defined by a recognized standards body, or, in the case of interfaces -specified for a particular programming language, one that is widely used among -developers working in that language. - -The “System Libraries” of an executable work include anything, other than -the work as a whole, that **(a)** is included in the normal form of packaging a Major -Component, but which is not part of that Major Component, and **(b)** serves only to -enable use of the work with that Major Component, or to implement a Standard -Interface for which an implementation is available to the public in source code form. -A “Major Component”, in this context, means a major essential component -(kernel, window system, and so on) of the specific operating system (if any) on which -the executable work runs, or a compiler used to produce the work, or an object code -interpreter used to run it. - -The “Corresponding Source” for a work in object code form means all the -source code needed to generate, install, and (for an executable work) run the object -code and to modify the work, including scripts to control those activities. However, -it does not include the work's System Libraries, or general-purpose tools or -generally available free programs which are used unmodified in performing those -activities but which are not part of the work. For example, Corresponding Source -includes interface definition files associated with source files for the work, and -the source code for shared libraries and dynamically linked subprograms that the work -is specifically designed to require, such as by intimate data communication or -control flow between those subprograms and other parts of the work. - -The Corresponding Source need not include anything that users can regenerate -automatically from other parts of the Corresponding Source. - -The Corresponding Source for a work in source code form is that same work. - -### 2. Basic Permissions - -All rights granted under this License are granted for the term of copyright on the -Program, and are irrevocable provided the stated conditions are met. This License -explicitly affirms your unlimited permission to run the unmodified Program. The -output from running a covered work is covered by this License only if the output, -given its content, constitutes a covered work. This License acknowledges your rights -of fair use or other equivalent, as provided by copyright law. - -You may make, run and propagate covered works that you do not convey, without -conditions so long as your license otherwise remains in force. You may convey covered -works to others for the sole purpose of having them make modifications exclusively -for you, or provide you with facilities for running those works, provided that you -comply with the terms of this License in conveying all material for which you do not -control copyright. Those thus making or running the covered works for you must do so -exclusively on your behalf, under your direction and control, on terms that prohibit -them from making any copies of your copyrighted material outside their relationship -with you. - -Conveying under any other circumstances is permitted solely under the conditions -stated below. Sublicensing is not allowed; section 10 makes it unnecessary. - -### 3. 
Protecting Users' Legal Rights From Anti-Circumvention Law - -No covered work shall be deemed part of an effective technological measure under any -applicable law fulfilling obligations under article 11 of the WIPO copyright treaty -adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention -of such measures. - -When you convey a covered work, you waive any legal power to forbid circumvention of -technological measures to the extent such circumvention is effected by exercising -rights under this License with respect to the covered work, and you disclaim any -intention to limit operation or modification of the work as a means of enforcing, -against the work's users, your or third parties' legal rights to forbid circumvention -of technological measures. - -### 4. Conveying Verbatim Copies - -You may convey verbatim copies of the Program's source code as you receive it, in any -medium, provided that you conspicuously and appropriately publish on each copy an -appropriate copyright notice; keep intact all notices stating that this License and -any non-permissive terms added in accord with section 7 apply to the code; keep -intact all notices of the absence of any warranty; and give all recipients a copy of -this License along with the Program. - -You may charge any price or no price for each copy that you convey, and you may offer -support or warranty protection for a fee. - -### 5. Conveying Modified Source Versions - -You may convey a work based on the Program, or the modifications to produce it from -the Program, in the form of source code under the terms of section 4, provided that -you also meet all of these conditions: - -* **a)** The work must carry prominent notices stating that you modified it, and giving a -relevant date. -* **b)** The work must carry prominent notices stating that it is released under this -License and any conditions added under section 7. This requirement modifies the -requirement in section 4 to “keep intact all notices”. -* **c)** You must license the entire work, as a whole, under this License to anyone who -comes into possession of a copy. This License will therefore apply, along with any -applicable section 7 additional terms, to the whole of the work, and all its parts, -regardless of how they are packaged. This License gives no permission to license the -work in any other way, but it does not invalidate such permission if you have -separately received it. -* **d)** If the work has interactive user interfaces, each must display Appropriate Legal -Notices; however, if the Program has interactive interfaces that do not display -Appropriate Legal Notices, your work need not make them do so. - -A compilation of a covered work with other separate and independent works, which are -not by their nature extensions of the covered work, and which are not combined with -it such as to form a larger program, in or on a volume of a storage or distribution -medium, is called an “aggregate” if the compilation and its resulting -copyright are not used to limit the access or legal rights of the compilation's users -beyond what the individual works permit. Inclusion of a covered work in an aggregate -does not cause this License to apply to the other parts of the aggregate. - -### 6. 
Conveying Non-Source Forms - -You may convey a covered work in object code form under the terms of sections 4 and -5, provided that you also convey the machine-readable Corresponding Source under the -terms of this License, in one of these ways: - -* **a)** Convey the object code in, or embodied in, a physical product (including a -physical distribution medium), accompanied by the Corresponding Source fixed on a -durable physical medium customarily used for software interchange. -* **b)** Convey the object code in, or embodied in, a physical product (including a -physical distribution medium), accompanied by a written offer, valid for at least -three years and valid for as long as you offer spare parts or customer support for -that product model, to give anyone who possesses the object code either **(1)** a copy of -the Corresponding Source for all the software in the product that is covered by this -License, on a durable physical medium customarily used for software interchange, for -a price no more than your reasonable cost of physically performing this conveying of -source, or **(2)** access to copy the Corresponding Source from a network server at no -charge. -* **c)** Convey individual copies of the object code with a copy of the written offer to -provide the Corresponding Source. This alternative is allowed only occasionally and -noncommercially, and only if you received the object code with such an offer, in -accord with subsection 6b. -* **d)** Convey the object code by offering access from a designated place (gratis or for -a charge), and offer equivalent access to the Corresponding Source in the same way -through the same place at no further charge. You need not require recipients to copy -the Corresponding Source along with the object code. If the place to copy the object -code is a network server, the Corresponding Source may be on a different server -(operated by you or a third party) that supports equivalent copying facilities, -provided you maintain clear directions next to the object code saying where to find -the Corresponding Source. Regardless of what server hosts the Corresponding Source, -you remain obligated to ensure that it is available for as long as needed to satisfy -these requirements. -* **e)** Convey the object code using peer-to-peer transmission, provided you inform -other peers where the object code and Corresponding Source of the work are being -offered to the general public at no charge under subsection 6d. - -A separable portion of the object code, whose source code is excluded from the -Corresponding Source as a System Library, need not be included in conveying the -object code work. - -A “User Product” is either **(1)** a “consumer product”, which -means any tangible personal property which is normally used for personal, family, or -household purposes, or **(2)** anything designed or sold for incorporation into a -dwelling. In determining whether a product is a consumer product, doubtful cases -shall be resolved in favor of coverage. For a particular product received by a -particular user, “normally used” refers to a typical or common use of -that class of product, regardless of the status of the particular user or of the way -in which the particular user actually uses, or expects or is expected to use, the -product. A product is a consumer product regardless of whether the product has -substantial commercial, industrial or non-consumer uses, unless such uses represent -the only significant mode of use of the product. 
- -“Installation Information” for a User Product means any methods, -procedures, authorization keys, or other information required to install and execute -modified versions of a covered work in that User Product from a modified version of -its Corresponding Source. The information must suffice to ensure that the continued -functioning of the modified object code is in no case prevented or interfered with -solely because modification has been made. - -If you convey an object code work under this section in, or with, or specifically for -use in, a User Product, and the conveying occurs as part of a transaction in which -the right of possession and use of the User Product is transferred to the recipient -in perpetuity or for a fixed term (regardless of how the transaction is -characterized), the Corresponding Source conveyed under this section must be -accompanied by the Installation Information. But this requirement does not apply if -neither you nor any third party retains the ability to install modified object code -on the User Product (for example, the work has been installed in ROM). - -The requirement to provide Installation Information does not include a requirement to -continue to provide support service, warranty, or updates for a work that has been -modified or installed by the recipient, or for the User Product in which it has been -modified or installed. Access to a network may be denied when the modification itself -materially and adversely affects the operation of the network or violates the rules -and protocols for communication across the network. - -Corresponding Source conveyed, and Installation Information provided, in accord with -this section must be in a format that is publicly documented (and with an -implementation available to the public in source code form), and must require no -special password or key for unpacking, reading or copying. - -### 7. Additional Terms - -“Additional permissions” are terms that supplement the terms of this -License by making exceptions from one or more of its conditions. Additional -permissions that are applicable to the entire Program shall be treated as though they -were included in this License, to the extent that they are valid under applicable -law. If additional permissions apply only to part of the Program, that part may be -used separately under those permissions, but the entire Program remains governed by -this License without regard to the additional permissions. - -When you convey a copy of a covered work, you may at your option remove any -additional permissions from that copy, or from any part of it. (Additional -permissions may be written to require their own removal in certain cases when you -modify the work.) You may place additional permissions on material, added by you to a -covered work, for which you have or can give appropriate copyright permission. 
- -Notwithstanding any other provision of this License, for material you add to a -covered work, you may (if authorized by the copyright holders of that material) -supplement the terms of this License with terms: - -* **a)** Disclaiming warranty or limiting liability differently from the terms of -sections 15 and 16 of this License; or -* **b)** Requiring preservation of specified reasonable legal notices or author -attributions in that material or in the Appropriate Legal Notices displayed by works -containing it; or -* **c)** Prohibiting misrepresentation of the origin of that material, or requiring that -modified versions of such material be marked in reasonable ways as different from the -original version; or -* **d)** Limiting the use for publicity purposes of names of licensors or authors of the -material; or -* **e)** Declining to grant rights under trademark law for use of some trade names, -trademarks, or service marks; or -* **f)** Requiring indemnification of licensors and authors of that material by anyone -who conveys the material (or modified versions of it) with contractual assumptions of -liability to the recipient, for any liability that these contractual assumptions -directly impose on those licensors and authors. - -All other non-permissive additional terms are considered “further -restrictions” within the meaning of section 10. If the Program as you received -it, or any part of it, contains a notice stating that it is governed by this License -along with a term that is a further restriction, you may remove that term. If a -license document contains a further restriction but permits relicensing or conveying -under this License, you may add to a covered work material governed by the terms of -that license document, provided that the further restriction does not survive such -relicensing or conveying. - -If you add terms to a covered work in accord with this section, you must place, in -the relevant source files, a statement of the additional terms that apply to those -files, or a notice indicating where to find the applicable terms. - -Additional terms, permissive or non-permissive, may be stated in the form of a -separately written license, or stated as exceptions; the above requirements apply -either way. - -### 8. Termination - -You may not propagate or modify a covered work except as expressly provided under -this License. Any attempt otherwise to propagate or modify it is void, and will -automatically terminate your rights under this License (including any patent licenses -granted under the third paragraph of section 11). - -However, if you cease all violation of this License, then your license from a -particular copyright holder is reinstated **(a)** provisionally, unless and until the -copyright holder explicitly and finally terminates your license, and **(b)** permanently, -if the copyright holder fails to notify you of the violation by some reasonable means -prior to 60 days after the cessation. - -Moreover, your license from a particular copyright holder is reinstated permanently -if the copyright holder notifies you of the violation by some reasonable means, this -is the first time you have received notice of violation of this License (for any -work) from that copyright holder, and you cure the violation prior to 30 days after -your receipt of the notice. - -Termination of your rights under this section does not terminate the licenses of -parties who have received copies or rights from you under this License. 
If your -rights have been terminated and not permanently reinstated, you do not qualify to -receive new licenses for the same material under section 10. - -### 9. Acceptance Not Required for Having Copies - -You are not required to accept this License in order to receive or run a copy of the -Program. Ancillary propagation of a covered work occurring solely as a consequence of -using peer-to-peer transmission to receive a copy likewise does not require -acceptance. However, nothing other than this License grants you permission to -propagate or modify any covered work. These actions infringe copyright if you do not -accept this License. Therefore, by modifying or propagating a covered work, you -indicate your acceptance of this License to do so. - -### 10. Automatic Licensing of Downstream Recipients - -Each time you convey a covered work, the recipient automatically receives a license -from the original licensors, to run, modify and propagate that work, subject to this -License. You are not responsible for enforcing compliance by third parties with this -License. - -An “entity transaction” is a transaction transferring control of an -organization, or substantially all assets of one, or subdividing an organization, or -merging organizations. If propagation of a covered work results from an entity -transaction, each party to that transaction who receives a copy of the work also -receives whatever licenses to the work the party's predecessor in interest had or -could give under the previous paragraph, plus a right to possession of the -Corresponding Source of the work from the predecessor in interest, if the predecessor -has it or can get it with reasonable efforts. - -You may not impose any further restrictions on the exercise of the rights granted or -affirmed under this License. For example, you may not impose a license fee, royalty, -or other charge for exercise of rights granted under this License, and you may not -initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging -that any patent claim is infringed by making, using, selling, offering for sale, or -importing the Program or any portion of it. - -### 11. Patents - -A “contributor” is a copyright holder who authorizes use under this -License of the Program or a work on which the Program is based. The work thus -licensed is called the contributor's “contributor version”. - -A contributor's “essential patent claims” are all patent claims owned or -controlled by the contributor, whether already acquired or hereafter acquired, that -would be infringed by some manner, permitted by this License, of making, using, or -selling its contributor version, but do not include claims that would be infringed -only as a consequence of further modification of the contributor version. For -purposes of this definition, “control” includes the right to grant patent -sublicenses in a manner consistent with the requirements of this License. - -Each contributor grants you a non-exclusive, worldwide, royalty-free patent license -under the contributor's essential patent claims, to make, use, sell, offer for sale, -import and otherwise run, modify and propagate the contents of its contributor -version. - -In the following three paragraphs, a “patent license” is any express -agreement or commitment, however denominated, not to enforce a patent (such as an -express permission to practice a patent or covenant not to sue for patent -infringement). 
To “grant” such a patent license to a party means to make -such an agreement or commitment not to enforce a patent against the party. - -If you convey a covered work, knowingly relying on a patent license, and the -Corresponding Source of the work is not available for anyone to copy, free of charge -and under the terms of this License, through a publicly available network server or -other readily accessible means, then you must either **(1)** cause the Corresponding -Source to be so available, or **(2)** arrange to deprive yourself of the benefit of the -patent license for this particular work, or **(3)** arrange, in a manner consistent with -the requirements of this License, to extend the patent license to downstream -recipients. “Knowingly relying” means you have actual knowledge that, but -for the patent license, your conveying the covered work in a country, or your -recipient's use of the covered work in a country, would infringe one or more -identifiable patents in that country that you have reason to believe are valid. - -If, pursuant to or in connection with a single transaction or arrangement, you -convey, or propagate by procuring conveyance of, a covered work, and grant a patent -license to some of the parties receiving the covered work authorizing them to use, -propagate, modify or convey a specific copy of the covered work, then the patent -license you grant is automatically extended to all recipients of the covered work and -works based on it. - -A patent license is “discriminatory” if it does not include within the -scope of its coverage, prohibits the exercise of, or is conditioned on the -non-exercise of one or more of the rights that are specifically granted under this -License. You may not convey a covered work if you are a party to an arrangement with -a third party that is in the business of distributing software, under which you make -payment to the third party based on the extent of your activity of conveying the -work, and under which the third party grants, to any of the parties who would receive -the covered work from you, a discriminatory patent license **(a)** in connection with -copies of the covered work conveyed by you (or copies made from those copies), or **(b)** -primarily for and in connection with specific products or compilations that contain -the covered work, unless you entered into that arrangement, or that patent license -was granted, prior to 28 March 2007. - -Nothing in this License shall be construed as excluding or limiting any implied -license or other defenses to infringement that may otherwise be available to you -under applicable patent law. - -### 12. No Surrender of Others' Freedom - -If conditions are imposed on you (whether by court order, agreement or otherwise) -that contradict the conditions of this License, they do not excuse you from the -conditions of this License. If you cannot convey a covered work so as to satisfy -simultaneously your obligations under this License and any other pertinent -obligations, then as a consequence you may not convey it at all. For example, if you -agree to terms that obligate you to collect a royalty for further conveying from -those to whom you convey the Program, the only way you could satisfy both those terms -and this License would be to refrain entirely from conveying the Program. - -### 13. 
Use with the GNU Affero General Public License - -Notwithstanding any other provision of this License, you have permission to link or -combine any covered work with a work licensed under version 3 of the GNU Affero -General Public License into a single combined work, and to convey the resulting work. -The terms of this License will continue to apply to the part which is the covered -work, but the special requirements of the GNU Affero General Public License, section -13, concerning interaction through a network will apply to the combination as such. - -### 14. Revised Versions of this License - -The Free Software Foundation may publish revised and/or new versions of the GNU -General Public License from time to time. Such new versions will be similar in spirit -to the present version, but may differ in detail to address new problems or concerns. - -Each version is given a distinguishing version number. If the Program specifies that -a certain numbered version of the GNU General Public License “or any later -version” applies to it, you have the option of following the terms and -conditions either of that numbered version or of any later version published by the -Free Software Foundation. If the Program does not specify a version number of the GNU -General Public License, you may choose any version ever published by the Free -Software Foundation. - -If the Program specifies that a proxy can decide which future versions of the GNU -General Public License can be used, that proxy's public statement of acceptance of a -version permanently authorizes you to choose that version for the Program. - -Later license versions may give you additional or different permissions. However, no -additional obligations are imposed on any author or copyright holder as a result of -your choosing to follow a later version. - -### 15. Disclaimer of Warranty - -THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. -EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES -PROVIDE THE PROGRAM “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER -EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF -MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE -QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE -DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. - -### 16. Limitation of Liability - -IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY -COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS -PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, -INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE -PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE -OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE -WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE -POSSIBILITY OF SUCH DAMAGES. - -### 17. Interpretation of Sections 15 and 16 - -If the disclaimer of warranty and limitation of liability provided above cannot be -given local legal effect according to their terms, reviewing courts shall apply local -law that most closely approximates an absolute waiver of all civil liability in -connection with the Program, unless a warranty or assumption of liability accompanies -a copy of the Program in return for a fee. 
-
-_END OF TERMS AND CONDITIONS_
-
-## How to Apply These Terms to Your New Programs
-
-If you develop a new program, and you want it to be of the greatest possible use to
-the public, the best way to achieve this is to make it free software which everyone
-can redistribute and change under these terms.
-
-To do so, attach the following notices to the program. It is safest to attach them
-to the start of each source file to most effectively state the exclusion of warranty;
-and each file should have at least the “copyright” line and a pointer to
-where the full notice is found.
-
-    <one line to give the program's name and a brief idea of what it does.>
-    Copyright (C) <year>  <name of author>
-
-    This program is free software: you can redistribute it and/or modify
-    it under the terms of the GNU General Public License as published by
-    the Free Software Foundation, either version 3 of the License, or
-    (at your option) any later version.
-
-    This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-    GNU General Public License for more details.
-
-    You should have received a copy of the GNU General Public License
-    along with this program. If not, see <https://www.gnu.org/licenses/>.
-
-Also add information on how to contact you by electronic and paper mail.
-
-If the program does terminal interaction, make it output a short notice like this
-when it starts in an interactive mode:
-
-    <program>  Copyright (C) <year>  <name of author>
-    This program comes with ABSOLUTELY NO WARRANTY; for details type 'show w'.
-    This is free software, and you are welcome to redistribute it
-    under certain conditions; type 'show c' for details.
-
-The hypothetical commands `show w` and `show c` should show the appropriate parts of
-the General Public License. Of course, your program's commands might be different;
-for a GUI interface, you would use an “about box”.
-
-You should also get your employer (if you work as a programmer) or school, if any, to
-sign a “copyright disclaimer” for the program, if necessary. For more
-information on this, and how to apply and follow the GNU GPL, see
-<https://www.gnu.org/licenses/>.
-
-The GNU General Public License does not permit incorporating your program into
-proprietary programs. If your program is a subroutine library, you may consider it
-more useful to permit linking proprietary applications with the library. If this is
-what you want to do, use the GNU Lesser General Public License instead of this
-License. But first, please read
-<https://www.gnu.org/licenses/why-not-lgpl.html>.
+# MIT License
+
+Copyright (c) 2024 scrutiny authors
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/NAMESPACE b/NAMESPACE
index 98da84b3..1962256c 100644
--- a/NAMESPACE
+++ b/NAMESPACE
@@ -18,7 +18,6 @@ export(as_label)
 export(as_name)
 export(audit)
 export(audit_cols_minimal)
-export(audit_list)
 export(audit_seq)
 export(audit_total_n)
 export(before_parens)
@@ -51,6 +50,7 @@ export(grim_map)
 export(grim_map_seq)
 export(grim_map_total_n)
 export(grim_plot)
+export(grim_probability)
 export(grim_ratio)
 export(grim_ratio_upper)
 export(grim_total)
diff --git a/NEWS.md b/NEWS.md
index 0b26ef7b..12437ff3 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,3 +1,45 @@
+# scrutiny 0.5.0
+
+The package is now released under the MIT license.
+
+## Breaking changes
+
+- The `ratio` column in the output of `grim_map()` and `grim_map_seq()` was replaced by a `probability` column. This means:
+
+  - Numerically, the only difference is that `probability` is zero whenever `ratio` was negative.
+
+  - Conceptually, it is much easier to interpret: it is the probability that a mean or percentage, ostensibly computed from integer data and reported with a given number of decimal places but otherwise random, is GRIM-inconsistent with the reported sample size.
+
+  For example, `probability` is `0.6` for a mean of `1.23` and a sample size of `40`. The same is true for any other mean with two decimal places. Thus, a randomly chosen mean with two decimal places, ostensibly derived from integer data, has a 0.6 probability of being GRIM-inconsistent with a sample size of 40.
+
+- In the functions around `grim_ratio()`, the `x` argument must now be a string. This is consistent with `grim_map()`, `unround()`, etc., and it prevents the erroneous results that could previously occur when trailing zeros were omitted.
+
+## New features
+
+- The `probability` column (see above) is created by a new function, `grim_probability()`.
+
+## Lifecycle updates
+
+### Deprecated
+
+- As a consequence of the above, the `show_prob` argument of `grim_map()` is now deprecated and will be removed in a future version. It no longer has any effect.
+
+- `grim_ratio_upper()` is deprecated and will be removed in a future version. It no longer seems very interesting (and likely never was), especially now that the GRIM ratio in general has taken a backseat.
+
+- All 15 (!) functions around `is_subset_of()` are deprecated and will be removed in a future version. In truth, they were always poorly written and well out of scope for scrutiny.
+
+### Removed
+
+All of these had been deprecated since scrutiny 0.3.0:
+
+- `audit_list()` was removed.
+
+- The `sep` argument in `restore_zeros()` and `restore_zeros_df()` was removed.
+
+- The `numeric_only` argument in `duplicate_count()` and `duplicate_detect()` was removed.
+
+- The `na.rm` argument in `duplicate_count_colpair()` was removed.
+
 # scrutiny 0.4.0
 
 This version brings major performance improvements. Furthermore:
diff --git a/R/audit.R b/R/audit.R
index b8d65c28..51a4c381 100644
--- a/R/audit.R
+++ b/R/audit.R
@@ -47,39 +47,6 @@ audit <- function(data) {
-#' Summaries in list form
-#' @description `r lifecycle::badge("deprecated")`
-#'
-#' `audit_list()` is deprecated. Use `audit()` instead.
-#'
-#' It was meant to be used when `audit()` would have returned tibbles that
-#' were too wide to be read. However, the output format for `audit()` has now
-#' been overhauled, there is no longer a need for `audit_list()`.
-#'
-#' @return Named list of `audit()`'s results.
-#'
-#' @keywords internal
-#'
-#' @export
-#'
-#' @examples
-#' # Only use `audit()` instead:
-#' pigs1 %>%
-#'   grim_map() %>%
-#'   audit()
-
-
-audit_list <- function(data) {
-  lifecycle::deprecate_warn(
-    when = "0.3.0",
-    what = "audit_list()",
-    with = "audit()"
-  )
-  as.list(audit(data))
-}
-
-
-
 #' Summarize output of sequence mappers and total-n mappers
 #'
 #' @description `audit_seq()` and `audit_total_n()` summarize the results of
@@ -150,9 +117,6 @@ audit_seq <- function(data) {
     vapply(nrow, integer(1L), USE.NAMES = FALSE) %>%
     unname()
 
-  hits_positions <- df_list %>%
-    purrr::map(function(x) which(x$consistency))
-
   if (is.null(dim(data))) {
     fun <- class(data)[stringr::str_detect(class(data), "_map_seq$")]
     fun <- fun[fun != "scr_map_seq"]
diff --git a/R/before-inside-parens.R b/R/before-inside-parens.R
index 3991cd5f..a44f91d4 100644
--- a/R/before-inside-parens.R
+++ b/R/before-inside-parens.R
@@ -11,6 +11,22 @@ check_length_parens_sep <- function(sep) {
 }
 
+
+#' Match the `sep` keyword to actual separators
+#'
+#' @description `translate_length1_sep_keywords()` is called within
+#' `split_by_parens()` to translate the legal keywords `"parens"`, `"brackets"`,
+#' or `"braces"` into the characters they describe.
+#'
+#' A length-2 `sep` object is returned as is because it is meant to contain
+#' actual (custom) separators, not a keyword. If `sep` is neither length 2 nor
+#' any of the keywords above, an error is thrown.
+#'
+#' @param sep String (length 1 or 2).
+#'
+#' @return String (length 1 or 2).
+#'
+#' @noRd
 translate_length1_sep_keywords <- function(sep) {
   check_length_parens_sep(sep)
   if (length(sep) == 2L) {
diff --git a/R/debit-map.R b/R/debit-map.R
index da169820..2b7de2e3 100644
--- a/R/debit-map.R
+++ b/R/debit-map.R
@@ -3,8 +3,8 @@
 #' Apply DEBIT to many cases
 #'
 #' @description Call `debit_map()` to use DEBIT on multiple combinations of
-#' mean, standard deviation, and sample size of binary distributions. Mapping
-#' function for [`debit()`].
+#' mean, sample standard deviation, and sample size of binary distributions.
+#' Mapping function for [`debit()`].
 #'
 #' For summary statistics, call [`audit()`] on the results.
 #'
diff --git a/R/debit.R b/R/debit.R
index ca539444..c9a27a1d 100644
--- a/R/debit.R
+++ b/R/debit.R
@@ -23,7 +23,7 @@ debit_scalar <- function(x, sd, n,
 #' The DEBIT (descriptive binary) test
 #'
 #' @description `debit()` tests summaries of binary data for consistency: If the
-#' mean and the standard deviation of binary data are given, are they
+#' mean and the sample standard deviation of binary data are given, are they
 #' consistent with the reported sample size?
 #'
 #' The function is vectorized, but it is recommended to use [`debit_map()`]
diff --git a/R/duplicate-count-colpair.R b/R/duplicate-count-colpair.R
index 0c3b7191..eef2df21 100644
--- a/R/duplicate-count-colpair.R
+++ b/R/duplicate-count-colpair.R
@@ -27,7 +27,6 @@ dup_count_pairwise <- function(x, y) {
 #' @param show_rates Logical. If `TRUE` (the default), adds columns `rate_x` and
 #'   `rate_y`. See value section. Set `show_rates` to `FALSE` for higher
 #'   performance.
-#' @param na.rm [[Deprecated]] Missing values are never counted in any case.
#' @return A tibble (data frame) with these columns – #' - `x` and `y`: Each line contains a unique combination of `data`'s columns, @@ -70,21 +69,11 @@ dup_count_pairwise <- function(x, y) { # data <- df <- tibble::tibble( # a = c(1, 2, 3, NA, 5), b = c(NA, 3L, 4L, 5L, 6L), c = c(3L, 4L, NA, NA, NA) # ) -# na.rm <- TRUE # ignore <- 3 # show_rates <- TRUE -duplicate_count_colpair <- function(data, ignore = NULL, show_rates = TRUE, - na.rm = deprecated()) { - - if (lifecycle::is_present(na.rm)) { - lifecycle::deprecate_warn( - when = "0.3.0", - what = "duplicate_count_colpair(na.rm)", - details = "Missing values are never counted, so `na.rm` has no effect." - ) - } +duplicate_count_colpair <- function(data, ignore = NULL, show_rates = TRUE) { if (!is.data.frame(data)) { cli::cli_abort("`data` must be a data frame.") diff --git a/R/duplicate-count.R b/R/duplicate-count.R index 4e9aa1f1..e0950a16 100644 --- a/R/duplicate-count.R +++ b/R/duplicate-count.R @@ -15,8 +15,6 @@ #' `"list"`, each `locations` value is a vector of column names, which is #' better for further programming. By default (`"character"`), the column #' names are pasted into a string, which is more readable. -#' @param numeric_only [[Deprecated]] No longer used: All values are coerced to -#' character. #' @return If `x` is a data frame or another named vector, a tibble with four #' columns. If `x` isn't named, only the first two columns appear: @@ -29,12 +27,6 @@ #' The tibble has the `scr_dup_count` class, which is recognized by the #' [`audit()`] generic. -#' @details Don't use `numeric_only`. It no longer has any effect and will be -#' removed in the future. The only reason for this argument was the risk of -#' errors introduced by coercing values to numeric. This is no longer an issue -#' because all values are now coerced to character, which is more appropriate -#' for checking reported statistics. - #' @section Summaries with [`audit()`]: There is an S3 method for the #' [`audit()`] generic, so you can call [`audit()`] following #' `duplicate_count()`. It returns a tibble with summary statistics for the @@ -67,20 +59,10 @@ duplicate_count <- function(x, ignore = NULL, - locations_type = c("character", "list"), - numeric_only = deprecated()) { + locations_type = c("character", "list")) { locations_type <- rlang::arg_match(locations_type) - if (lifecycle::is_present(numeric_only)) { - lifecycle::deprecate_warn( - when = "0.3.0", - what = "duplicate_count(numeric_only)", - details = "It no longer has any effect because all input \ - values are now coerced to character strings." - ) - } - # Convert `x` to a data frame if needed (`x_was_named` will also be checked # further below): x_was_named <- rlang::is_named(x) diff --git a/R/duplicate-detect.R b/R/duplicate-detect.R index 3ad04816..2652c4ae 100644 --- a/R/duplicate-detect.R +++ b/R/duplicate-detect.R @@ -19,9 +19,6 @@ #' newly created columns. #' @param name_class String. Name of the class which the factory-made function #' will add to the tibble that it returns. -#' @param include_numeric_only_arg Logical. Should the factory-made function -#' have the deprecated `numeric_only = TRUE` argument? Only for compatibility -#' with older versions of scrutiny (i.e., before 0.3.0). Default is `FALSE`. 
#' #' @include utils.R #' @@ -30,37 +27,18 @@ #' @noRd -function_duplicate_cols <- function(code_new_cols, default_end, name_class, - include_numeric_only_arg = FALSE) { +function_duplicate_cols <- function(code_new_cols, default_end, name_class) { code_new_cols <- rlang::enexpr(code_new_cols) - if (include_numeric_only_arg) { - numeric_only_arg <- list(numeric_only = deprecated()) - numeric_only_check <- rlang::expr({ - if (lifecycle::is_present(numeric_only)) { - lifecycle::deprecate_warn( - when = "0.3.0", - what = paste0(name_caller_call(), "(numeric_only)"), - details = "It no longer has any effect because all input \ - values are now coerced to character strings." - ) - } - }) - } else { - numeric_only_arg <- NULL - numeric_only_check <- NULL - } - # --- Start of the manufactured function --- rlang::new_function( args = rlang::pairlist2( - x =, ignore = NULL, colname_end = default_end, !!!numeric_only_arg + x =, ignore = NULL, colname_end = default_end ), body = rlang::expr({ - `!!`(numeric_only_check) # Type checking: if (!rlang::is_vector(x)) { @@ -169,8 +147,6 @@ function_duplicate_cols <- function(code_new_cols, default_end, name_class, #' the test result columns, they will be marked `NA`. #' @param colname_end String. Name ending of the logical test result columns. #' Default is `"dup"`. -#' @param numeric_only [[Deprecated]] No longer used: All values are coerced to -#' character. #' @return A tibble (data frame). It has all the columns from `x`, and to each #' of these columns' right, the corresponding test result column. @@ -224,8 +200,7 @@ duplicate_detect <- function_duplicate_cols( # input values, both from the start forward and from the end backward: code_new_cols = duplicated(x) | duplicated(x, fromLast = TRUE), default_end = "dup", - name_class = "scr_dup_detect", - include_numeric_only_arg = TRUE + name_class = "scr_dup_detect" ) diff --git a/R/function-factory-helpers.R b/R/function-factory-helpers.R index d93443cb..7daae708 100644 --- a/R/function-factory-helpers.R +++ b/R/function-factory-helpers.R @@ -293,51 +293,6 @@ absorb_key_args <- function(data, reported, key_cols_call) { -#' Absorb other arguments (not used) -#' -#' `absorb_other_args()` is intended to work like `absorb_key_args()`, but for -#' any other arguments, and it must be reassigned to `fun`, not to `data`. -#' -#' The function is not currently used because I'm now very skeptical of its -#' utility and even its reliability. It relies on a style of function -#' manipulation that might just be too intricate to use in practice. -#' -#' I originally aimed for `absorb_other_args()` to replace the dots on one level -#' of function calls, thereby enabling their use on another level, but this -#' solution turned out to have been somewhat overengineered. -#' -#' @param fun Function to be applied. -#' @param reported String. Names of the key arguments. -#' -#' @return Function `fun` with modified arguments. -#' -#' @noRd -absorb_other_args <- function(fun, reported) { - - args_excluded <- c("data", "...", reported) - - arg_list <- formals(fun) - arg_list <- arg_list[!(names(arg_list)) %in% args_excluded] - - # Capture any arguments specified by the user of the factory-made function: - call_args <- as.list(rlang::caller_call()) - call_args <- call_args[names(call_args) != ""] - - # Prioritize mapper function defaults before scalar defaults... 
- args_default <- formals(fun)[names(formals(fun)) %in% names(arg_list)] - args_default <- names(args_default) - - formals(fun)[args_default] <- arg_list[args_default] - - # ... and user-specified arguments before mapper defaults: - formals(fun)[names(formals(fun)) %in% names(call_args)] <- - call_args[names(call_args) %in% names(formals(fun))] - - return(fun) -} - - - #' Check that disabled arguments are not specified #' #' If the user of the function factory specified its `.args_disabled` argument, diff --git a/R/function-map-seq.R b/R/function-map-seq.R index fb9805d7..879656a1 100644 --- a/R/function-map-seq.R +++ b/R/function-map-seq.R @@ -56,6 +56,7 @@ function_map_seq_proto <- function(.fun = fun, .var = var, cols_for_testing_names <- colnames(data)[1L:match("consistency", colnames(data)) - 1L] + # Isolate the columns to be tested that are not the current `var` object: cols_for_testing_names_without_var <- cols_for_testing_names[cols_for_testing_names != var] @@ -314,7 +315,8 @@ function_map_seq <- function(.fun, .var = Inf, .reported, .name_test, var <- reported } - # Create the lower-level testing function via an internal function factory: + # Create the lower-level testing function via an internal function + # factory: map_seq_proto <- function_map_seq_proto( .fun = fun, .name_test = name_test, @@ -326,8 +328,8 @@ function_map_seq <- function(.fun, .var = Inf, .reported, .name_test, ... ) - # Apply the lower-level function to all user-supplied variables (`var`) and - # all cases reported in `data`, or at least the inconsistent ones: + # Apply the lower-level function to all user-supplied variables (`var`) + # and all cases reported in `data`, or at least the inconsistent ones: out <- purrr::map(var, ~ map_seq_proto(data = data, var = .x)) # Remove list-elements that are `NULL`, then check for an early return: @@ -345,8 +347,8 @@ function_map_seq <- function(.fun, .var = Inf, .reported, .name_test, return(tibble::tibble()) } - # Repeat the `var` strings so that they form a vector of the length that is - # the row number of `out`, and that can therefore be added to `out`: + # Repeat the `var` strings so that they form a vector of the length that + # is the row number of `out`, and that can therefore be added to `out`: nrow_out <- vapply(out, nrow, integer(1L), USE.NAMES = FALSE) var <- var %>% purrr::map2(nrow_out, rep) %>% @@ -374,9 +376,9 @@ function_map_seq <- function(.fun, .var = Inf, .reported, .name_test, out <- add_class(out, classes_seq) - # Make sure the "rounding class" (i.e., `"scr_rounding_*"`) has the correct - # value. As this is not naturally guaranteed as in `*_map()` functions, it - # must be done by hand: + # Make sure the "rounding class" (i.e., `"scr_rounding_*"`) has the + # correct value. As this is not naturally guaranteed as in `*_map()` + # functions, it must be done by hand: dots <- rlang::enexprs(...) if (length(dots$rounding) > 0L) { class(out)[stringr::str_detect(class(out), "^scr_rounding_")] <- diff --git a/R/function-map-total-n.R b/R/function-map-total-n.R index 0b20e21d..e82c2107 100644 --- a/R/function-map-total-n.R +++ b/R/function-map-total-n.R @@ -236,10 +236,6 @@ function_map_total_n_proto <- function(.fun, .reported, .reported_orig, .dir, #' .name_test = "GRIM" #' ) -# TODO: ADD LINKS IN THE DOCS VIA SQUARE BRACKETS! GO THROUGH THE FILES -# ALPHABETICALLY, STARTING WITH: function-map.R ALSO CHANGE -# `write_doc_factory_map_conventions()` TO INCLUDE BRACKETS IN THE OUTPUT, THEN -# RE-INSERT THE OUTPUT INTO THE DOCS OF THE THREE FACTORIES! 
# # Full example inputs:
#
diff --git a/R/grim-granularity.R b/R/grim-granularity.R
index 0266b40e..4f89aece 100644
--- a/R/grim-granularity.R
+++ b/R/grim-granularity.R
@@ -14,9 +14,7 @@
 #' clarity, they are presented as distinct.
 #'
 #' The output of `grim_items()` should be whole numbers, because scale items
-#' have a granularity of 1. If they differ from the next whole number by more
-#' than a numeric `tolerance` (which is determined by the argument by that
-#' name), a warning will be shown.
+#' have a granularity of 1.
 #'
 #' It would be wrong to determine a scale's granularity from the minimal
 #' distance between two values in a given distribution. This would only
@@ -28,9 +26,10 @@
 #' @param items Numeric. Number of items composing the scale. Default is 1,
 #' which will hold for most non-Likert scales.
 #' @param gran Numeric. Granularity.
-#' @param tolerance Numeric. In `grim_items()`, `tolerance` is the maximal
-#' amount by which results may differ from being whole numbers. If they exceed
-#' that amount, a warning will be shown.
+#' @param tolerance Numeric. Any difference between `x` and a truncated version
+#' of `x` less than `tolerance` (in absolute value) will be ignored. The
+#' default is close to `1 / (10 ^ 8)`. This avoids errors due to spurious
+#' precision in floating-point arithmetic.
 #'
 #' @include utils.R
 #'
@@ -61,7 +60,7 @@
 grim_granularity <- function(n, items = 1) {
-  return(1 / (n * items))
+  1 / (n * items)
 }
@@ -69,8 +68,9 @@ grim_granularity <- function(n, items = 1) {
 #' @export

 grim_items <- function(n, gran, tolerance = .Machine$double.eps^0.5) {
+
   out <- 1 / (n * gran)
-  out_is_whole <- is_whole_number(out, tolerance = tolerance)
+  out_is_whole <- is_whole_number(out, tolerance)

   if (all(out_is_whole)) {
     return(out)
diff --git a/R/grim-map-seq-debug.R b/R/grim-map-seq-debug.R
new file mode 100644
index 00000000..b2489131
--- /dev/null
+++ b/R/grim-map-seq-debug.R
@@ -0,0 +1,92 @@
+
+# TODO: Not really debugging but just exploring if `case` could be a factor,
+# with the string versions of the original values being the levels, and the
+# current values -- the integers -- receding to the background. This would make
+# `case` more informative to look at, and it would lift the requirement for
+# `dispersion` to be a linear sequence in order for `audit_seq()` to work.
+
+# # Example data:
+# data <- pigs1
+# x <- NULL
+# n <- NULL
+# var <- Inf
+# dispersion <- 1:5
+# out_min <- "auto"
+# out_max <- NULL
+# include_reported <- FALSE
+# include_consistent <- FALSE
+
+function (data, x = NULL, n = NULL, var = Inf, dispersion = 1:5,
+          out_min = "auto", out_max = NULL, include_reported = FALSE,
+          include_consistent = FALSE, ...)
+{
+  name_test <- "GRIM"
+  name_fun <- "grim_map"
+  reported <- c("x", "n")
+  name_class <- NULL
+  args_disabled <- NULL
+  fun <- grim_map # simplifying here
+  data <- absorb_key_args(data, reported)
+  check_factory_dots(fun, name_fun, ...)
+  args_excluded <- c(reported, args_disabled)
+  arg_list <- call_arg_list()
+  arg_list <- arg_list[!names(arg_list) %in% args_excluded]
+  check_mapper_input_colnames(data, reported)
+  check_consistency_not_in_colnames(data, name_test)
+  check_tibble(data)
+  data <- fun(data, ...)
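+  # At this point, `data` holds the regular `grim_map()` output: one row per
+  # value set, with the logical `consistency` column that the filtering step
+  # just below relies on.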
+ if (!include_consistent) { + data <- data[!data$consistency, ] + } + if (all(is.infinite(var))) { + var <- reported + } + + # Enter `function_map_seq_proto()` with these variables assigned: + var <- var[1] + # .fun <- fun + # .name_test <- name_test + # .name_class <- name_class + # .dispersion <- dispersion + # .out_min <- out_min + # .out_max <- out_max + # .include_reported <- include_reported + + map_seq_proto <- function_map_seq_proto(.fun = fun, .name_test = name_test, + .name_class = name_class, .dispersion = dispersion, .out_min = out_min, + .out_max = out_max, .include_reported = include_reported, + ...) + out <- purrr::map(var, ~map_seq_proto(data = data, var = .x)) + out[vapply(out, is.null, logical(1L))] <- NULL + if (length(out) == 0L) { + msg_setting <- if (interactive()) { + "`include_consistent = TRUE`" + } else { + "unchecking \"Inconsistent cases only\"" + } + cli::cli_warn(c(`!` = "No inconsistent cases to disperse from.", + i = "Try {msg_setting} to disperse from consistent cases, as well.")) + return(tibble::tibble()) + } + nrow_out <- vapply(out, nrow, integer(1L), USE.NAMES = FALSE) + var <- var %>% purrr::map2(nrow_out, rep) %>% unlist(use.names = FALSE) + out <- out %>% dplyr::bind_rows() %>% dplyr::mutate(var, + n = as.integer(n)) + class_dispersion_ascending <- if (is_seq_ascending(dispersion)) { + NULL + } else { + "scr_map_seq_disp_nonlinear" + } + classes_seq <- c("scr_map_seq", paste0("scr_", tolower(name_test), + "_map_seq"), class_dispersion_ascending) + out <- add_class(out, classes_seq) + dots <- rlang::enexprs(...) + if (length(dots$rounding) > 0L) { + class(out)[stringr::str_detect(class(out), "^scr_rounding_")] <- paste0("scr_rounding_", + dots$rounding) + } + if (!is.list(out$consistency)) { + return(out) + } + `$<-`(out, "consistency", unlist(out$consistency, use.names = FALSE)) +} diff --git a/R/grim-map.R b/R/grim-map.R index 9de38986..ec624913 100644 --- a/R/grim-map.R +++ b/R/grim-map.R @@ -11,7 +11,7 @@ #' Display intermediary numbers from GRIM-testing in columns by setting #' `show_rec` to `TRUE`. #' -#' For summary statistics, call `[audit()`] on the results. +#' For summary statistics, call [`audit()`] on the results. #' #' @param data Data frame with columns `x`, `n`, and optionally `items` (see #' documentation for [`grim()`]. By default, any other columns in `data` will @@ -31,9 +31,9 @@ #' @param show_rec Logical. If set to `TRUE`, the reconstructed numbers from #' GRIM-testing are shown as columns. See section *Reconstructed numbers* #' below. Default is `FALSE`. -#' @param show_prob Logical. If set to `TRUE`, adds a `prob` column that -#' contains the probability of GRIM inconsistency. This is simply the `ratio` -#' column censored to range between 0 and 1. Default is `FALSE`. +#' @param show_prob `r lifecycle::badge("deprecated")` Logical. No longer +#' supported: now, there is always a `probability` column. (It replaces the +#' earlier `ratio` column.) #' @param rounding,threshold,symmetric,tolerance Further parameters of #' GRIM-testing; see documentation for [`grim()`]. #' @param testables_only Logical. If `testables_only` is set to `TRUE`, only @@ -47,8 +47,9 @@ #' @return A tibble with these columns -- #' - `x`, `n`: the inputs. #' - `consistency`: GRIM consistency of `x`, `n`, and `items`. +#' - `probability`: the probability of GRIM inconsistency; see +#' [`grim_probability()`]. #' - ``: any columns from `data` other than `x`, `n`, and `items`. -#' - `ratio`: the GRIM ratio; see [`grim_ratio()`]. 
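To make the new return structure concrete, here is an illustrative sketch (assuming the bundled `pigs1` dataset; column order as documented above, output not shown):

```r
library(scrutiny)

# Returns a tibble with `x`, `n`, `consistency`, `probability`, and then
# any other columns from the input data frame:
grim_map(pigs1)
```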
#' #' The tibble has the `scr_grim_map` class, which is recognized by the #' [`audit()`] generic. @@ -74,10 +75,10 @@ #' 1. `incons_cases`: number of GRIM-inconsistent value sets. #' 2. `all_cases`: total number of value sets. #' 3. `incons_rate`: proportion of GRIM-inconsistent value sets. -#' 4. `mean_grim_ratio`: average of GRIM ratios. -#' 5. `incons_to_ratio`: ratio of `incons_rate` to `mean_grim_ratio`. +#' 4. `mean_grim_prob`: average probability of GRIM inconsistency. +#' 5. `incons_to_prob`: ratio of `incons_rate` to `mean_grim_prob`. #' 6. `testable_cases`: number of GRIM-testable value sets (i.e., those with a -#' positive ratio). +#' positive `probability`). #' 7. `testable_rate`: proportion of GRIM-testable value sets. #' @include audit.R grim.R manage-extra-cols.R restore-zeros.R @@ -117,7 +118,8 @@ grim_map <- function(data, items = 1, merge_items = TRUE, percent = FALSE, - x = NULL, n = NULL, show_rec = FALSE, show_prob = FALSE, + x = NULL, n = NULL, show_rec = FALSE, + show_prob = deprecated(), rounding = "up_or_down", threshold = 5, symmetric = FALSE, tolerance = .Machine$double.eps^0.5, testables_only = FALSE, extra = Inf) { @@ -129,6 +131,19 @@ grim_map <- function(data, items = 1, merge_items = TRUE, percent = FALSE, items, percent, rounding, threshold, symmetric, tolerance )) + # Warn if the user specified a deprecated argument: + if (lifecycle::is_present(show_prob)) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "grim_map(show_prob)", + details = c( + "The \"probability\" column has replaced the \\ + \"ratio\" column, so now it is always displayed, \\ + and `show_prob` no longer has any effect." + ) + ) + } + # Check if the user specified the arguments named after the key columns, `x` # and `n`. If so, the user-supplied value for that argument will be checked # against the column names of `data`. If the value is one of those column @@ -202,9 +217,13 @@ grim_map <- function(data, items = 1, merge_items = TRUE, percent = FALSE, ) } - # 5.: Compute the GRIM ratios for all of the same value sets via - # `grim_ratio()`, which also gets the `percent` argument passed on to: - ratio <- purrr::pmap_dbl(data_x_n_items, grim_ratio, percent = percent) + # 5.: Compute the GRIM probabilities for all of the same value sets via + # `grim_probability()`, which also gets the `percent` argument passed on to: + probability <- purrr::pmap_dbl( + .l = data_x_n_items, + .f = grim_probability, + percent = percent + ) # 6.-?: Any number of other columns from `data` (via the `other_cols` object). @@ -212,11 +231,11 @@ grim_map <- function(data, items = 1, merge_items = TRUE, percent = FALSE, # Create a tibble with results that also includes extra columns from the input # data frame (`other_cols`) unless the `extra` argument has been set to 0 -- if (is.null(extra)) { - # (Number:) 1 2 3 4 - results <- tibble::tibble(x, n, consistency, ratio) + # (Number:) 1 2 3 4 + results <- tibble::tibble(x, n, consistency, probability) } else { - # (Number:) 1 2 3 4 5(-?) - results <- tibble::tibble(x, n, consistency, ratio, other_cols) + # (Number:) 1 2 3 4 5(-?) 
+ results <- tibble::tibble(x, n, consistency, probability, other_cols) } # In case the user had set `show_rec` to `TRUE` for displaying the @@ -257,63 +276,44 @@ grim_map <- function(data, items = 1, merge_items = TRUE, percent = FALSE, } results <- results %>% - unnest_consistency_cols(col_names, index = TRUE) - - # The minus-1 operation balances the temporary removal of `"consistency"` - # that occurred within `unnest_consistency_cols()`: - index_consistency <- match("consistency", colnames(results)) - 1L - - results <- results %>% - dplyr::relocate(ratio, .after = index_consistency + length(col_names)) - } - - - # If demanded via the `show_prob` argument, add a column that displays the - # probability of GRIM inconsistency. This is simply the GRIM ratio - # left-censored at zero, i.e., with negative values set to zero: - if (show_prob) { - results <- results %>% - tibble::add_column( - prob = dplyr::if_else(ratio < 0, 0, ratio), .after = "ratio" - ) + unnest_consistency_cols(col_names, index = TRUE) %>% + dplyr::relocate(probability, .after = consistency) } - # Add the "scr_grim_map" class that `audit()` will recognize: - results <- results %>% - add_class(c("scr_grim_map", glue::glue("scr_rounding_{rounding}"))) + # Prepare to add the "scr_grim_map" class that `audit()` will recognize, as + # well as a class that is informative about the rounding procedure used for + # reconstructing `x`: + classes_to_add <- c("scr_grim_map", paste0("scr_rounding_", rounding)) # Mediate between `seq_endpoint_df()` or `seq_distance_df()`, on the one hand, # and `seq_test_ranking()`, on the other: if (inherits(data, "scr_seq_df")) { - results <- results %>% - add_class("scr_seq_test") + classes_to_add <- c("scr_seq_test", classes_to_add) } + class(results) <- c(classes_to_add, class(results)) + # If `x` is a percentage, divide it by 100 to adjust its value. The resulting # decimal numbers will then be the values actually used in GRIM-testing; i.e., # within `grim_scalar()`. Also, issue an alert to the user about the # percentage conversion: if (percent) { - dp_original <- decimal_places(results$x) - results$x <- as.numeric(results$x) / 100 + digits_original <- decimal_places(results$x) + results$x <- as.numeric(results$x) / 100 results$x <- results$x %>% - restore_zeros(width = (dp_original + 2L)) %>% + restore_zeros(width = digits_original + 2L) %>% suppressWarnings() - results <- results %>% - add_class("scr_percent_true") - - cli::cli_alert_info( - "`x` converted from percentage" - ) + class(results) <- c("scr_percent_true", class(results)) + cli::cli_alert_info("`x` converted from percentage") } # Finally, return either all of the results or only the GRIM-testable ones: if (testables_only) { - return(dplyr::filter(results, ratio > 0)) + dplyr::filter(results, probability > 0) } else { - return(results) + results } } diff --git a/R/grim-stats.R b/R/grim-stats.R index 8eb8363e..4a995f9e 100644 --- a/R/grim-stats.R +++ b/R/grim-stats.R @@ -1,25 +1,29 @@ #' Possible GRIM inconsistencies #' -#' @description Even without GRIM-testing, means / proportions and sample sizes -#' of granular distributions entail some key data: +#' @description These functions compute statistics related to GRIM-testing. In +#' general, `grim_probability()` is the most useful of them, and it is +#' responsible for the `probability` column in a data frame returned by +#' [`grim_map()`]. 
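As a worked sketch of the arithmetic behind `grim_probability()` (using the same values as the `@examples` below; `items = 1`):

```r
# "83.29" has 2 decimal places, so there are 10^2 = 100 possible fractional
# values at that precision, of which at most n = 21 are attainable as means:
# (100 - 21) / 100 = 0.79
grim_probability(x = "83.29", n = 21)  # 0.79
grim_total(x = "83.29", n = 21)        # 79
```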
#'
+#' - `grim_probability()` returns the probability that a reported mean or
+#' percentage of integer data is inconsistent with the reported sample size,
+#' assuming the reported value is random except for its number of decimal
+#' places. For example, the mean 1.23 is treated like any other mean with two
+#' decimal places.
+#' - `grim_ratio()` is equal to `grim_probability()` unless `grim_ratio()` is
+#' negative, which can occur if the sample size is very large. Strictly
+#' speaking, this is more informative than `grim_probability()`, but it is
+#' harder to interpret.
 #' - `grim_total()` returns the absolute number of GRIM-inconsistencies that
-#' are possible given the mean or percentage's number of decimal places (`D`)
-#' and the corresponding sample size.
-#' - `grim_ratio()` returns a proportion that is normalized by `10^D`, and
-#' therefore comparable across mean or percentage values reported to varying
-#' `D`.
-#' - `grim_ratio_upper()` returns the upper bound of `grim_ratio()` for a
-#' given `D`.
+#' are possible given the mean or percentage's number of decimal places and
+#' the corresponding sample size.
 #'
 #' For discussion, see `vignette("grim")`, section *GRIM statistics*.
-#' @param x String or numeric (length 1). Mean or percentage value computed from
-#' data with integer units (e.g., mean scores on a Likert scale or percentage
-#' of study participants in some condition). *Note*: Numeric inputs don't
-#' include trailing zeros, but these are important for GRIM functions. See
-#' documentation for `grim()`.
+#' @param x String (length 1). Mean or percentage value computed from data with
+#' integer units, e.g., mean scores on a Likert scale or percentage of study
+#' participants in some condition. It has to be a string to capture any
+#' trailing zeros.
 #' @param n Integer. Sample size corresponding to `x`.
 #' @param items Integer. Number of items composing the mean or percentage value
 #' in question. Default is `1`.
@@ -27,11 +31,12 @@
 #' proportion of 100 rather than 1. The functions will then account for this
 #' fact through increasing the decimal count by 2. Default is `FALSE`.
-#' @seealso `grim()` for the GRIM test itself; as well as `grim_map()` for
+#' @seealso [`grim()`] for the GRIM test itself; as well as [`grim_map()`] for
 #' applying it to many cases at once.
 #'
-#' @return Integer or double. The number or proportion of possible GRIM
-#' inconsistencies.
+#' @return Integer or double. The number of possible GRIM inconsistencies, or
+#' their probability for a random mean or percentage with a given number of
+#' decimal places.
 #'
 #' @references Brown, N. J. L., & Heathers, J. A. J. (2017). The GRIM Test: A
 #' Simple Technique Detects Numerous Anomalies in the Reporting of Results in
 #' Social Psychology. *Social Psychological and Personality Science*, 8(4),
 #' 363–369. https://journals.sagepub.com/doi/10.1177/1948550616673876
 #'
 #' @examples
 #' # Many value sets are inconsistent here:
+#' grim_probability(x = "83.29", n = 21)
 #' grim_total(x = "83.29", n = 21)
-#' grim_ratio(x = "83.29", n = 21)
 #'
 #' # No sets are inconsistent in this case...
+#' grim_probability(x = "5.14", n = 83)
 #' grim_total(x = "5.14", n = 83)
-#' grim_ratio(x = "5.14", n = 83)
 #'
 #' # ...
but most would be if `x` was a percentage: +#' grim_probability(x = "5.14", n = 83, percent = TRUE) #' grim_total(x = "5.14", n = 83, percent = TRUE) -#' grim_ratio(x = "5.14", n = 83, percent = TRUE) -# Absolute ---------------------------------------------------------------- - -#' @rdname grim-stats -#' @export +# Relative ---------------------------------------------------------------- -grim_total <- function(x, n, items = 1, percent = FALSE) { +grim_probability <- function(x, n, items = 1, percent = FALSE) { + # Manual check (instead of calling `check_type()`) for performance; this + # function will run a great deal: + if (!is.character(x)) { + cli::cli_abort(c( + "!" = "`x` must be of type character.", + "x" = "It is {an_a_type(x)}." + )) + } digits <- decimal_places_scalar(x) if (percent) digits <- digits + 2L p10 <- 10 ^ digits - as.integer(p10 - (n * items)) + out <- (p10 - n * items) / p10 + dplyr::if_else(out < 0, 0, out) } -# Relative ---------------------------------------------------------------- - #' @rdname grim-stats #' @export - grim_ratio <- function(x, n, items = 1, percent = FALSE) { + check_type(x, "character") digits <- decimal_places_scalar(x) if (percent) digits <- digits + 2L p10 <- 10 ^ digits - (p10 - (n * items)) / p10 + (p10 - n * items) / p10 } +# Absolute ---------------------------------------------------------------- + #' @rdname grim-stats #' @export +grim_total <- function(x, n, items = 1, percent = FALSE) { + check_type(x, "character") + digits <- decimal_places_scalar(x) + if (percent) digits <- digits + 2L + p10 <- 10 ^ digits + as.integer(p10 - (n * items)) +} + + +#' Upper bound for the GRIM ratio +#' +#' @description `r lifecycle::badge("deprecated")` +#' +#' `grim_ratio_upper()` is deprecated because it no longer seems very +#' meaningful. It will be removed in a future version. +#' +#' See [`grim_probability()`] for a more interesting measure. +#' +#' @inheritParams grim-stats +#' +#' @keywords internal +#' +#' @return Numeric. +#' +#' @export grim_ratio_upper <- function(x, percent = FALSE) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "grim_ratio_upper()", + details = "It will be removed in a future version." + ) + check_type(x, "character") grim_ratio(x = x, n = 1, items = 1, percent = percent) } diff --git a/R/grim.R b/R/grim.R index d8ac1354..4ce30331 100644 --- a/R/grim.R +++ b/R/grim.R @@ -24,7 +24,7 @@ grim_scalar <- function(x, n, items = 1, percent = FALSE, show_rec = FALSE, # As trailing zeros matter for the GRIM test, `x` must be a string: if (!is.character(x)) { cli::cli_abort(c( - "!" = "`x` must be a string.", + "!" = "`x` must be of type character.", "x" = "It is {an_a_type(x)}." )) } diff --git a/R/grimmer-map.R b/R/grimmer-map.R index 85aee6da..ab34db84 100644 --- a/R/grimmer-map.R +++ b/R/grimmer-map.R @@ -79,6 +79,19 @@ #' audit() +# data <- pigs5 +# items <- 1 +# merge_items <- TRUE +# x <- NULL +# sd <- NULL +# n <- NULL +# show_reason <- TRUE +# rounding <- "up_or_down" +# threshold <- 5 +# symmetric <- FALSE +# tolerance <- .Machine$double.eps^0.5 + + grimmer_map <- function(data, items = 1, merge_items = TRUE, x = NULL, sd = NULL, n = NULL, show_reason = TRUE, rounding = "up_or_down", diff --git a/R/grimmer-rsprite2.R b/R/grimmer-rsprite2.R new file mode 100644 index 00000000..4988a096 --- /dev/null +++ b/R/grimmer-rsprite2.R @@ -0,0 +1,189 @@ + +# NOTE: This file was taken from the rsprite2 package +# (https://github.com/LukasWallrich/rsprite2/blob/master/R/core-functions.R). 
+# However, it only contains two functions from the original file.
+
+# The idea is to take them as a basis for revamping scrutiny's implementation of
+# GRIMMER, notably fixing the `items` argument.
+
+# TODO: run the grimmer-replace-names.R script on the copy.
+
+# # All the variable names:
+# grimmer_scalar
+# x
+# sd
+# n
+# items
+# digits_x
+# digits_sd
+# x_real
+# sum_real
+# n_items
+# sd_lower
+# sd_upper
+# sum_squares_lower
+# sum_squares_upper
+# integers_possible
+# var_predicted
+# sd_predicted
+# matches_parity
+# pass_test1
+# matches_sd
+# pass_test3
+
+
+# Helper function ---------------------------------------------------------
+
+# Determine minimum and maximum SDs for given scale ranges, N, and x.
+sd_bounds_measure <- function(n, x, min_val, max_val, sd_prec = NULL, items = 1) {
+
+  if (is.null(sd_prec)) {
+    sd_prec <- max(nchar(sub("^[0-9]*", "", x)) - 1, 0)
+  }
+
+  result <- c(-Inf, Inf)
+
+  aMax <- min_val # "aMax" means "value of a to produce the max sd"
+  aMin <- floor(x*items)/items
+  bMax <- max(max_val, min_val + 1, aMin + 1) # sanity check (just max_val would normally be ok)
+  bMin <- aMin + 1/items
+  total <- round(x * n * items)/items
+
+  poss_values <- max_val
+  for (i in seq_len(items)) {
+    poss_values <- c(poss_values, min_val:(max_val-1) + (1 / items) * (i - 1))
+  }
+  poss_values <- sort(poss_values)
+
+  for (abm in list(c(aMin, bMin, 1), c(aMax, bMax, 2))) {
+
+    a <- abm[1]
+    b <- abm[2]
+    m <- abm[3]
+
+
+    k <- round((total - (n * b)) / (a - b))
+    k <- min(max(k, 1), n - 1) # ensure there is at least one of each of two numbers
+    vec <- c(rep(a, k), rep(b, n - k))
+    diff <- sum(vec) - total
+
+    if ((diff < 0)) {
+      vec <- c(rep(a, k - 1), a + abs(diff), rep(b, n - k))
+    }
+    else if ((diff > 0)) {
+      vec <- c(rep(a, k), b - diff, rep(b, n - k - 1))
+    }
+
+    # `mean(vec)` here: a naive string replacement of `mean` by `x` would
+    # clobber this function call and make the check error out.
+    if (round(mean(vec), sd_prec) != round(x, sd_prec) | !all(floor(vec*10e9) %in% floor(poss_values*10e9))) {
+      stop("Error in calculating range of possible standard deviations")
+    }
+
+    result[m] <- round(sd(vec), sd_prec)
+  }
+
+  return(result)
+}
+
+
+
+# Main function -----------------------------------------------------------
+
+grimmer_scalar <- function(x, sd, n, m_prec = NULL, sd_prec = NULL, items = 1, min_val = NULL, max_val = NULL) {
+  if (is.null(m_prec)) {
+    m_prec <- max(nchar(sub("^[0-9]*", "", x)) - 1, 0)
+  }
+
+  if (is.null(sd_prec)) {
+    sd_prec <- max(nchar(sub("^[0-9]*", "", sd)) - 1, 0)
+  }
+
+  n_items = n * items
+
+  # Applies the GRIM test, and computes the possible x.
+  sum <- x * n_items
+  sum_real <- round(sum)
+  x_real <- sum_real / n_items
+
+  # Checks whether x and sd are within possible range
+  if (!is.null(min_val) & !is.null(max_val)) {
+    if (x < min_val | x > max_val) {
+      warning("`x` must be between the scale minimum and maximum")
+      return(FALSE)
+    }
+    sd_bounds <- sd_bounds_measure(n, x, min_val, max_val, sd_prec, items)
+    if (sd < sd_bounds[1] | sd > sd_bounds[2]) {
+      warning("Given the scale minimum and maximum, the standard deviation has to be between ", sd_bounds[1], " and ", sd_bounds[2], ".")
+      return(FALSE)
+    }
+  }
+  # Creates functions to round a number consistently up or down, when the last digit is 5
+  round_down <- function(number, decimals = 2) {
+    to_round <- number * 10^(decimals + 1) - floor(number * 10^(decimals)) * 10
+    number_rounded <- ifelse(to_round == 5,
+                             floor(number * 10^decimals) / 10^decimals,
+                             round(number, digits = decimals))
+    return(number_rounded)
+  }
+
+  round_up <- function(number, decimals = 2) {
+    to_round <- number * 10^(decimals + 1) - floor(number * 10^(decimals)) * 10
+    number_rounded <- ifelse(to_round == 5,
+                             ceiling(number * 10^decimals) / 10^decimals,
+                             round(number, digits = decimals))
+    return(number_rounded)
+  }
+
+  # Applies the GRIM test, to see whether the reconstituted x is the same as the reported x (with both down and up rounding)
+
+  consistent_down <- round_down(number = x_real, decimals = m_prec) == x
+  consistent_up <- round_up(number = x_real, decimals = m_prec) == x
+
+  if (!consistent_down & !consistent_up) {
+    warning("GRIM inconsistent - so GRIMMER test cannot be run. See ?GRIM_test")
+    return(FALSE)
+  }
+
+  # Computes the lower and upper bounds for the sd.
+
+  sd_lower <- ifelse(sd < 5 / (10^(sd_prec+1)), 0, sd - 5 / (10^(sd_prec+1)))
+  sd_upper <- sd + 5 / (10^(sd_prec+1))
+
+  # Computes the lower and upper bounds for the sum of squares of items.
+
+  lower_bound <- ((n - 1) * sd_lower^2 + n * x_real^2)*items^2
+  upper_bound <- ((n - 1) * sd_upper^2 + n * x_real^2)*items^2
+
+  # Checks that there is at least one integer between the lower and upper bound
+
+  if (ceiling(lower_bound) > floor(upper_bound)) {
+    return(FALSE)
+  }
+
+  # Takes a vector of all the integers between the lower and upper bounds
+
+  integers_possible <- ceiling(lower_bound):floor(upper_bound)
+
+  # Creates the predicted variance and sd
+
+  var_predicted <- (integers_possible/items^2 - n * x_real^2) / (n - 1)
+  sd_predicted <- sqrt(var_predicted)
+
+  # Computes whether one sd_predicted matches the sd (trying to round both down and up)
+
+  sd_rounded_down <- round_down(sd_predicted, sd_prec)
+  sd_rounded_up <- round_up(sd_predicted, sd_prec)
+
+  matches_sd <- sd_rounded_down == sd | sd_rounded_up == sd
+
+  if (!any(matches_sd)) {
+    return(FALSE)
+  }
+
+  # Computes whether there is an integer of the correct parity between the lower and upper bounds.
+  parity <- sum_real %% 2
+  matches_parity <- integers_possible %% 2 == parity
+  return(any(matches_sd & matches_parity))
+}
+
diff --git a/R/grimmer.R b/R/grimmer.R
index c841bd31..4995fcad 100644
--- a/R/grimmer.R
+++ b/R/grimmer.R
@@ -8,7 +8,8 @@
 # techniques; for example, functions like `reround()` and
 # `decimal_places_scalar()`. Second, changing the return value to logical, which
 # is the expected output from the basic implementation of any consistency test
-# within scrutiny.
Third, adjusting variable names to the tidyverse style guide +# and scrutiny's domain-specific conventions. # Translation of variable names ------------------------------------------- @@ -21,6 +22,9 @@ # SD --> sd # decimals_mean --> digits_x (argument removed; counting internally) # decimals_SD --> digits_sd (argument removed; counting internally) +# realmean --> x_real +# realsum --> sum_real +# effective_n --> n_items # Lsigma --> sd_lower # Usigma --> sd_upper # Lowerbound --> sum_squares_lower @@ -28,22 +32,55 @@ # Possible_Integers --> integers_possible # Predicted_Variance --> var_predicted # Predicted_SD --> sd_predicted +# Matches_Oddness --> matches_parity # FirstTest --> pass_test1 # Matches_SD --> matches_sd (to which a `pass_test2` object was added) # Third_Test --> pass_test3 -# # Example inputs: +# # Example inputs 1: +# x <- "1.03" +# sd <- "0.41" # n <- 40 -# mean <- 1.03 -# SD <- 0.41 # items <- 1 +# show_reason <- TRUE # rounding <- "up_or_down" # threshold <- 5 # symmetric <- FALSE # tolerance <- .Machine$double.eps^0.5 -# decimals_mean <- 2 -# decimals_SD <- 2 +# # decimals_mean <- 2 +# # decimals_SD <- 2 + + +# # Example inputs 2: +# # (actually derived from this distribution: c(1, 1, 2, 3, 3, 4, 4, 4, 4, 5)) +# x <- "3.10" +# sd <- "1.37" +# n <- 10 +# items <- 1 +# show_reason <- TRUE +# rounding <- "up_or_down" +# threshold <- 5 +# symmetric <- FALSE +# tolerance <- .Machine$double.eps^0.5 +# # decimals_mean <- 2 +# # decimals_SD <- 2 + + +# # Example inputs 3: +# # (edge case from `pigs5`) +# x <- "2.57" +# sd <- "2.57" +# n <- 30 +# items <- 1 +# show_reason <- TRUE +# rounding <- "up_or_down" +# threshold <- 5 +# symmetric <- FALSE +# tolerance <- .Machine$double.eps^0.5 +# # decimals_mean <- 2 +# # decimals_SD <- 2 + # Implementation ---------------------------------------------------------- @@ -56,13 +93,13 @@ grimmer_scalar <- function(x, sd, n, items = 1, show_reason = FALSE, check_type(x, "character") check_type(sd, "character") - # A provisional solution: - if (items != 1) { - cli::cli_warn(c( - "The `items` argument in GRIMMER functions doesn't currently \ - work the way it should." - )) - } + # # A provisional solution: + # if (items != 1) { + # cli::cli_warn(c( + # "The `items` argument in GRIMMER functions doesn't currently \ + # work the way it should." + # )) + # } digits_sd <- decimal_places_scalar(sd) @@ -70,31 +107,30 @@ grimmer_scalar <- function(x, sd, n, items = 1, show_reason = FALSE, x <- as.numeric(x) sd <- as.numeric(sd) - n_orig <- n - n <- n * items + n_items <- n * items - sum <- x*n - realsum <- round(sum) - real_x <- realsum/n + sum <- x * n_items + sum_real <- round(sum) + x_real <- sum_real / n_items - # GRIM test. It says `x_orig` because the `x` object has been coerced from - # character to numeric, but the original string is needed here. Likewise, it - # says `n_orig` because the `n` object has been multiplied by `items`, so if - # `items` is not zero, taking `n` here would lead to wrong results: - grim_consistency <- grim_scalar( - x = x_orig, n = n_orig, items = items, rounding = rounding, + # GRIM TEST: It says `x_orig` because the `x` object has been coerced from + # character to numeric, but `grim_scalar()` needs the original number-string. + # Similarly, since this function also gets `items` passed down, it needs the + # original `n`, not `n_items`. 
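+  # (GRIMMER presupposes GRIM: if `x` itself is impossible given `n` and
+  # `items`, there is no point in testing the SD, so the function returns
+  # early below.)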
+  pass_grim <- grim_scalar(
+    x = x_orig, n = n, items = items, rounding = rounding,
     threshold = threshold, symmetric = symmetric, tolerance = tolerance
   )

-  if (!grim_consistency) {
+  if (!pass_grim) {
     if (show_reason) {
       return(list(FALSE, "GRIM inconsistent"))
     }
     return(FALSE)
   }

-  p10 <- 10 ^ digits_sd
+  p10 <- 10 ^ (digits_sd + 1)
   p10_frac <- 5 / p10

   # SD bounds, lower and upper:
@@ -107,10 +143,20 @@
   sd_upper <- sd + p10_frac

   # Sum of squares bounds, lower and upper:
-  sum_squares_lower <- (n - 1) * sd_lower ^ 2 + n * real_x ^ 2
-  sum_squares_upper <- (n - 1) * sd_upper ^ 2 + n * real_x ^ 2
-
-  pass_test1 <- !ceiling(sum_squares_lower) > floor(sum_squares_upper)
+  sum_squares_lower <- ((n - 1) * sd_lower ^ 2 + n * x_real ^ 2) * items ^ 2
+  sum_squares_upper <- ((n - 1) * sd_upper ^ 2 + n * x_real ^ 2) * items ^ 2
+
+  # TEST 1: Check that there is at least one integer between the lower and upper
+  # bounds (of the reconstructed sum of squares of the -- most likely unknown --
+  # values for which `x` was reported as a mean). Ceiling the lower bound and
+  # flooring the upper bound determines whether there are any integers between
+  # the two. For example:
+  # -- If `sum_squares_lower` is 112.869 and `sum_squares_upper` is 113.1156,
+  # `ceiling(sum_squares_lower)` and `floor(sum_squares_upper)` both return
+  # `113`, so there is an integer between them, and `<=` returns `TRUE`.
+  # -- If `sum_squares_lower` is 113.21 and `sum_squares_upper` is 113.86,
+  # `ceiling(sum_squares_lower)` returns `114` and `floor(sum_squares_upper)`
+  # returns `113`, so no integer lies between them, `<=` returns `FALSE`, and
+  # thus `pass_test1` is `FALSE`.
+  pass_test1 <- ceiling(sum_squares_lower) <= floor(sum_squares_upper)

   if (!pass_test1) {
     if (show_reason) {
@@ -124,7 +170,7 @@
   integers_possible <- ceiling(sum_squares_lower):floor(sum_squares_upper)

   # Create the predicted variance and SD:
-  var_predicted <- (integers_possible - n * real_x ^ 2) / (n - 1)
+  var_predicted <- (integers_possible / items ^ 2 - n * x_real ^ 2) / (n - 1)
   sd_predicted <- sqrt(var_predicted)

   # Reconstruct the SD:
@@ -136,11 +182,23 @@
     symmetric = symmetric
   )

-  # Introduce some numeric tolerance to the SD values before comparing them:
+  # Introduce a small numeric tolerance to the reported and reconstructed SD
+  # values before comparing them. This helps avoid false-negative results of the
+  # comparison (i.e., treating equal values as unequal) that might occur due to
+  # spurious precision in floating-point numbers.
   sd <- dustify(sd)
   sd_rec_rounded <- dustify(sd_rec_rounded)

-  matches_sd <- dplyr::near(sd, sd_rec_rounded, tol = tolerance)
+  # Check the reported SD for near-equality with the reconstructed SD values;
+  # i.e., equality within a very small tolerance. This test is applied via
+  # `purrr::map_lgl()` because `dustify()` doubled the length of both vectors.
+  matches_sd <- purrr::map_lgl(
+    .x = sd,
+    .f = function(x) any(dplyr::near(x, sd_rec_rounded, tol = tolerance))
+  )
+
+  # TEST 2: If none of the reconstructed SDs matches the reported one, the
+  # inputs are GRIMMER-inconsistent.
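+  # (Illustration: if the reported SD is "1.37" but all reconstructed SDs
+  # round to 1.36 or 1.38, there is no match, and the test fails.)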
pass_test2 <- any(matches_sd[!is.na(matches_sd)])

   if (!pass_test2) {
@@ -150,14 +208,18 @@
     return(FALSE)
   }

-  # Determine if any integer between the lower and upper bounds has the correct
-  # parity (the property of being even or odd):
-  parity_realsum <- realsum %% 2
-  parity_integers_possible <- integers_possible %% 2
+  # Determine if any integer between the lower and upper bounds has the same
+  # parity (i.e., the property of being even or odd) as the reconstructed sum:
+  matches_parity <- sum_real %% 2 == integers_possible %% 2

-  matches_parity <- parity_realsum == parity_integers_possible
+  matches_sd_and_parity <- purrr::map_lgl(
+    .x = matches_parity,
+    .f = function(x) any(x & matches_sd)
+  )

-  pass_test3 <- any(matches_sd & matches_parity)
+  # TEST 3: At least one of the reconstructed SDs has to match the reported
+  # one, and the corresponding reconstructed sums have to match in parity.
+  pass_test3 <- any(matches_sd_and_parity)

   if (!pass_test3) {
     if (show_reason) {
@@ -166,10 +228,14 @@
     return(FALSE)
   }

+  # All the tests were passed if the algorithm reaches this point, so the inputs
+  # are GRIMMER-consistent:
   if (show_reason) {
-    return(list(TRUE, "Passed all"))
+    list(TRUE, "Passed all")
+  } else {
+    TRUE
   }
-  return(TRUE)
+
 }
diff --git a/R/method-grim-map.R b/R/method-grim-map.R
index eb12c1f7..a4017b51 100644
--- a/R/method-grim-map.R
+++ b/R/method-grim-map.R
@@ -11,19 +11,19 @@ audit.scr_grim_map <- function(data) {
   # 3. the proportion of GRIM-inconsistent cases:
   out <- audit_cols_minimal(data, "GRIM")

-  # 4. the average of GRIM ratios:
-  mean_grim_ratio <- data %>%
-    dplyr::summarise(mean_grim_ratio = mean(.data$ratio)) %>%
+  # 4. the average of GRIM probabilities:
+  mean_grim_prob <- data %>%
+    dplyr::summarise(mean_grim_prob = mean(.data$probability)) %>%
     as.numeric()

   # 5. the ratio of the proportion of GRIM-inconsistent cases to the average of
-  # GRIM ratios:
+  # GRIM probabilities:
   incons_rate <- out[[3L]]
-  incons_to_ratio <- incons_rate / mean_grim_ratio
+  incons_to_prob <- incons_rate / mean_grim_prob

   # 6. the number of GRIM-testable cases:
   testable_cases <- data %>%
-    dplyr::filter(.data$ratio > 0) %>%
+    dplyr::filter(.data$probability > 0) %>%
     nrow()

   # 7. the proportion of GRIM-testable cases:
@@ -32,6 +32,6 @@ audit.scr_grim_map <- function(data) {
   # Finally, collect all of these values in a resulting tibble --
   tibble::tibble(
-    out, mean_grim_ratio, incons_to_ratio, testable_cases, testable_rate
+    out, mean_grim_prob, incons_to_prob, testable_cases, testable_rate
   )
 }
diff --git a/R/restore-zeros.R b/R/restore-zeros.R
index 4f38f9d8..6de16e14 100644
--- a/R/restore-zeros.R
+++ b/R/restore-zeros.R
@@ -44,8 +44,6 @@
 #' @param sep_out Substring that will be returned in the output to separate the
 #' mantissa from the integer part. By default, `sep_out` is the same as
 #' `sep_in`.
-#' @param sep [[Deprecated]] Use `sep_in`, not `sep`. If `sep` is specified
-#' nonetheless, `sep_in` takes on `sep`'s value.
 #' @param data Data frame or matrix. Only in `restore_zeros_df()`, and instead
 #' of `x`.
 #' @param cols Only in `restore_zeros_df()`.
Select columns from `data` using @@ -97,26 +95,12 @@ #' restore_zeros_df(starts_with("Sepal"), width = 3) -restore_zeros <- function(x, width = NULL, sep_in = "\\.", sep_out = sep_in, - sep = deprecated()) { +restore_zeros <- function(x, width = NULL, sep_in = "\\.", sep_out = sep_in) { # Make sure no whitespace (from values that already were strings) is factored # into the count: x <- stringr::str_trim(x) - if (lifecycle::is_present(sep)) { - lifecycle::deprecate_warn( - when = "0.1.1", - what = "restore_zeros(sep)", - details = c( - "`sep` was replaced by `sep_in`, which now conflicts with it.", - "If `sep` is still specified, `sep_in` takes on its value." - ) - ) - # If `sep` is specified, `sep_in` must take on its role: - sep_in <- sep - } - # Determine the maximal width to which the mantissas should be padded in # accordance with the `width` argument, the default of which, `NULL`, makes # the function go by the maximal length of already-present mantissas: @@ -181,19 +165,7 @@ restore_zeros <- function(x, width = NULL, sep_in = "\\.", sep_out = sep_in, restore_zeros_df <- function(data, cols = everything(), check_numeric_like = TRUE, check_decimals = FALSE, width = NULL, sep_in = "\\.", sep_out = NULL, - sep = deprecated(), ...) { - - if (lifecycle::is_present(sep)) { - lifecycle::deprecate_warn( - when = "0.3.0", - what = "restore_zeros_df(sep)", - details = c( - "`sep` was replaced by `sep_in`, which now conflicts with it.", - "If `sep` is still specified, `sep_in` takes on its value \ - within `restore_zeros()`." - ) - ) - } + ...) { # Check whether the user specified any "old" arguments: those starting on a # dot. This check is now the only remaining purpose of the `...` dots because diff --git a/R/row-to-colnames.R b/R/row-to-colnames.R index 205f2ced..63908ccd 100644 --- a/R/row-to-colnames.R +++ b/R/row-to-colnames.R @@ -27,7 +27,7 @@ #' #' @include utils.R #' -#' @seealso [`unheadr::mash_colnames()`], a more sophisticated solution to the +#' @seealso `unheadr::mash_colnames()`, a more sophisticated solution to the #' same problem. #' #' @return A tibble (data frame). diff --git a/R/seq-predicates.R b/R/seq-predicates.R index b284b8c2..fe102fc0 100644 --- a/R/seq-predicates.R +++ b/R/seq-predicates.R @@ -24,15 +24,23 @@ index_seq <- function(x) { } - -is_linear <- function(x, tolerance) { - x_seq <- index_seq(x) - x_seq <- dplyr::near(x_seq, min(x_seq), tol = tolerance) - all(x_seq) +is_seq_linear_basic <- function(x) { + if (length(x) < 3L) { + return(TRUE) + } + # As the difference between each successive pair of values must be equal for + # `x` to be a linear sequence, we can take the first pairwise difference and + # test each other difference for equality with it. If any comparison turns out + # unequal, `x` is not a linear sequence. 
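+  # For example, c(2, 4, 6, 8) has pairwise differences 2, 2, 2, so it is
+  # linear; c(2, 4, 7) has differences 2 and 3, so it is not.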
+  diff_first <- x[2L] - x[1L]
+  for (i in 3L:length(x)) {
+    if (x[i] - x[i - 1L] != diff_first) {
+      return(FALSE)
+    }
+  }
+  TRUE
 }
-
-
 is_seq_ascending_basic <- function(x) {
   for (i in 1L:(length(x) - 1L)) {
     if (x[i + 1L] <= x[i]) {
@@ -53,6 +61,14 @@
 is_seq_descending_basic <- function(x) {
 }

+# # Test any of the sequence functions interactively:
+# x <- c(1, 2, NA, 4)
+# tolerance <- .Machine$double.eps^0.5
+# test_linear <- TRUE
+# test_special <- NULL
+# min_length <- NULL
+# args_other <- NULL
+
 # Non-exported workhorse API of all the sequence predicates:
 is_seq_basic <- function(x, tolerance = .Machine$double.eps^0.5,
@@ -69,13 +85,6 @@
     if (is_even(length(x))) {
       return(FALSE)
     }
-    # Save the unmodified `x` for a test that is conducted if `x` contains one
-    # or more `NA` elements:
-    x_orig <- x
-  }
-
-  if (all(is.na(x))) {
-    return(NA)
   }

   if (!is_numeric_like(x)) {
@@ -90,31 +99,66 @@
     return(TRUE)
   }

-  x_has_na <- any(is.na(x))
+  x_has_na <- anyNA(x)

   if (x_has_na) {
-    # These two `while`-loops remove all `NA` values from the start and the end
-    # of `x` because `NA`s at these particular locations have no bearing on
-    # whether or not `x` might represent a linear sequence:
-    while (is.na(x[1L])) {
-      x <- x[-1L]
+    # Save the unmodified `x` for a test that is conducted if `x` contains one
+    # or more `NA` elements:
+    x_orig <- x
+    n_x_orig <- length(x)
+
+    not_na <- which(!is.na(x))
+
+    # Need at least three known values:
+    if (length(not_na) < 3L) {
+      return(NA)
     }
-    while (is.na(x[length(x)])) {
-      x <- x[-length(x)]
+
+    # # Indices of `NA`s at the start and end of `x`:
+    # n_na_start <- seq_len(not_na[1L] - 1L)
+    # n_na_end <- (not_na[length(not_na)] + 1L):n_x_orig
+
+    n_na_start <- match(FALSE, is.na(x_orig)) - 1L
+    n_na_end <- match(FALSE, rev(is.na(x_orig))) - 1L
+
+    # Remove all `NA` values from the start and the end of `x` because `NA`s at
+    # these particular locations cannot disprove that `x` is the kind of
+    # sequence of interest. (They do mean that it cannot be proven, so the
+    # function will return either `NA` or `FALSE`, depending on other factors.)
+    x <- x[not_na[1L]:not_na[length(not_na)]]
+
+    if (test_linear && !anyNA(x) && !is_seq_linear_basic(x)) {
+      return(FALSE)
     }

     # If the removal of leading and / or trailing `NA` elements in `x` caused
     # the central value to shift, or if only one side from among left and right
-    # had any `NA` values, the original `x` was not symmetrically grouped around
-    # that value, and hence not a dispersed sequence:
-    if (!is.null(test_special) && test_special == "dispersed") {
-      diff_central_index <-
-        !dplyr::near(args_other$from, x[index_central(x)], tolerance)
-      one_sided_na <- (!is.na(x_orig[1L])) || (!is.na(x_orig[length(x_orig)]))
-      if (is_even(length(x)) || diff_central_index || one_sided_na) {
-        return(FALSE)
+    # had any `NA` values, the original `x` might not have been symmetrically
+    # grouped around that value, and hence not a dispersed sequence. Whether it
+    # still might be depends on imputing values for the `NA`s that are
+    # consistent with a dispersed sequence, which is attempted here.
+    if (
+      !is.null(test_special) &&
+      test_special == "dispersed" &&
+      n_na_start != n_na_end
+    ) {
+      unit <- x[2] - x[1]
+
+      # Impute sequences of regularly spaced values that the `NA`s might stand
+      # for, so that the input could potentially be dispersed around
+      # `args_other$from`...
+ seq_na_start <- seq(x[1], length.out = n_na_start, by = -unit) - 1 + seq_na_end <- seq(x[length(x)], length.out = n_na_end, by = unit) + 1 + + seq_imputed <- c(rev(seq_na_start), x, seq_na_end) + + # ...assuming the overall (partly imputed) sequence is still dispersed + # around that value: + if (is_seq_dispersed(seq_imputed, from = args_other$from)) { + return(NA) } + return(FALSE) } # Used within the for loop below to check whether the step size must be @@ -156,7 +200,8 @@ is_seq_basic <- function(x, tolerance = .Machine$double.eps^0.5, # elements, which invariably means that the numbers surrounding the # `NA`s are too far spaced out for there to be a linear sequence. In # either case... - seq_replacement_has_wrong_length <- length(seq_replacement) == 0L || + seq_replacement_has_wrong_length <- + length(seq_replacement) == 0L || length(seq_replacement) > length(index_lower:index_upper) # ...an error is thrown: @@ -181,13 +226,10 @@ is_seq_basic <- function(x, tolerance = .Machine$double.eps^0.5, # in the for loop above -- for linearity: if (test_linear) { x_seq <- index_seq(x) - x_seq <- dplyr::near(x_seq, min(x_seq), tol = tolerance) - pass_test <- all(x_seq) - if (!pass_test) { + pass_test_linear <- all(dplyr::near(x_seq, min(x_seq), tol = tolerance)) + if (!pass_test_linear) { return(FALSE) } - } else { - pass_test <- TRUE } # Interface for the special variant functions: @@ -197,12 +239,13 @@ is_seq_basic <- function(x, tolerance = .Machine$double.eps^0.5, "ascending" = is_seq_ascending_basic(x), "descending" = is_seq_descending_basic(x), "dispersed" = is_seq_dispersed_basic(x, args_other$from, tolerance) + # TODO: MAKE THIS "DISPERSED" TEST ABLE TO ASSUME THAT `NA`S IN `x_orig` + # ARE ACTUALLY DISPERSED VALUES! MAYBE USE `index_central()`'S INTERNAL + # LOGIC AND BUILD UP ON IT. ) - pass_test <- pass_test && pass_test_special - } - - if (!pass_test) { - return(FALSE) + if (!pass_test_special) { + return(FALSE) + } } if (x_has_na) { @@ -252,7 +295,7 @@ is_seq_basic <- function(x, tolerance = .Machine$double.eps^0.5, #' sequence after the `NA`s were replaced by appropriate values? If so, they #' return `NA`; otherwise, they return `FALSE`. -#' @seealso [`validate::is_linear_sequence()`], which is much like +#' @seealso `validate::is_linear_sequence()`, which is much like #' `is_seq_linear()` but more permissive with `NA` values. It comes with some #' additional features, such as support for date-times. diff --git a/R/split-by-parens.R b/R/split-by-parens.R index 4a18c7a0..3482fd9a 100644 --- a/R/split-by-parens.R +++ b/R/split-by-parens.R @@ -36,7 +36,7 @@ #' - [`before_parens()`] and [`inside_parens()`] take a string vector and #' extract values from the respective position. #' - [`dplyr::across()`] powers the application of the two above functions -#' within split_by_parens()`, including the creation of new columns. +#' within `split_by_parens()`, including the creation of new columns. #' - [`tidyr::separate_wider_delim()`] is a more general function, but it does #' not recognize closing elements such as closed parentheses. diff --git a/R/subset-superset.R b/R/subset-superset.R index 7fc084ba..a734ced9 100644 --- a/R/subset-superset.R +++ b/R/subset-superset.R @@ -1,9 +1,17 @@ #' Test for subsets, supersets, and equal sets #' -#' @description Predicate functions that take a vector and test whether it has -#' some particular relation to another vector. 
That second vector is entered -#' in either of three ways -- +#' @description `r lifecycle::badge("deprecated")` +#' +#' All of these functions are deprecated and will be removed in a future +#' version. They are a poor fit for the problem they try to solve, and they +#' are far out of scope for the package. +#' +#' ## Original documentation +#' +#' Predicate functions that take a vector and test whether it has some +#' particular relation to another vector. That second vector is entered in +#' either of three ways -- #' #' \strong{Enter it directly (basic functions):} #' @@ -57,6 +65,8 @@ #' #' @export #' +#' @keywords internal +#' #' @name subset-superset #' #' @examples @@ -103,6 +113,11 @@ #' @export is_subset_of <- function(x, y) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_subset_of()", + details = "It will be removed in a future version." + ) x_rest <- x[!x %in% y] length(x_rest) == 0L } @@ -112,6 +127,11 @@ is_subset_of <- function(x, y) { #' @export is_superset_of <- function(x, y) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_superset_of()", + details = "It will be removed in a future version." + ) y_rest <- y[!y %in% x] length(y_rest) == 0L } @@ -121,6 +141,11 @@ is_superset_of <- function(x, y) { #' @export is_equal_set <- function(x, y) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_equal_set()", + details = "It will be removed in a future version." + ) is_subset_of(x, y) && is_superset_of(x, y) } @@ -129,6 +154,11 @@ is_equal_set <- function(x, y) { #' @export is_proper_subset_of <- function(x, y) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_proper_subset_of()", + details = "It will be removed in a future version." + ) is_subset_of(x, y) && !is_superset_of(x, y) } @@ -137,6 +167,11 @@ is_proper_subset_of <- function(x, y) { #' @export is_proper_superset_of <- function(x, y) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_proper_superset_of()", + details = "It will be removed in a future version." + ) is_superset_of(x, y) && !is_subset_of(x, y) } @@ -148,6 +183,11 @@ is_proper_superset_of <- function(x, y) { #' @export is_subset_of_vals <- function(x, ...) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_subset_of_vals()", + details = "It will be removed in a future version." + ) y <- rlang::enexprs(...) is_subset_of(x, y) } @@ -158,6 +198,11 @@ is_subset_of_vals <- function(x, ...) { #' @export is_superset_of_vals <- function(x, ...) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_superset_of_vals()", + details = "It will be removed in a future version." + ) y <- rlang::enexprs(...) is_superset_of(x, y) } @@ -168,6 +213,11 @@ is_superset_of_vals <- function(x, ...) { #' @export is_equal_set_vals <- function(x, ...) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_equal_set_vals()", + details = "It will be removed in a future version." + ) y <- rlang::enexprs(...) is_equal_set(x, y) } @@ -177,6 +227,11 @@ is_equal_set_vals <- function(x, ...) { #' @export is_proper_subset_of_vals <- function(x, ...) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_proper_subset_of_vals()", + details = "It will be removed in a future version." + ) y <- rlang::enexprs(...) is_proper_subset_of(x, y) } @@ -186,6 +241,11 @@ is_proper_subset_of_vals <- function(x, ...) { #' @export is_proper_superset_of_vals <- function(x, ...) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_proper_superset_of_vals()", + details = "It will be removed in a future version." 
+ ) y <- rlang::enexprs(...) is_proper_superset_of(x, y) } @@ -200,6 +260,11 @@ is_proper_superset_of_vals <- function(x, ...) { #' @export is_subset_of_vecs <- function(x, ...) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_subset_of_vecs()", + details = "It will be removed in a future version." + ) y <- straighten_out(...) is_subset_of(x, y) } @@ -210,6 +275,11 @@ is_subset_of_vecs <- function(x, ...) { #' @export is_superset_of_vecs <- function(x, ...) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_superset_of_vecs()", + details = "It will be removed in a future version." + ) y <- straighten_out(...) is_superset_of(x, y) } @@ -220,6 +290,11 @@ is_superset_of_vecs <- function(x, ...) { #' @export is_equal_set_vecs <- function(x, ...) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_equal_set_vecs()", + details = "It will be removed in a future version." + ) y <- straighten_out(...) is_equal_set(x, y) } @@ -229,6 +304,11 @@ is_equal_set_vecs <- function(x, ...) { #' @export is_proper_subset_of_vecs <- function(x, ...) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_proper_subset_of_vecs()", + details = "It will be removed in a future version." + ) y <- straighten_out(...) is_proper_subset_of(x, y) } @@ -238,6 +318,11 @@ is_proper_subset_of_vecs <- function(x, ...) { #' @export is_proper_superset_of_vecs <- function(x, ...) { + lifecycle::deprecate_warn( + when = "0.5.0", + what = "is_proper_superset_of_vecs()", + details = "It will be removed in a future version." + ) y <- straighten_out(...) is_proper_superset_of(x, y) } diff --git a/R/utils.R b/R/utils.R index 55634888..8bbe215f 100644 --- a/R/utils.R +++ b/R/utils.R @@ -6,7 +6,7 @@ utils::globalVariables(c( "frac", "distance", "both_consistent", "fun", "var", "dispersion", "out_min", "out_max", "include_reported", "n", "times", "value", "name", "setNames", "rounding", "case", "n_sum", "consistency", "ratio", "scr_index_case", - "dust", "starts_with", "value_duplicated", "variable", "sd_lower", + "starts_with", "value_duplicated", "variable", "sd_lower", "sd_incl_lower", "sd_upper", "sd_incl_upper", "x_lower", "x_upper", "dupe_count", "fun_name", # Added after rewriting the function factories using `rlang::new_function()`: @@ -152,30 +152,20 @@ an_a_type <- function(x) { #' Check whether numbers are whole #' -#' @description For each element of a numeric vector `x`, `is_whole_number()` -#' checks whether that element is a whole number. +#' @description For each element of a numeric vector, `is_whole_number()` checks +#' whether that element is a whole number. #' #' This is not the same as the integer data type, so doubles and integers are #' tested the same way. See the note in `?integer`. To test if R itself #' considers a vector integer-like, use `rlang::is_integerish()` instead. #' #' @param x Numeric. -#' @param tolerance Numeric. Any difference between `x` and a truncated version -#' of `x` less than `tolerance` (in the absolute value) will be ignored. The -#' default is close to `1 / (10 ^ 8)`. This avoids errors due to spurious -#' precision in floating-point arithmetic. #' #' @return Logical vector of the same length as `x`. #' -#' @details This function was adapted (with naming modifications) from the -#' examples of `?integer`, where a very similar function is called -#' `is.wholenumber()`. 
-#'
-#' @author R Core Team, Lukas Jung
-#'
 #' @noRd
 is_whole_number <- function(x, tolerance = .Machine$double.eps^0.5) {
-  abs(x - round(x)) < tolerance
+  dplyr::near(x, floor(x), tol = tolerance)
 }
@@ -197,39 +187,6 @@
 parcel_nth_elements <- function(x, n, from = 1L) {

-#' Drop rows with equivalent values, regardless of their order
-#'
-#' This function differs from `dplyr::distinct()` mainly insofar as it doesn't
-#' take the order of values within a row into account. Thus, it only checks
-#' which values are present in a row how many times; hence "equivalent" (see
-#' example below the function).
-#'
-#' @param data Data frame.
-#'
-#' @return Data frame. Each row is now unique.
-#'
-#' @examples
-#' df <- tibble::tribble(
-#'   ~a, ~b, ~c,
-#'   1,  2,  2,
-#'   2,  1,  2,
-#'   1,  2,  3
-#' )
-#'
-#' No two rows are identical here, so `dplyr::distinct()` preserves all of them.
-#' However, `remove_equivalent_rows()` cuts the second row. That is because the
-#' first two rows contain all the same values. They only differ in terms of the
-#' order of values, which the function ignores.
-#' remove_equivalent_rows(df)
-#'
-#' @noRd
-remove_equivalent_rows <- function(data) {
-  data_array <- apply(data, 1L, sort)
-  data[!duplicated(data_array, MARGIN = 2L), ]
-}
-
-
-
 #' Switch back and front columns
 #'
 #' @param data Data frame
@@ -642,28 +599,15 @@
 split_into_rows <- function(data) {

-#' Step size (stride) of a single decimal number
+#' Lowest step size (stride) of decimal numbers
 #'
 #' Computes the smallest possible difference between two numbers on the lowest
-#' decimal level of `x`.
-#'
-#' @param x Numeric (or string coercible to numeric).
-#'
-#' @return Numeric.
+#' decimal level of `x`. This goes by the one element of `x` with the most
+#' decimal places.
 #'
-#' @noRd
-step_size_scalar <- function(x) {
-  digits <- decimal_places_scalar(x)
-  1 / (10 ^ digits)
-}
-
-
-
-#' Lowest step size (stride) of decimal numbers
-#'
-#' Like `step_size_scalar()` above, but `x` can have any length. The function
-#' will compute the step size of the one element of `x` with the most decimal
-#' numbers.
+#' For example, if `x` is `c(7, 3.5, 8.27)`, the greatest number of decimal
+#' places is 2, and the smallest possible difference on the level of two decimal
+#' places is `0.01`, so this value is returned.
 #'
 #' @param x Numeric (or string coercible to numeric).
 #'
@@ -1159,49 +1103,8 @@
 transform_split_parens <- function(data, end1, end2) {
 #'
 #' @noRd
 select_tested_cols <- function(data, before = "consistency") {
-  index_last_key_col <- match(before, colnames(data)) - 1L
-  data[1L:index_last_key_col]
-}
-
-
-
-#' Extract rounding class from an object
-#'
-#' @description In scrutiny, information about the rounding method used to
-#' compute some or all values in a tibble is conveyed via a class which starts
-#' on `"scr_rounding_"` and which is inherited by that tibble.
-#'
-#' This function extracts any such rounding classes from `x`. If it returns
-#' multiple classes, something went wrong with `x` earlier on.
-#'
-#' @param x Any object, but typically a tibble.
-#'
-#' @return String. Should be length 1, or else there is a problem.
-#'
-#' @noRd
-get_rounding_class <- function(x) {
-  x_cl <- class(x)
-  x_cl[stringr::str_detect(x_cl, "^scr_rounding_")]
-}
-
-
-
-#' Extract bare rounding string
-#'
-#' Wrapper around `get_rounding_class()` that removes the `"scr_rounding_"`
-#' prefix from its output.
This might be useful for re-processing the original -#' `rounding` argument's value in `reround()` or in a function that calls -#' `reround()` internally. Not currently used. -#' -#' @param x Any object, but typically a tibble. -#' -#' @return String. Should be length 1, or else there is a problem. See -#' `get_rounding_class()`. -#' -#' @noRd -get_rounding_class_arg <- function(x) { - out <- get_rounding_class(x) - stringr::str_remove(out, "^scr_rounding_") + index_last_tested_col <- match(before, colnames(data)) - 1L + data[1L:index_last_tested_col] } @@ -1236,25 +1139,6 @@ wrap_in_quotes <- function(x) { -#' Wrap into quotation marks if string -#' -#' For error messages and similar. `x` is returned unchanged unless it's a -#' string, in which case it's treated as in `wrap_in_quotes()`. -#' -#' @param x Any object. -#' -#' @return String of length `length(x)`. -#' -#' @noRd -wrap_in_quotes_if_string <- function(x) { - if (is.character(x)) { - x <- paste0("\"", x, "\"") - } - x -} - - - #' Wrap into quotation marks if string, else in backticks #' #' For error messages and similar. Like `wrap_in_quotes_if_string()` except a @@ -1268,11 +1152,10 @@ wrap_in_quotes_if_string <- function(x) { #' @noRd wrap_in_quotes_or_backticks <- function(x) { if (is.character(x)) { - x <- paste0("\"", x, "\"") + paste0("\"", x, "\"") } else { - x <- paste0("`", x, "`") + paste0("`", x, "`") } - x } @@ -1302,22 +1185,6 @@ about_equal <- function(x, y) { -#' Drop all columns with specific string in name -#' -#' All column with names that include `drop_with` are removed. -#' -#' @param data Data frame. -#' @param drop_with String (length 1). -#' -#' @return Data frame. -#' -#' @noRd -drop_cols_with <- function(data, drop_with) { - data[!stringr::str_detect(names(data), drop_with)] -} - - - #' Get name of function being called #' #' Returns the name of the function within which `name_caller_call()` is called @@ -1341,19 +1208,14 @@ name_caller_call <- function(n = 1L, wrap = TRUE) { -# "Dust" variables were used by Nick Brown and later by Lukas Wallrich in -# rsprite2. They get rid of spurious precision in reconstructed decimal numbers: -dust <- 1e-12 - - - #' Subtle variations to numbers #' -#' @description Reduplicate a numeric vector, subtly varying it below and above -#' the original. This avoids issues of spurious precision in floating-point -#' arithmetic. +#' @description Reduplicate a numeric vector, varying it below and above the +#' original by a very small number (`1e-12`). This avoids issues of spurious +#' precision in floating-point arithmetic. #' -#' The difference is the global variable `dust`. +#' Similar "dust" values were previously used by Nick Brown, and later by +#' Lukas Wallrich in rsprite2. #' #' @param x Numeric. #' @@ -1362,11 +1224,11 @@ dust <- 1e-12 #' @details The idea is to catch very minor variation from `x` introduced by #' spurious precision in floating point numbers, so that such purely #' accidental deviations don't lead to false assertions of substantively -#' important numeric difference when there is none. +#' important numeric difference. #' #' @noRd dustify <- function(x) { - c(x - dust, x + dust) + c(x - 1e-12, x + 1e-12) } diff --git a/README.Rmd b/README.Rmd index 8e6a7f99..b9d2122f 100644 --- a/README.Rmd +++ b/README.Rmd @@ -29,9 +29,9 @@ ggplot2::theme_set(ggplot2::theme_minimal()) The goal of scrutiny is to test published summary statistics for consistency using techniques like GRIM and to check their plausibility. 
The package makes these methods easy to use in a tidyverse-friendly way. It hopes to help the new field of error detection go mainstream. -Besides ready-made tests, scrutiny features a complete system for implementing new consistency tests. It also has duplication analysis, more general infrastructure for implementing error detection techniques, as well as specialized data wrangling functions. See the *Articles* tab for vignettes. +Alternatively, you can use the most important parts of the package in the [Error detection](https://errors.shinyapps.io/scrutiny) Shiny app. -scrutiny is a work in progress. You are welcome to contribute with pull requests. However, please [open an issue](https://github.com/lhdjung/scrutiny/issues) first. +Besides ready-made tests, scrutiny features a complete system for implementing new consistency tests. It also has duplication analysis, more general infrastructure for implementing error detection techniques, as well as specialized data wrangling functions. See the *Articles* tab for vignettes. Install the package from CRAN: diff --git a/README.md b/README.md index f4f43cac..2430bf3e 100644 --- a/README.md +++ b/README.md @@ -15,16 +15,15 @@ consistency using techniques like GRIM and to check their plausibility. The package makes these methods easy to use in a tidyverse-friendly way. It hopes to help the new field of error detection go mainstream. +Alternatively, you can use the most important parts of the package in the +[Error detection](https://errors.shinyapps.io/scrutiny) Shiny app. + Besides ready-made tests, scrutiny features a complete system for implementing new consistency tests. It also has duplication analysis, more general infrastructure for implementing error detection techniques, as well as specialized data wrangling functions. See the *Articles* tab for vignettes. -scrutiny is a work in progress. You are welcome to contribute with pull -requests. However, please [open an -issue](https://github.com/lhdjung/scrutiny/issues) first. - Install the package from CRAN: ``` r diff --git a/data-raw/cross-custom.R b/data-raw/cross-custom.R new file mode 100644 index 00000000..4a5369ac --- /dev/null +++ b/data-raw/cross-custom.R @@ -0,0 +1,41 @@ + +# Use `cross2_custom()` in data-gen.R to generate GRIM rasters after any change +# in the `grim_scalar()` implementation. This requires `.vary` to be its +# default, `"fastest"`. + +# These functions are copied from an MIT-licensed repo: +# https://github.com/lhdjung/lukas_jung_blog/blob/17a835e07d7096026ec8a88da994157afa393939/posts/purrr_cross_replacement/index.qmd +# They are rewritten versions of `purrr::cross2()` and the lower-level +# `purrr::cross()`, which were deprecated, so they are going to be removed from +# purrr at some point. + +# Using custom rewritten versions is an experimental alternative to the plan of +# temporarily reinstalling an older version of purrr.
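+#
+# (Editorial sketch, not part of the package: with both helpers below defined,
+# a quick sanity check is
+#
+#   cross2_custom(1:2, c("a", "b"), .vary = "fastest")
+#
+# which should return a list with elements `.x` and `.y` that jointly
+# enumerate all four combinations, `.x` varying fastest. Note the column-wise
+# layout, unlike the list-of-pairs layout of the deprecated purrr::cross2().)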
This plan is outlined at: +# https://github.com/lhdjung/scrutiny/issues/53 + + +# Basic function +cross_custom <- function(.l, .filter = NULL, + .vary = c("fastest", "slowest")) { + out <- vctrs::vec_expand_grid(!!!.l, .vary = .vary) + if (is.null(.filter)) { + return(as.list(out)) + } else if (!is.function(.filter)) { + cli::cli_abort(c( + "`.filter` must be a function.", + "i" = "(Also, it needs to return a single logical value.)" + )) + } + out %>% + dplyr::filter(!.filter(.x, .y)) %>% + as.list() +} + +# Special cases + +cross2_custom <- function(.x, .y, .filter = NULL, + .vary = c("fastest", "slowest")) { + cross_custom( + list(.x = .x, .y = .y), .filter = .filter, .vary = .vary + ) +} diff --git a/man/audit_list.Rd b/man/audit_list.Rd deleted file mode 100644 index 88db898e..00000000 --- a/man/audit_list.Rd +++ /dev/null @@ -1,27 +0,0 @@ -% Generated by roxygen2: do not edit by hand -% Please edit documentation in R/audit.R -\name{audit_list} -\alias{audit_list} -\title{Summaries in list form} -\usage{ -audit_list(data) -} -\value{ -Named list of \code{audit()}'s results. -} -\description{ -\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#deprecated}{\figure{lifecycle-deprecated.svg}{options: alt='[Deprecated]'}}}{\strong{[Deprecated]}} - -\code{audit_list()} is deprecated. Use \code{audit()} instead. - -It was meant to be used when \code{audit()} would have returned tibbles that -were too wide to be read. However, the output format for \code{audit()} has now -been overhauled, there is no longer a need for \code{audit_list()}. -} -\examples{ -# Only use `audit()` instead: -pigs1 \%>\% - grim_map() \%>\% - audit() -} -\keyword{internal} diff --git a/man/debit.Rd b/man/debit.Rd index cf367e05..2b45a0d3 100644 --- a/man/debit.Rd +++ b/man/debit.Rd @@ -44,7 +44,7 @@ Logical. \code{TRUE} if \code{x}, \code{sd}, and \code{n} are mutually consisten } \description{ \code{debit()} tests summaries of binary data for consistency: If the -mean and the standard deviation of binary data are given, are they +mean and the sample standard deviation of binary data are given, are they consistent with the reported sample size? The function is vectorized, but it is recommended to use \code{\link[=debit_map]{debit_map()}} diff --git a/man/debit_map.Rd b/man/debit_map.Rd index 4f3fc933..76ba016b 100644 --- a/man/debit_map.Rd +++ b/man/debit_map.Rd @@ -43,8 +43,8 @@ generic. } \description{ Call \code{debit_map()} to use DEBIT on multiple combinations of -mean, standard deviation, and sample size of binary distributions. Mapping -function for \code{\link[=debit]{debit()}}. +mean, sample standard deviation, and sample size of binary distributions. +Mapping function for \code{\link[=debit]{debit()}}. For summary statistics, call \code{\link[=audit]{audit()}} on the results. } diff --git a/man/duplicate_count.Rd b/man/duplicate_count.Rd index 5bbfbb26..27f795b0 100644 --- a/man/duplicate_count.Rd +++ b/man/duplicate_count.Rd @@ -4,12 +4,7 @@ \alias{duplicate_count} \title{Count duplicate values} \usage{ -duplicate_count( - x, - ignore = NULL, - locations_type = c("character", "list"), - numeric_only = deprecated() -) +duplicate_count(x, ignore = NULL, locations_type = c("character", "list")) } \arguments{ \item{x}{Vector or data frame.} @@ -20,9 +15,6 @@ duplicate_count( \code{"list"}, each \code{locations} value is a vector of column names, which is better for further programming. 
By default (\code{"character"}), the column names are pasted into a string, which is more readable.} - -\item{numeric_only}{[\link{Deprecated}] No longer used: All values are coerced to -character.} } \value{ If \code{x} is a data frame or another named vector, a tibble with four @@ -46,13 +38,6 @@ too informative if many values have few characters each. For summary statistics, call \code{\link[=audit]{audit()}} on the results. } -\details{ -Don't use \code{numeric_only}. It no longer has any effect and will be -removed in the future. The only reason for this argument was the risk of -errors introduced by coercing values to numeric. This is no longer an issue -because all values are now coerced to character, which is more appropriate -for checking reported statistics. -} \section{Summaries with \code{\link[=audit]{audit()}}}{ There is an S3 method for the \code{\link[=audit]{audit()}} generic, so you can call \code{\link[=audit]{audit()}} following diff --git a/man/duplicate_count_colpair.Rd b/man/duplicate_count_colpair.Rd index 7aa548a4..13e64233 100644 --- a/man/duplicate_count_colpair.Rd +++ b/man/duplicate_count_colpair.Rd @@ -4,12 +4,7 @@ \alias{duplicate_count_colpair} \title{Count duplicate values by column} \usage{ -duplicate_count_colpair( - data, - ignore = NULL, - show_rates = TRUE, - na.rm = deprecated() -) +duplicate_count_colpair(data, ignore = NULL, show_rates = TRUE) } \arguments{ \item{data}{Data frame.} @@ -20,8 +15,6 @@ duplicates.} \item{show_rates}{Logical. If \code{TRUE} (the default), adds columns \code{rate_x} and \code{rate_y}. See value section. Set \code{show_rates} to \code{FALSE} for higher performance.} - -\item{na.rm}{[\link{Deprecated}] Missing values are never counted in any case.} } \value{ A tibble (data frame) with these columns – diff --git a/man/duplicate_detect.Rd b/man/duplicate_detect.Rd index 40d597ac..40ec234e 100644 --- a/man/duplicate_detect.Rd +++ b/man/duplicate_detect.Rd @@ -4,7 +4,7 @@ \alias{duplicate_detect} \title{Detect duplicate values} \usage{ -duplicate_detect(x, ignore = NULL, colname_end = "dup", numeric_only) +duplicate_detect(x, ignore = NULL, colname_end = "dup") } \arguments{ \item{x}{Vector or data frame.} @@ -14,9 +14,6 @@ the test result columns, they will be marked \code{NA}.} \item{colname_end}{String. Name ending of the logical test result columns. Default is \code{"dup"}.} - -\item{numeric_only}{[\link{Deprecated}] No longer used: All values are coerced to -character.} } \value{ A tibble (data frame). 
It has all the columns from \code{x}, and to each diff --git a/man/figures/README-unnamed-chunk-5-1.png b/man/figures/README-unnamed-chunk-5-1.png index 37a9ed8b..c0916ffd 100644 Binary files a/man/figures/README-unnamed-chunk-5-1.png and b/man/figures/README-unnamed-chunk-5-1.png differ diff --git a/man/figures/README-unnamed-chunk-6-1.png b/man/figures/README-unnamed-chunk-6-1.png index 07e8e002..b9187c7f 100644 Binary files a/man/figures/README-unnamed-chunk-6-1.png and b/man/figures/README-unnamed-chunk-6-1.png differ diff --git a/man/grim-stats.Rd b/man/grim-stats.Rd index e6f555bf..2594f0f2 100644 --- a/man/grim-stats.Rd +++ b/man/grim-stats.Rd @@ -2,23 +2,22 @@ % Please edit documentation in R/grim-stats.R \name{grim-stats} \alias{grim-stats} -\alias{grim_total} +\alias{grim_probability} \alias{grim_ratio} -\alias{grim_ratio_upper} +\alias{grim_total} \title{Possible GRIM inconsistencies} \usage{ -grim_total(x, n, items = 1, percent = FALSE) +grim_probability(x, n, items = 1, percent = FALSE) grim_ratio(x, n, items = 1, percent = FALSE) -grim_ratio_upper(x, percent = FALSE) +grim_total(x, n, items = 1, percent = FALSE) } \arguments{ -\item{x}{String or numeric (length 1). Mean or percentage value computed from -data with integer units (e.g., mean scores on a Likert scale or percentage -of study participants in some condition). \emph{Note}: Numeric inputs don't -include trailing zeros, but these are important for GRIM functions. See -documentation for \code{grim()}.} +\item{x}{String (length 1). Mean or percentage value computed from data with +integer units, e.g., mean scores on a Likert scale or percentage of study +participants in some condition. It has to be a string to capture any trailing +zeros.} \item{n}{Integer. Sample size corresponding to \code{x}.} @@ -30,37 +29,43 @@ proportion of 100 rather than 1. The functions will then account for this fact through increasing the decimal count by 2. Default is \code{FALSE}.} } \value{ -Integer or double. The number or proportion of possible GRIM -inconsistencies. +Integer or double. The number of possible GRIM inconsistencies, or +their probability for a random mean or percentage with a given number of +decimal places. } \description{ -Even without GRIM-testing, means / proportions and sample sizes -of granular distributions entail some key data: +These functions compute statistics related to GRIM-testing. In +general, \code{grim_probability()} is the most useful of them, and it is +responsible for the \code{probability} column in a data frame returned by +\code{\link[=grim_map]{grim_map()}}. \itemize{ +\item \code{grim_probability()} returns the probability that a reported mean or +percentage of integer data is inconsistent with the reported sample size, +assuming the value is random except for its number of decimal places. For +example, the mean 1.23 is treated like any other mean with two decimal places. +\item \code{grim_ratio()} is equal to \code{grim_probability()} unless \code{grim_ratio()} is +negative, which can occur if the sample size is very large. Strictly +speaking, this is more informative than \code{grim_probability()}, but it is +harder to interpret. \item \code{grim_total()} returns the absolute number of GRIM-inconsistencies that -are possible given the mean or percentage's number of decimal places (\code{D}) -and the corresponding sample size. -\item \code{grim_ratio()} returns a proportion that is normalized by \code{10^D}, and -therefore comparable across mean or percentage values reported to varying -\code{D}.
-\item \code{grim_ratio_upper()} returns the upper bound of \code{grim_ratio()} for a -given \code{D}. +are possible given the mean or percentage's number of decimal places and +the corresponding sample size. } For discussion, see \code{vignette("grim")}, section \emph{GRIM statistics}. } \examples{ # Many value sets are inconsistent here: +grim_probability(x = "83.29", n = 21) grim_total(x = "83.29", n = 21) -grim_ratio(x = "83.29", n = 21) # No sets are inconsistent in this case... +grim_probability(x = "5.14", n = 83) grim_total(x = "5.14", n = 83) -grim_ratio(x = "5.14", n = 83) # ... but most would be if `x` was a percentage: +grim_probability(x = "5.14", n = 83, percent = TRUE) grim_total(x = "5.14", n = 83, percent = TRUE) -grim_ratio(x = "5.14", n = 83, percent = TRUE) } \references{ Brown, N. J. L., & Heathers, J. A. J. (2017). The GRIM Test: A Simple Technique Detects Numerous Anomalies in the Reporting of Results in Psychology. \emph{Social Psychological and Personality Science}, 8(4), 363–369. https://journals.sagepub.com/doi/10.1177/1948550616673876 } \seealso{ -\code{grim()} for the GRIM test itself; as well as \code{grim_map()} for +\code{\link[=grim]{grim()}} for the GRIM test itself; as well as \code{\link[=grim_map]{grim_map()}} for applying it to many cases at once. } diff --git a/man/grim_granularity.Rd b/man/grim_granularity.Rd index 48a7b077..76719ae8 100644 --- a/man/grim_granularity.Rd +++ b/man/grim_granularity.Rd @@ -17,9 +17,10 @@ which will hold for most non-Likert scales.} \item{gran}{Numeric. Granularity.} -\item{tolerance}{Numeric. In \code{grim_items()}, \code{tolerance} is the maximal -amount by which results may differ from being whole numbers. If they exceed -that amount, a warning will be shown.} +\item{tolerance}{Numeric. Any difference between \code{x} and a truncated version +of \code{x} less than \code{tolerance} (in absolute value) will be ignored. The +default is close to \code{1 / (10 ^ 8)}. This avoids errors due to spurious +precision in floating-point arithmetic.} } \value{ Numeric. Granularity or number of items. @@ -38,9 +39,7 @@ the underlying formula is the same (and it's very simple). However, for clarity, they are presented as distinct. The output of \code{grim_items()} should be whole numbers, because scale items -have a granularity of 1. If they differ from the next whole number by more -than a numeric \code{tolerance} (which is determined by the argument by that -name), a warning will be shown. +have a granularity of 1. It would be wrong to determine a scale's granularity from the minimal distance between two values in a given distribution. This would only diff --git a/man/grim_map.Rd b/man/grim_map.Rd index 56aedd53..e269f140 100644 --- a/man/grim_map.Rd +++ b/man/grim_map.Rd @@ -12,7 +12,7 @@ grim_map( x = NULL, n = NULL, show_rec = FALSE, - show_prob = FALSE, + show_prob = deprecated(), rounding = "up_or_down", threshold = 5, symmetric = FALSE, @@ -46,9 +46,9 @@ Default is \code{FALSE}.} GRIM-testing are shown as columns. See section \emph{Reconstructed numbers} below. Default is \code{FALSE}.} -\item{show_prob}{Logical. If set to \code{TRUE}, adds a \code{prob} column that -contains the probability of GRIM inconsistency. This is simply the \code{ratio} -column censored to range between 0 and 1. Default is \code{FALSE}.} +\item{show_prob}{\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#deprecated}{\figure{lifecycle-deprecated.svg}{options: alt='[Deprecated]'}}}{\strong{[Deprecated]}} Logical. No longer +supported: now, there is always a \code{probability} column.
(It replaces the +earlier \code{ratio} column.)} \item{rounding, threshold, symmetric, tolerance}{Further parameters of GRIM-testing; see documentation for \code{\link[=grim]{grim()}}.} @@ -67,8 +67,9 @@ A tibble with these columns -- \itemize{ \item \code{x}, \code{n}: the inputs. \item \code{consistency}: GRIM consistency of \code{x}, \code{n}, and \code{items}. +\item \code{probability}: the probability of GRIM inconsistency; see +\code{\link[=grim_probability]{grim_probability()}}. \item \verb{}: any columns from \code{data} other than \code{x}, \code{n}, and \code{items}. -\item \code{ratio}: the GRIM ratio; see \code{\link[=grim_ratio]{grim_ratio()}}. The tibble has the \code{scr_grim_map} class, which is recognized by the \code{\link[=audit]{audit()}} generic. @@ -85,7 +86,7 @@ convert \code{x} values to decimals and adjust the decimal count accordingly. Display intermediary numbers from GRIM-testing in columns by setting \code{show_rec} to \code{TRUE}. -For summary statistics, call \verb{[audit()}] on the results. +For summary statistics, call \code{\link[=audit]{audit()}} on the results. } \section{Reconstructed numbers}{ If \code{show_rec} is set to \code{TRUE}, the output @@ -112,10 +113,10 @@ so you can call \code{\link[=audit]{audit()}} following \code{grim_map()} to get \item \code{incons_cases}: number of GRIM-inconsistent value sets. \item \code{all_cases}: total number of value sets. \item \code{incons_rate}: proportion of GRIM-inconsistent value sets. -\item \code{mean_grim_ratio}: average of GRIM ratios. -\item \code{incons_to_ratio}: ratio of \code{incons_rate} to \code{mean_grim_ratio}. +\item \code{mean_grim_prob}: average probability of GRIM inconsistency. +\item \code{incons_to_prob}: ratio of \code{incons_rate} to \code{mean_grim_prob}. \item \code{testable_cases}: number of GRIM-testable value sets (i.e., those with a -positive ratio). +positive \code{probability}). \item \code{testable_rate}: proportion of GRIM-testable value sets. } } diff --git a/man/grim_ratio_upper.Rd b/man/grim_ratio_upper.Rd new file mode 100644 index 00000000..e4cb93ac --- /dev/null +++ b/man/grim_ratio_upper.Rd @@ -0,0 +1,30 @@ +% Generated by roxygen2: do not edit by hand +% Please edit documentation in R/grim-stats.R +\name{grim_ratio_upper} +\alias{grim_ratio_upper} +\title{Upper bound for the GRIM ratio} +\usage{ +grim_ratio_upper(x, percent = FALSE) +} +\arguments{ +\item{x}{String (length 1). Mean or percentage value computed from data with +integer units, e.g., mean scores on a Likert scale or percentage of study +participants in some condition. It has to be a string to capture any trailing +zeros.} + +\item{percent}{Logical. Set \code{percent} to \code{TRUE} if \code{x} is expressed as a +proportion of 100 rather than 1. The functions will then account for this +fact through increasing the decimal count by 2. Default is \code{FALSE}.} +} +\value{ +Numeric. +} +\description{ +\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#deprecated}{\figure{lifecycle-deprecated.svg}{options: alt='[Deprecated]'}}}{\strong{[Deprecated]}} + +\code{grim_ratio_upper()} is deprecated because it no longer seems very +meaningful. It will be removed in a future version. + +See \code{\link[=grim_probability]{grim_probability()}} for a more interesting measure.
+} +\keyword{internal} diff --git a/man/restore_zeros.Rd b/man/restore_zeros.Rd index b509c066..1be8c6be 100644 --- a/man/restore_zeros.Rd +++ b/man/restore_zeros.Rd @@ -5,13 +5,7 @@ \alias{restore_zeros_df} \title{Restore trailing zeros} \usage{ -restore_zeros( - x, - width = NULL, - sep_in = "\\\\.", - sep_out = sep_in, - sep = deprecated() -) +restore_zeros(x, width = NULL, sep_in = "\\\\.", sep_out = sep_in) restore_zeros_df( data, @@ -21,7 +15,6 @@ restore_zeros_df( width = NULL, sep_in = "\\\\.", sep_out = NULL, - sep = deprecated(), ... ) } @@ -40,9 +33,6 @@ part. Default is \code{"\\\\."}, which renders a decimal point.} mantissa from the integer part. By default, \code{sep_out} is the same as \code{sep_in}.} -\item{sep}{[\link{Deprecated}] Use \code{sep_in}, not \code{sep}. If \code{sep} is specified -nonetheless, \code{sep_in} takes on \code{sep}'s value.} - \item{data}{Data frame or matrix. Only in \code{restore_zeros_df()}, and instead of \code{x}.} diff --git a/man/row_to_colnames.Rd b/man/row_to_colnames.Rd index 892c7e0b..fe953980 100644 --- a/man/row_to_colnames.Rd +++ b/man/row_to_colnames.Rd @@ -40,6 +40,6 @@ data frames (converted from matrices) do sometimes have the issue described above. } \seealso{ -\code{\link[unheadr:mash_colnames]{unheadr::mash_colnames()}}, a more sophisticated solution to the +\code{unheadr::mash_colnames()}, a more sophisticated solution to the same problem. } diff --git a/man/seq-predicates.Rd b/man/seq-predicates.Rd index 6292ac82..fad19bff 100644 --- a/man/seq-predicates.Rd +++ b/man/seq-predicates.Rd @@ -92,7 +92,7 @@ is_seq_descending(x = c(9, 15, 3), test_linear = FALSE) is_seq_dispersed(1:10, from = 5, test_linear = FALSE) } \seealso{ -\code{\link[validate:is_linear_sequence]{validate::is_linear_sequence()}}, which is much like +\code{validate::is_linear_sequence()}, which is much like \code{is_seq_linear()} but more permissive with \code{NA} values. It comes with some additional features, such as support for date-times. } diff --git a/man/split_by_parens.Rd b/man/split_by_parens.Rd index 3868d2c6..58d8b2cc 100644 --- a/man/split_by_parens.Rd +++ b/man/split_by_parens.Rd @@ -123,7 +123,7 @@ df3 \%>\% \item \code{\link[=before_parens]{before_parens()}} and \code{\link[=inside_parens]{inside_parens()}} take a string vector and extract values from the respective position. \item \code{\link[dplyr:across]{dplyr::across()}} powers the application of the two above functions -within split_by_parens()`, including the creation of new columns. +within \code{split_by_parens()}, including the creation of new columns. \item \code{\link[tidyr:separate_wider_delim]{tidyr::separate_wider_delim()}} is a more general function, but it does not recognize closing elements such as closed parentheses. } diff --git a/man/subset-superset.Rd b/man/subset-superset.Rd index 29a877ba..c4d31270 100644 --- a/man/subset-superset.Rd +++ b/man/subset-superset.Rd @@ -64,9 +64,16 @@ A single logical value. \code{TRUE} if the respective test was passed, \code{FALSE} otherwise. } \description{ -Predicate functions that take a vector and test whether it has -some particular relation to another vector. That second vector is entered -in either of three ways -- +\ifelse{html}{\href{https://lifecycle.r-lib.org/articles/stages.html#deprecated}{\figure{lifecycle-deprecated.svg}{options: alt='[Deprecated]'}}}{\strong{[Deprecated]}} + +All of these functions are deprecated and will be removed in a future +version. 
They are a poor fit for the problem they try to solve, and they +are far out of scope for the package. +\subsection{Original documentation}{ + +Predicate functions that take a vector and test whether it has some +particular relation to another vector. That second vector is entered in +one of three ways -- \strong{Enter it directly (basic functions):} @@ -95,6 +102,7 @@ variants also test whether the sets are unequal, so that \code{x} is a subset of \code{y} but \code{y} is not a subset of \code{x}. The same applies to \verb{is_superset*()} functions and their \verb{is_proper_superset*()} variants. } +} \details{ The \verb{*_vals()} variants are meant for flexible, interactive subset/superset testing. That is, in order to test whether certain values @@ -143,3 +151,4 @@ abc \%>\% is_subset_of_vals("a", "b", "c", "d", "e") # (... and likewise for supersets and equal sets.) } +\keyword{internal} diff --git a/special-scripts/grimmer-replace-names.R b/special-scripts/grimmer-replace-names.R new file mode 100644 index 00000000..8637a2ac --- /dev/null +++ b/special-scripts/grimmer-replace-names.R @@ -0,0 +1,75 @@ + +# NOTE: All variables that should be renamed in scrutiny's version of GRIMMER +# need to be entered here! + +# -- Variables that can't be replaced targeting whole words only, such as any +# that include dots, need to be entered in `grimmer_names_whole_word_false`. +# -- All other variables need to be entered in `grimmer_names`. + + +# Load helper functions: +source("special-scripts/replace-in-file.R") + +# File where variable names should be replaced: +path_file <- "R/grimmer-rsprite2.R" + + +# Dots don't count as word characters, so the script needs this extra table +# with names that include dots. `replace_from_df()` will then run on it with +# `whole_word = FALSE`. Note the "\\." escape sequence! +grimmer_names_whole_word_false <- tibble::tribble( + ~rsprite2, ~scrutiny, + "\\.sd_limits", "sd_bounds_measure" ) + +replace_from_df( + path = path_file, + df_names = grimmer_names_whole_word_false, + col_pattern = "rsprite2", + col_replacement = "scrutiny", + whole_word = FALSE ) + + +grimmer_names <- tibble::tribble( + ~rsprite2, ~scrutiny, + # --- Function --- + "GRIMMER_test", "grimmer_scalar", + # --- Arguments --- + # (Note that some arguments from `rsprite2::GRIMMER_test()` are missing in + # scrutiny's adapted version, and vice versa.)
+ "mean", "x", + "SD", "sd", + "n_obs", "n", + "n_items", "items", + # --- Internal variables --- + "decimals_mean", "digits_x", + "decimals_SD", "digits_sd", + "realmean", "x_real", + "realsum", "sum_real", + "sd_limits", "sd_bounds", + "effective_n", "n_items", + "Lsigma", "sd_lower", + "Usigma", "sd_upper", + "Lowerbound", "sum_squares_lower", + "Upperbound", "sum_squares_upper", + "possible_integers", "integers_possible", + "Predicted_Variance", "var_predicted", + "Predicted_SD", "sd_predicted", + "oddness", "parity", + "Matches_Oddness", "matches_parity", + "FirstTest", "pass_test1", + "Rounded_SD_down", "sd_rounded_down", + "Rounded_SD_up", "sd_rounded_up", + "Matches_SD", "matches_sd", + "Third_Test", "pass_test3", +) + +# Replace the rsprite2 variable names by those of scrutiny: +replace_from_df( + path = path_file, + df_names = grimmer_names, + col_pattern = "rsprite2", + col_replacement = "scrutiny" +) + diff --git a/special-scripts/grimmer-rsprite2-original/core-functions.R b/special-scripts/grimmer-rsprite2-original/core-functions.R new file mode 100644 index 00000000..49b933b9 --- /dev/null +++ b/special-scripts/grimmer-rsprite2-original/core-functions.R @@ -0,0 +1,847 @@ +#' Define parameters for SPRITE algorithm +#' +#' The SPRITE algorithm aims to construct possible distributions that conform to +#' observed/reported parameters. This function performs some checks and returns a list of these +#' parameters that can then be passed to the functions that actually generate +#' the distributions (e.g. \code{\link{find_possible_distribution}}) +#' +#' Restrictions can be used to define how often a specific value should appear in the sample. +#' They need to be passed as a list in the form `value = frequency`. Thus, to specify that +#' there should be no 3s and five 4s in the distribution, you would pass +#' `restrictions_exact = list("3" = 0, "4" = 5)`. To specify that there should be at least +#' one 1 and one 7, you would pass `restrictions_minimum = list("1" = 1, "7" = 1)`. If you just want to +#' specify that the minimum and maximum values appear at least once (for instance when they are the +#' reported rather than possible range), you can use the shortcut `restrictions_minimum = "range"`. Finally, +#' if you work with multi-item scales that result in decimal responses, round those names to two decimal points, e.g., +#' when `n_items = 3` you could specify `list("1.67" = 0)`. +#' +#' @param mean The mean of the distribution +#' @param sd The standard deviation of the distribution +#' @param n_obs The number of observations (sample size) +#' @param min_val The minimum value +#' @param max_val The maximum value +#' @param m_prec The precision of the mean, as number of digits after the decimal point. +#' If not provided, taken based on the significant digits of `mean` - so only needed if reported mean ends in 0 +#' @param sd_prec The precision of the standard deviation, again only needed if +#' reported standard deviation ends in 0. +#' @param n_items Number of items in scale, if distribution represents scale averages. +#' Defaults to 1, which represents any single-item measure. 
+#' @param restrictions_exact Restrictions on the exact frequency of specific responses, see Details +#' @param restrictions_minimum Restrictions on the minimum frequency of specific responses, see Details +#' @param dont_test By default, this function tests whether the mean is possible, given the sample size (GRIM-test) and whether +#' the standard deviation is possible, given mean and sample size (GRIMMER test), and fails otherwise. If you want to override this, +#' and run SPRITE anyway, you can set this to TRUE. +#' +#' @return A named list of parameters, pre-processed for further rsprite2 functions. +#' +#' @examples +#' +#' set.seed(1234) #To get reproducible results +#' +#' # Simple case +#' sprite_parameters <- set_parameters(mean = 2.2, sd = 1.3, n_obs = 20, min_val = 1, max_val = 5) +#' find_possible_distribution(sprite_parameters) +#' +#' # With restrictions +#' sprite_parameters <- set_parameters(mean = 1.95, sd = 1.55, n_obs = 20, +#' min_val = 1, max_val = 5, n_items = 3, +#' restrictions_exact = list("3"=0, "3.67" = 2), +#' restrictions_minimum = "range") +#' find_possible_distribution(sprite_parameters) +#' +#' @export + +set_parameters <- function(mean, sd, n_obs, min_val, max_val, + m_prec = NULL, sd_prec = NULL, + n_items = 1, restrictions_exact = NULL, + restrictions_minimum = NULL, + dont_test = FALSE) { + if (is.null(m_prec)) { + m_prec <- max(nchar(sub("^[0-9]*", "", mean)) - 1, 0) + } + + if (is.null(sd_prec)) { + sd_prec <- max(nchar(sub("^[0-9]*", "", sd)) - 1, 0) + } + + assert_count(m_prec) + assert_count(sd_prec) + assert_count(n_obs) + assert_count(n_items) + assert_int(min_val) + assert_int(max_val) + assert_number(mean) + assert_number(sd) + + if (min_val >= max_val) { + stop("max_val needs to be larger than min_val") + } + + if (!dont_test) { + + if (n_obs * n_items <= 10 ^ m_prec) { + if (!GRIM_test(mean, n_obs, m_prec, n_items)) { + stop("The mean is not consistent with this number of observations (fails GRIM test). + You can use GRIM_test() to identify the closest possible mean and try again.") + } + } + + if (!GRIMMER_test(mean, sd, n_obs, m_prec, sd_prec, n_items)) { + stop("The standard deviation is not consistent with this mean and number of observations (fails GRIMMER test). + For details, see ?GRIMMER_test.") + } + } + + sd_limits <- .sd_limits(n_obs, mean, min_val, max_val, sd_prec, n_items) + + if (!(sd >= sd_limits[1] & sd <= sd_limits[2])) { + stop("The standard deviation is outside the possible range, given the other parameters. 
+ It should be between ", sd_limits[1], " and ", sd_limits[2], ".") + } + + if (!(mean >= min_val & mean <= max_val)) { + stop("The mean is outside the possible range, which is impossible - please check inputs.") + } + + if (isTRUE(checkmate::check_choice(restrictions_minimum, "range"))) { + restrictions_minimum <- list(1, 1) + names(restrictions_minimum) <- c(min_val, max_val) + } + + poss_values <- max_val + for (i in seq_len(n_items)) { + poss_values <- c(poss_values, min_val:(max_val-1) + (1 / n_items) * (i - 1)) + } + poss_values <- sort(poss_values) + + poss_values_chr <- round(poss_values, 2) + + fixed_responses <- numeric() + fixed_values <- NA + + + if(!is.null(restrictions_minimum)&!is.null(restrictions_exact)) { + + if(any(duplicated(c(round(as.numeric(names(restrictions_exact)), 2), round(as.numeric(names(restrictions_minimum)), 2))))) { + duplicated <- c(round(as.numeric(names(restrictions_exact)), 2), round(as.numeric(names(restrictions_minimum)), 2))[duplicated(c(round(as.numeric(names(restrictions_exact)), 2), round(as.numeric(names(restrictions_minimum)), 2)))] + stop("Several restrictions for same value found. Ensure there is only one restriction (exact or minimum) for: ", duplicated) + } +} + if(!is.null(restrictions_minimum)) { + + if(any(!(round(as.numeric(names(restrictions_minimum)), 2) %in% poss_values_chr))) { + no_match <- names(restrictions_minimum)[!(round(as.numeric(names(restrictions_minimum)), 2) %in% poss_values_chr)] + stop("Invalid names in restrictions_minimum. The following could not be matched to possible response values: ", no_match) + } + + #Ensure restrictions are ordered + restrictions_minimum <- restrictions_minimum[as.character(poss_values_chr[poss_values_chr %in% names(restrictions_minimum)])] + + fixed_responses <- c(fixed_responses, rep(poss_values[poss_values_chr %in% names(restrictions_minimum)], unlist(restrictions_minimum))) + + } + + if(!is.null(restrictions_exact)) { + + + if(any(!(round(as.numeric(names(restrictions_exact)), 2) %in% poss_values_chr))) { + no_match <- names(restrictions_exact)[!(round(as.numeric(names(restrictions_exact)), 2) %in% poss_values_chr)] + stop("Invalid names in restrictions_exact. The following could not be matched to possible response values: ", no_match) + } + + + #Ensure restrictions are ordered + restrictions_exact <- restrictions_exact[as.character(poss_values_chr[poss_values_chr %in% names(restrictions_exact)])] + + fixed_responses <- c(fixed_responses, rep(poss_values[poss_values_chr %in% names(restrictions_exact)], unlist(restrictions_exact))) + + fixed_values <- poss_values[poss_values_chr %in% names(restrictions_exact)] + } + + possible_values <- setdiff(poss_values, fixed_values) + n_fixed <- length(fixed_responses) + + out <- .named_list(mean, sd, n_obs, min_val, max_val, m_prec, sd_prec, n_items, restrictions_minimum, restrictions_exact, possible_values, fixed_values, fixed_responses, n_fixed) + + class(out) <- c("sprite_parameters", class(out)) + + out +} + +.named_list <- function(...) { + out <- list(...) + names(out) <- as.list(match.call())[-1] + out +} + +#' Find several possible distributions. +#' +#' This function aims to find several possible distribution that would give rise to +#' the observed sample parameters. For that, you need to pass a list of parameters, +#' created with \code{\link{set_parameters}} +#' +#' @param parameters List of parameters, see \code{\link{set_parameters}} +#' @param n_distributions The target number of distributions to return. 
+#' @param seed An integer to use as the seed for random number generation. Set this in scripts to ensure reproducibility. +#' @param return_tibble Should a tibble, rather than a list, be returned? Requires the `tibble`-package, ignored if that package is not available. +#' @param return_failures Should distributions that failed to produce the desired SD be returned? Defaults to false +#' +#' @return A tibble or list (depending on the `return_tibble` argument) with: +#' \item{outcome}{success or failure - character} +#' \item{distribution}{The distribution that was found (if success) / that had the closest variance (if failure) - numeric} +#' \item{mean}{The exact mean of the distribution - numeric} +#' \item{sd}{The SD of the distribution that was found (success) / that came closest (failure) - numeric} +#' \item{iterations}{The number of iterations required to achieve the specified SD - numeric - the first time this distribution was found} +#' +#' @examples +#' +#' sprite_parameters <- set_parameters(mean = 2.2, sd = 1.3, n_obs = 20, +#' min_val = 1, max_val = 5) +#' +#' find_possible_distributions(sprite_parameters, 5, seed = 1234) +#' +#' @export + + +find_possible_distributions <- function(parameters, n_distributions = 10, seed = NULL, return_tibble = TRUE, return_failures = FALSE) { + + if (!is.null(seed)) { + assert_int(seed) + set.seed(seed) + } + + assert_count(n_distributions) + assert_logical(return_tibble) + + outcome <- character() + distributions <- list() + found_sd <- numeric() + found_mean <- numeric() + iterations <- numeric() + + duplications <- 0 + + for (i in 1:(n_distributions * rSprite.maxDupLoops)) { + + n_found <- sum(outcome == "success") + + #This break should possibly be earlier? + if(length(outcome) - max(c(which(outcome == "success"),0)) >= 10) { + warning("No successful distribution found in last 10 attempts. Exiting.", if (n_found == 0) " There might not be any possible distribution, but you can try running the search again.") + break + } + if (n_found >= n_distributions) break + + # Calculate the maximum number of consecutive duplicates we will accept before deciding to give up. + # The value of 0.00001 below is our nominal acceptable chance of missing a valid solution; + # however, it's extremely likely that all possible solutions are not all equally likely to be found. + # So we also set a floor of 100 attempts. + max_duplications <- max(round(log(0.00001) / log(n_found / (n_found + 1))), 100) + + res <- find_possible_distribution(parameters, seed = NULL) + + res$values <- sort(res$values) # sorting lets us find duplicates more easily + + distributions <- c(list(res$values), distributions) + if(head(duplicated(distributions, fromLast = TRUE), 1)) { + distributions <- distributions[-1] + duplications <- duplications + 1 + if (duplications > max_duplications) { + break + } + } else { + outcome <- c(res$outcome, outcome) + found_sd <- c(res$sd, found_sd) + found_mean <- c(res$mean, found_mean) + iterations <- c(res$iterations, iterations) + } + + } + + if (n_found < n_distributions) message("Only ", n_found, " matching distributions could be found. 
You can try again - given that SPRITE is based on random number generation, more distributions might be found then.") + + if (return_tibble & suppressWarnings(requireNamespace("tibble", quietly = TRUE))) { + out <- tibble::tibble(id = seq_along(outcome), outcome = outcome, distribution = distributions, mean = found_mean, sd = found_sd, iterations = iterations) + class(out) <- c("sprite_distributions", class(out)) + attr(out, "parameters") <- parameters + if(!return_failures) return(out[out$outcome == "success",]) + out + } else { + if(!return_failures) { + successes <- outcome == "success" + return(list(outcome = outcome[successes], distribution = distributions[successes], mean = found_mean[successes], sd = found_sd[successes], iterations = iterations[successes])) + } + list(outcome = outcome, distribution = distributions, mean = found_mean, sd = found_sd, iterations = iterations) + } + +} + +#' Find a possible distribution. +#' +#' This function aims to find a possible distribution that would give rise to +#' the observed sample parameters. For that, you need to pass a list of parameters, +#' best created with \code{\link{set_parameters}} +#' +#' @param parameters List of parameters, see \code{\link{set_parameters}} +#' @param seed An integer to use as the seed for random number generation. Set this in scripts to ensure reproducibility. +#' @param values_only Should only values or a more informative list be returned. See Value section. +#' +#' @return Unless `values_only = TRUE`, a list with: +#' \item{outcome}{success or failure - character} +#' \item{distribution}{The distribution that was found (if success) / that had the closest variance (if failure) - numeric} +#' \item{mean}{The exact mean of the distribution - numeric} +#' \item{sd}{The SD of the distribution that was found (success) / that came closest (failure) - numeric} +#' \item{iterations}{The number of iterations required to achieve the specified SD - numeric} +#' If `values_only = TRUE`, then the distribution is returned if one was found, and NULL if it failed. +#' +#' @examples +#' sprite_parameters <- set_parameters(mean = 2.2, sd = 1.3, n_obs = 20, +#' min_val = 1, max_val = 5) +#' find_possible_distribution(sprite_parameters) +#' +#' @export +#' + + +find_possible_distribution <- function(parameters, seed = NULL, values_only = FALSE) { + + assert_class(parameters, "sprite_parameters") + + if (!is.null(seed)) { + assert_int(seed) + set.seed(seed) + } + + + # Generate some random starting data. + rN <- parameters$n_obs - parameters$n_fixed + vec <- sample(parameters$possible_values, rN, replace = TRUE) + + # Adjust mean of starting data. + max_loops <- parameters$n_obs * length(parameters$possible_values) + vec <- .adjust_mean(max_loops, vec, parameters$fixed_responses, parameters$mean, parameters$m_prec, parameters$possible_values) + + + # Find distribution that also matches SD + maxLoops <- min(max(round(parameters$n_obs * (length(parameters$possible_values)^2)), rSprite.maxDeltaLoopsLower), rSprite.maxDeltaLoopsUpper) + granule_sd <- ((0.1^parameters$sd_prec) / 2) + rSprite.dust # allow for rounding errors + + result <- NULL + + for (i in seq_len(maxLoops)) { + + #Should one break out of loop when vec no longer changes? Prob not worth all the comparisons? 
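+    # (Editorial note: the loop accepts the candidate vector as soon as its
+    # SD is within granule_sd of the target, i.e. within half a unit of the
+    # reported SD's last decimal place plus rSprite.dust. In other words, it
+    # stops once the exact SD could round to the reported value.)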
+ current_sd <- sd(c(vec, parameters$fixed_responses)) + if (abs(current_sd - parameters$sd) <= granule_sd) { + result <- c(vec, parameters$fixed_responses) + iter <- i + break + } + vec <- .shift_values(vec, parameters$mean, parameters$sd, parameters$min_val, parameters$max_val, parameters$m_prec, parameters$sd_prec, parameters$fixed_responses, parameters$possible_values, parameters$fixed_values) + } + + if (!is.null(result)) { + if(values_only) return(result) + return(list(outcome = "success", values = result, mean = mean(result), sd = current_sd, iterations = iter)) + } else { + if(values_only) return(NULL) + return(list(outcome = "failure", values = c(vec, parameters$fixed_responses), mean = mean(c(vec, parameters$fixed_responses)), sd = current_sd, iterations = maxLoops)) + } +} + +.adjust_mean <- function(max_iter, vec, fixed_vals, target_mean, m_prec, poss_values) { #poss_values to exclude those restricted + + meanOK <- FALSE + + for (i in 1:max_iter) { + fullVec <- c(vec, fixed_vals) + current_mean <- mean(fullVec) + if ((round(current_mean, m_prec) == target_mean)) { + meanOK <- TRUE + break + } + + increaseMean <- (current_mean < target_mean) + if (increaseMean) { + filter <- (vec < (poss_values[length(poss_values)])) + } else { + filter <- (vec > (poss_values[1])) + } + + possible_bump <- which(filter) + bumpMean <- possible_bump[as.integer(runif(1) * length(possible_bump)) + 1] # select a number + vec[bumpMean] <- poss_values[which(poss_values == vec[bumpMean]) + ifelse(increaseMean, 1, -1)] + } + if (!meanOK) { + if (length(fixed_vals)>0) { + stop("Couldn't initialize data with correct mean. This *might* be because the restrictions cannot be satisfied.") + } else { + stop("Couldn't initialize data with correct mean") # this probably indicates a coding error, if the mean is in range + } + } + return(vec) +} + +.shift_values <- function(vec, target_mean, target_sd, min_val, max_val, m_prec = 2, sd_prec, fixed_responses, poss_non_restricted, fixed_vals) { #poss_vals are only those not + + # Backup + vec_original <- vec + + # Decide if we want to increment or decrement first. + incFirst <- sample(c(TRUE, FALSE), 1) + + # Decide if we need to increase or decrease the SD. + fullVec <- c(vec, fixed_responses) + increaseSD <- (sd(fullVec) < target_sd) + + poss_values <- sort(c(poss_non_restricted, fixed_vals)) + + maxToInc <- poss_values[length(poss_values) - 1] # maximum value that we can increment + minToDec <- poss_values[2] # minimum value that we can decrement + + # Select an element to increment or decrement. + # For better performance, we select from unique elements only; this means that any number that appears in the vector is + # equally likely to be chosen regardless of how often it appears. I'm not sure if this is good or bad. + # TK - check performance impact. + uniqueCanBump1 <- !duplicated(vec) + + # The element that we change should be less than the maximum (increment) or greater than the minimum (decrement). + notEdge1 <- if (incFirst) (vec <= maxToInc) else (vec >= minToDec) + indexCanBump1 <- uniqueCanBump1 & notEdge1 + + # If we can't find an element to change, just return the original vector and let our caller sort it out. 
+ if (sum(indexCanBump1) == 0) { + return(vec_original) + } + + # Unless we have no other choice: + # - If we want to make the SD larger, there is no point in incrementing the smallest element, or decrementing the largest + # - If we want to make the SD smaller, there is no point in decrementing the smallest element, or incrementing the largest + if (increaseSD) { + noPoint1 <- if (incFirst) (vec == min(vec)) else (vec == max(vec)) + } else { + noPoint1 <- if (incFirst) (vec == maxToInc) else (vec == minToDec) + } + indexCanBump1Try <- indexCanBump1 & (!noPoint1) + if (any(indexCanBump1Try)) { + indexCanBump1 <- indexCanBump1Try + } + + whichCanBump1 <- which(indexCanBump1) + whichWillBump1 <- whichCanBump1[as.integer(runif(1) * length(whichCanBump1)) + 1] + willBump1 <- vec[whichWillBump1] + new1 <- poss_non_restricted[which(poss_non_restricted == willBump1) + ifelse(incFirst, 1, -1)] + gap1 <- new1 - vec[whichWillBump1] # Note when restricted values have been skipped + vec[whichWillBump1] <- new1 + + # At this point we can decide to only change one of the elements (decrement one without incrementing another, or vice versa). + # This enables us to explore different means that still round to the same target value. + # So here we perform the first increment or decrement first, and see if the mean is still GRIM-consistent with the target mean. + # If it is, then in a proportion of cases we don't adjust the other cell. + newFullVec <- c(vec, fixed_responses) + newMean <- mean(newFullVec) + meanChanged <- (round(newMean, m_prec) != target_mean) # new mean is no longer GRIM-consistent + + if (meanChanged || (runif(1) < 0.4)) { + vecBump2 <- vec # make a scratch copy of the input vector so we can change it + vecBump2[whichWillBump1] <- NA # remove the element chosen in part 1... + uniqueCanBump2 <- !duplicated(vecBump2) # ... but if there was more than one copy of that, it's still a candidate + notEdge2 <- if (!incFirst) (vecBump2 <= maxToInc) else (vecBump2 >= minToDec) + indexCanBump2 <- uniqueCanBump2 & notEdge2 & (!is.na(vecBump2)) + + # If we can't find an element to change in the opposite direction to the first, then if the mean with the first change is still OK, + # we return either the vector with that change. Otherwise we return the original vector and let our caller sort it out. + if (sum(indexCanBump2) == 0) { + return(if (meanChanged) vec_original else vec) + } + + # Unless we have no other choice: + # - If we want to make the SD larger: + # - If in step 1 we chose an element to increment, there is no point in now changing (decrementing) a larger one + # - If in step 1 we chose an element to decrement, there is no point in now changing (incrementing) a smaller one + # - If we want to make the SD smaller: + # - If in step 1 we chose an element to increment, there is no point in now changing (decrementing) a smaller one + # - If in step 1 we chose an element to decrement, there is no point in now changing (incrementing) a larger one + # There is also no point in incrementing an element that is equal to the new value of the one that we have already chosen. 
+ noPoint2 <- ((if (increaseSD == incFirst) (vec > willBump1) else (vec < willBump1)) | + (vec == (new1)) + ) + indexCanBump2Try <- indexCanBump2 & (!noPoint2) + if (any(indexCanBump2Try)) { + indexCanBump2 <- indexCanBump2Try + } + + whichCanBump2 <- which(indexCanBump2) + whichWillBump2 <- whichCanBump2[as.integer(runif(1) * length(whichCanBump2)) + 1] + willBump2 <- vec[whichWillBump2] + new2 <- poss_non_restricted[which(poss_non_restricted == willBump2) + ifelse(incFirst, -1, 1)] + gap2 <- new2 - vec[whichWillBump2] # Note when restricted values have been skipped + + gap_resolved <- NA + # Go into restricted values handling only when necessary - should be good for performance, but + # leads to more complex backtracking here. + if (!.equalish(abs(gap1), abs(gap2))) { + gap_resolved <- FALSE + poss <- which(.equalish(diff(poss_non_restricted), abs(gap1))) + if (length(poss) > 1) { + low <- poss_non_restricted[poss] + up <- poss_non_restricted[poss + 1] + if (incFirst) { + # Should not now move down from target - otherwise we are simply reversing course + target <- which(up == new1) + } else { + target <- which(low == new1) + } + up <- up[-target] + low <- low[-target] + + if (incFirst) { + from <- up[up %in% vec] + to <- low[up %in% vec] + } else { + to <- up[low %in% vec] + from <- low[low %in% vec] + } + if (length(from) > 0) { + replace <- sample(seq_along(from), 1) + vec[cumsum(cumsum(vec == from[replace])) == 1] <- to[replace] + gap_resolved <- TRUE + } + } + + if (!gap_resolved) { + # Cannot do a single second replacement without undoing change so far + restricted_runs <- rle(poss_values %in% poss_non_restricted) + longest_run <- max(restricted_runs$lengths[restricted_runs$values == FALSE]) + vec_backup <- vec + for (i in seq_len(longest_run)) { + vec <- vec_backup + replaced <- 0 + i <- i + 1 # Gap of 1 suggest 2 steps might be needed + poss <- which(.equalish(diff(poss_non_restricted), gap1 / i)) + if (length(poss) > 0) { + low <- poss_non_restricted[poss] + up <- poss_non_restricted[poss + 1] + + for (j in seq_len(i)) { + if (incFirst) { + from <- up[up %in% vec] + to <- low[up %in% vec] + } else { + to <- up[low %in% vec] + from <- low[low %in% vec] + } + if (length(from) > 0) { + replaced <- replaced + 1 + replace <- sample(seq_along(from), 1) + vec[cumsum(cumsum(vec == from[replace])) == 1] <- to[replace] + } else { + break + } + } + if (replaced == i) { + gap_resolved <- TRUE + break + } + } + } + + + if (!gap_resolved) { + # No way to get this done with multiple replacements - so, exit + return(if (meanChanged) vec_original else vec) + } + } + } else { + vec[whichWillBump2] <- new2 + } + } + + newFullVec <- c(vec, fixed_responses) + newMean <- mean(newFullVec) + meanChanged <- (round(newMean, m_prec) != target_mean) # new mean is no longer GRIM-consistent + + # Floating point issues lead to mean drift with multi-item scales - curtail this straight-away + if (meanChanged) return(vec_original) + + #if(!.equalish(sum(added), sum(removed)) & meanChanged) browser() + + #message(gap_resolved, "Mean diff detected when adding ", added, " while removing ", removed) + + return(vec) + +} + +.equalish <- function(x, y, tol = rSprite.dust) { + x <= (y + rSprite.dust) & + x >= (y - rSprite.dust) +} + +#' GRIM test for mean +#' +#' This function tests whether a given mean (with a specific precision) can +#' result from a sample of a given size based on integer responses to one or more +#' items. The test is based on Brown & Heathers (2017). 
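+#'
+#' (Editorial sketch of the arithmetic, using the example below: a reported
+#' mean of 5.19 with `n_obs = 28` implies a raw sum of 5.19 * 28 = 145.32.
+#' The closest achievable sum is the integer 145, which corresponds to a mean
+#' of 145 / 28, about 5.1786. That rounds to 5.18 rather than 5.19, so the
+#' reported mean fails the test.)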
+#' If `return_values = TRUE` and if there is more than one precise mean compatible +#' with the given parameters, all possible means are returned. In that case, if the +#' given mean is not consistent, the closest consistent mean is returned with a +#' warning. +#' +#' @param return_values Should all means consistent with the given parameters be returned? +#' @inheritParams set_parameters +#' +#' @return Either TRUE/FALSE, or all possible means (if test passes)/closest consistent mean (if test fails) +#' @export +#' +#' @examples +#' # A sample of 28 integers cannot result in a mean of 5.19. This is shown by +#' GRIM_test(5.19, 28) +#' +#' # To find the closest possible mean, set return_values to TRUE +#' GRIM_test(5.19, 28, return_values = TRUE) +#' +#' @references +#' \insertRef{brown2017grim}{rsprite2} + + +GRIM_test <- function(mean, n_obs, m_prec = NULL, n_items = 1, return_values = FALSE) { + if (is.null(m_prec)) { + m_prec <- max(nchar(sub("^[0-9]*", "", mean)) - 1, 0) + } + + assert_count(m_prec) + assert_count(n_obs) + assert_count(n_items) + assert_logical(return_values) + assert_number(mean) + + if (n_obs * n_items > 10 ^ m_prec) { + warning("The sample size (x number of items) is too big compared to the precision of the reported mean. The GRIM test is only meaningful when N < 10 ^ precision (e.g. N < 100 for means reported to two decimal places).") + } + + int <- round(mean * n_obs * n_items) # nearest integer + frac <- int / (n_obs * n_items) # mean resulting from nearest integer + dif <- abs(mean - frac) + granule <- ((0.1^m_prec) / 2) + rSprite.dust # allow for rounding errors + if (dif > granule) { + if (!return_values) { + return(FALSE) + } + valid_mean <- round(frac, m_prec) + prec_format <- paste("%.", m_prec, "f", sep = "") + warning("Mean ", sprintf(prec_format, mean), " fails GRIM test - closest consistent value: ", sprintf(prec_format, valid_mean)) + return(valid_mean) + } else { + if (!return_values) { + return(TRUE) + } + possible_means <- frac + i <- 1 + original_int <- int + while (TRUE) { + int <- int + i + frac <- int / (n_obs * n_items) # mean resulting from nearest integer + dif <- abs(mean - frac) + if (dif > granule) break() + possible_means <- c(possible_means, frac) + i <- i + 1 + } + i <- 1 + int <- original_int + while (TRUE) { + int <- int - i + frac <- int / (n_obs * n_items) # mean resulting from nearest integer + dif <- abs(mean - frac) + if (dif > granule) break() + possible_means <- c(possible_means, frac) + i <- i + 1 + } + return(possible_means) + } + + stop("Branching error - should not get here") +} + +# Determine minimum and maximum SDs for given scale ranges, N, and mean. 
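+# (Editorial sketch: both extremes come from near-two-point distributions.
+# The SD is smallest when all values sit on the two possible responses
+# adjacent to the mean, and largest when they sit at the scale endpoints
+# min_val and max_val.)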
+.sd_limits <- function(n_obs, mean, min_val, max_val, sd_prec = NULL, n_items = 1) { + + if (is.null(sd_prec)) { + sd_prec <- max(nchar(sub("^[0-9]*", "", mean)) - 1, 0) + } + + result <- c(-Inf, Inf) + + aMax <- min_val # "aMax" means "value of a to produce the max SD" + aMin <- floor(mean*n_items)/n_items + bMax <- max(max_val, min_val + 1, aMin + 1) # sanity check (just max_val would normally be ok) + bMin <- aMin + 1/n_items + total <- round(mean * n_obs * n_items)/n_items + + poss_values <- max_val + for (i in seq_len(n_items)) { + poss_values <- c(poss_values, min_val:(max_val-1) + (1 / n_items) * (i - 1)) + } + poss_values <- sort(poss_values) + + for (abm in list(c(aMin, bMin, 1), c(aMax, bMax, 2))) { + + a <- abm[1] + b <- abm[2] + m <- abm[3] + + + k <- round((total - (n_obs * b)) / (a - b)) + k <- min(max(k, 1), n_obs - 1) # ensure there is at least one of each of two numbers + vec <- c(rep(a, k), rep(b, n_obs - k)) + diff <- sum(vec) - total + + if ((diff < 0)) { + vec <- c(rep(a, k - 1), a + abs(diff), rep(b, n_obs - k)) + } + else if ((diff > 0)) { + vec <- c(rep(a, k), b - diff, rep(b, n_obs - k - 1)) + } + + if (round(mean(vec), sd_prec) != round(mean, sd_prec) | !all(floor(vec*10e9) %in% floor(poss_values*10e9))) { + stop("Error in calculating range of possible standard deviations") + } + + result[m] <- round(sd(vec), sd_prec) + } + + return(result) +} + + +#' GRIMMER test for standard deviation +#' +#' This function tests whether a given standard deviation (with a specific precision) +#' can result from a sample of a given size based on integer responses to one or more +#' items. The test was first proposed by [Anaya (2016)](https://peerj.com/preprints/2400/); here, the algorithm +#' developed by [Allard (2018)](https://aurelienallard.netlify.app/post/anaytic-grimmer-possibility-standard-deviations/) is used, +#' extended by Aurélien Allard to support multi-item scales. +#' +#' @inheritParams set_parameters +#' @param min_val (Optional) Scale minimum. If provided alongside max_val, the function checks whether the SD is consistent with that range. +#' @param max_val (Optional) Scale maximum. +#' +#' @return Logical TRUE/FALSE indicating whether given standard deviation is possible, given the other parameters +#' @export +#' +#' @examples +#' # A sample of 18 integers with mean 3.44 cannot have an SD of 2.47. This is shown by +#' GRIMMER_test(mean = 3.44, sd = 2.47, n_obs = 18) +#' +#' +#' @references +#' \insertRef{anaya2016grimmer}{rsprite2} + +# ToDos: +# - add return_values argument to return possible SDs + +GRIMMER_test <- function(mean, sd, n_obs, m_prec = NULL, sd_prec = NULL, n_items = 1, min_val = NULL, max_val = NULL) { + if (is.null(m_prec)) { + m_prec <- max(nchar(sub("^[0-9]*", "", mean)) - 1, 0) + } + + if (is.null(sd_prec)) { + sd_prec <- max(nchar(sub("^[0-9]*", "", sd)) - 1, 0) + } + + assert_count(m_prec) + assert_count(sd_prec) + assert_count(n_obs) + assert_count(n_items) + assert_number(mean) + assert_number(sd) + + effective_n = n_obs * n_items + + # Applies the GRIM test, and computes the possible mean. 
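+  # (Editorial illustration, matching the documented example: mean = 3.44
+  # with n_obs = 18 and n_items = 1 gives effective_n = 18, sum = 61.92,
+  # realsum = 62, and realmean = 62 / 18, about 3.4444, which rounds back
+  # to 3.44.)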
+ sum <- mean * effective_n + realsum <- round(sum) + realmean <- realsum / effective_n + + #Checks whether mean and SD are within possible range + if (!is.null(min_val) & !is.null(max_val)) { + if (mean < min_val | mean > max_val) { + warning("The mean must be between the scale minimum and maximum") + return(FALSE) + } + sd_limits <- .sd_limits(n_obs, mean, min_val, max_val, sd_prec, n_items) + if (sd < sd_limits[1] | sd > sd_limits[2]) { + warning("Given the scale minimum and maximum, the standard deviation has to be between ", sd_limits[1], " and ", sd_limits[2], ".") + return(FALSE) + } + } + # Creates functions to round a number consistently up or down, when the last digit is 5 + round_down <- function(number, decimals = 2) { + to_round <- number * 10^(decimals + 1) - floor(number * 10^(decimals)) * 10 + number_rounded <- ifelse(to_round == 5, + floor(number * 10^decimals) / 10^decimals, + round(number, digits = decimals)) + return(number_rounded) + } + + round_up <- function(number, decimals = 2) { + to_round <- number * 10^(decimals + 1) - floor(number * 10^(decimals)) * 10 + number_rounded <- ifelse(to_round == 5, + ceiling(number * 10^decimals) / 10^decimals, + round(number, digits = decimals)) + return(number_rounded) + } + + # Applies the GRIM test, to see whether the reconstituted mean is the same as the reported mean (with both down and up rounding) + + consistent_down <- round_down(number = realmean, decimals = m_prec) == mean + consistent_up <- round_up(number = realmean, decimals = m_prec) == mean + + if (!consistent_down & !consistent_up) { + warning("GRIM inconsistent - so GRIMMER test cannot be run. See ?GRIM_test") + return(FALSE) + } + + # Computes the lower and upper bounds for the sd. + + Lsigma <- ifelse(sd < 5 / (10^(sd_prec+1)), 0, sd - 5 / (10^(sd_prec+1))) + Usigma <- sd + 5 / (10^(sd_prec+1)) + + # Computes the lower and upper bounds for the sum of squares of items. + + lower_bound <- ((n_obs - 1) * Lsigma^2 + n_obs * realmean^2)*n_items^2 + upper_bound <- ((n_obs - 1) * Usigma^2 + n_obs * realmean^2)*n_items^2 + + # Checks that there is at least an integer between the lower and upperbound + + if (ceiling(lower_bound) > floor(upper_bound)) { + return(FALSE) + } + + # Takes a vector of all the integers between the lowerbound and upperbound + + possible_integers <- ceiling(lower_bound):floor(upper_bound) + + # Creates the predicted variance and sd + + Predicted_Variance <- (possible_integers/n_items^2 - n_obs * realmean^2) / (n_obs - 1) + Predicted_SD <- sqrt(Predicted_Variance) + + # Computes whether one Predicted_SD matches the SD (trying to round both down and up) + + Rounded_SD_down <- round_down(Predicted_SD, sd_prec) + Rounded_SD_up <- round_up(Predicted_SD, sd_prec) + + Matches_SD <- Rounded_SD_down == sd | Rounded_SD_up == sd + + if (!any(Matches_SD)) { + return(FALSE) + } + + # Computes whether there is an integer of the correct oddness between the lower and upper bounds. + oddness <- realsum %% 2 + Matches_Oddness <- possible_integers %% 2 == oddness + return(any(Matches_SD & Matches_Oddness)) + + return(TRUE) +} + diff --git a/special-scripts/grimmer-rsprite2-original/plot-functions.R b/special-scripts/grimmer-rsprite2-original/plot-functions.R new file mode 100644 index 00000000..e4093400 --- /dev/null +++ b/special-scripts/grimmer-rsprite2-original/plot-functions.R @@ -0,0 +1,109 @@ +#' Plot distributions +#' +#' This plots distributions identified by \code{\link{find_possible_distributions}} using ggplot2. 
+#' They can be shown as histograms or as \href{https://towardsdatascience.com/what-why-and-how-to-read-empirical-cdf-123e2b922480}{cumulative distribution (ECDF) plots}. The latter give
+#' more information, yet not all audiences are familiar with them.
+#'
+#' @param distributions Tibble with a column `distribution` and an identifier (`id`), typically as returned from \code{\link{find_possible_distributions}}.
+#' @param plot_type Should multiple histograms, overlapping cumulative distribution plots, or density plots be shown? "auto" plots histograms if no more than 9 distributions are passed or if there are fewer than 10 discrete values, and empirical cumulative distribution plots otherwise.
+#' @param max_plots How many distributions should *at most* be plotted? If more are passed, a random sample of this size is plotted.
+#' @param show_ids Should the ids of the distributions be shown in ecdf and density charts? Defaults to `FALSE`, since the default ids are not meaningful.
+#' @param facets Should distributions be shown in one chart or in multiple small charts? Only considered for ecdf and density charts; histograms are always shown in facets.
+#'
+#' @return A ggplot2 object that can be styled with functions such as \code{\link[ggplot2]{labs}} or \code{\link[ggplot2]{theme_linedraw}}.
+
+#' @examples
+#' sprite_parameters <- set_parameters(mean = 2.2, sd = 1.3, n_obs = 20,
+#'                                     min_val = 1, max_val = 5)
+#'
+#' poss <- find_possible_distributions(sprite_parameters, 5, seed = 1234)
+#'
+#' # All distributions in same plot
+#' plot_distributions(poss, plot_type = "ecdf")
+#'
+#' # Separate plot for each distribution
+#' plot_distributions(poss, plot_type = "ecdf", facets = TRUE)
+#'
+#' @export
+
+plot_distributions <- function(distributions, plot_type = c("auto", "histogram", "ecdf", "density"),
+                               max_plots = 100, show_ids = FALSE, facets = NULL) {
+  .check_req_packages(c("tidyr", "ggplot2", "rlang"))
+
+  # rlang is only checked for at run time, not imported, so `.data` must be
+  # accessed via `::` here
+  .data <- rlang::.data
+
+  distribution <- id <- NULL # To avoid "no visible binding" CMD check note
+
+
+  assert_tibble(distributions)
+  assert_subset(c("id", "distribution"), names(distributions))
+  assert_choice(plot_type[1], c("auto", "histogram", "ecdf", "density"))
+  plot_type <- plot_type[1]
+
+  if (any(duplicated(distributions$id))) {
+    warning("The `id` column should not contain duplicates; ids were replaced by row numbers.")
+    distributions$id <- seq_along(distributions$id)
+  }
+  distributions$id <- factor(distributions$id)
+
+  if (nrow(distributions) > max_plots) {
+    message("Number of distributions passed exceeds max_plots parameter. ", max_plots, " will be randomly selected for plotting.")
+    distributions <- distributions[sample(seq_along(distributions$id), max_plots), ]
+  }
+
+  n_distributions <- nrow(distributions)
+  distributions_long <- tidyr::unnest_longer(distributions, distribution)
+  unique_vals <- length(unique(distributions_long$distribution))
+
+  if (inherits(distributions, "sprite_distributions")) {
+    params <- attr(distributions, "parameters")
+    scale_min <- params$min_val
+    scale_max <- params$max_val
+  } else {
+    scale_min <- min(distributions_long$distribution)
+    scale_max <- max(distributions_long$distribution)
+  }
+
+
+  if (plot_type == "auto") {
+    plot_type <- ifelse(n_distributions > 9 & unique_vals > 9, "ecdf", "histogram")
+  }
+
+  if (plot_type == "histogram") {
+    if (!is.null(facets) && !facets) {
+      warning("Histograms are always shown in separate facets; the `facets` argument was ignored.")
+    }
+    facets <- TRUE
+  } else if (is.null(facets)) {
+    facets <- FALSE
+  }
+
+  assert_logical(facets)
+
+  p <- ggplot2::ggplot(distributions_long, ggplot2::aes(x = .data$distribution)) + ggplot2::theme_light() +
+    ggplot2::scale_x_continuous(limits = c(scale_min, scale_max))
+
+
+  if (plot_type == "histogram") {
+    bins <- min(30, unique_vals)
+    p <- p + ggplot2::geom_histogram(bins = bins)
+  }
+
+  if (plot_type == "density") {
+    p <- p + ggplot2::geom_density(ggplot2::aes(color = .data$id), alpha = 5 / (5 + log(n_distributions)), show.legend = show_ids) +
+      ggplot2::labs(x = "Response", color = "id")
+  }
+
+  if (plot_type == "ecdf") {
+    p <- p + ggplot2::stat_ecdf(ggplot2::aes(color = .data$id), alpha = 5 / (5 + log(n_distributions)), show.legend = show_ids) +
+      ggplot2::labs(x = "Response", color = "id", y = "Cumulative share") +
+      ggplot2::scale_y_continuous(labels = scales::percent)
+  }
+  if (facets) {
+    p <- p +
+      ggplot2::facet_wrap(ggplot2::vars(id), nrow = ceiling(sqrt(n_distributions)))
+  }
+  p
+}
+
diff --git a/special-scripts/grimmer-rsprite2-original/rsprite2-package.R b/special-scripts/grimmer-rsprite2-original/rsprite2-package.R
new file mode 100644
index 00000000..06140fc7
--- /dev/null
+++ b/special-scripts/grimmer-rsprite2-original/rsprite2-package.R
@@ -0,0 +1,57 @@
+#' @import checkmate
+#' @importFrom Rdpack reprompt
+#' @importFrom stats runif sd
+#' @importFrom utils head
+
+## Global variables taken from rSPRITE
+# Parameters that trade off speed versus completeness.
+# maxDeltaLoopsLower controls how many times we tweak pairs of numbers before giving up hope of finding any solution;
+# it is the lower bound on a formula that includes the sample size and range.
+# maxDeltaLoopsUpper is the absolute upper bound on that formula (a sanity check, in effect).
+# maxDupLoops controls how many times we try to find another unique solution when we know that at least one exists.
+rSprite.maxDeltaLoopsLower <- 20000
+rSprite.maxDeltaLoopsUpper <- 1000000
+rSprite.maxDupLoops <- 20
+
+rSprite.dust <- 1e-12 # To account for rounding errors
+rSprite.huge <- 1e15 # Should this not be Inf?
+
+#' @keywords internal
+"_PACKAGE"
+
+# The following block is used by usethis to automatically manage
+# roxygen namespace tags. Modify with care!
+## usethis namespace: start
+## usethis namespace: end
+NULL
+
+.check_req_packages <- function(x, note = "") {
+  res <- unlist(suppressWarnings(lapply(x, requireNamespace, quietly = TRUE)))
+  if (!all(res)) {
+    if (!interactive()) {
+      stop(paste0(note, "Some required packages are not installed. 
Make sure you have + these packages: ", paste0(x[!res], collapse = ", ")), + call. = FALSE + ) + } + op <- options("warn") + on.exit(options(op)) + options(warn = 1) + warning(paste0(note, "The following packages are required for this function but + cannot be loaded: ", paste0(x[!res], collapse = ", ")), + call. = FALSE + ) + choice <- readline(prompt = "Should I try to install these packages? (Y/N)") + if (choice %in% c("Y", "y")) { + utils::install.packages(x[!res]) + res <- unlist(suppressWarnings(lapply(x, requireNamespace, quietly = TRUE))) + if (!all(res)) { + stop("Not all packages could be installed successfully. The following could still not be loaded: ", paste0(x[!res], collapse = ", "), + call. = FALSE + ) + } + return(TRUE) + } + stop("Cannot proceed without these packages.", call. = FALSE) + } +} diff --git a/special-scripts/replace-in-file.R b/special-scripts/replace-in-file.R new file mode 100644 index 00000000..5477a3da --- /dev/null +++ b/special-scripts/replace-in-file.R @@ -0,0 +1,63 @@ + +# Helper functions used in the grimmer-replace-names.R file. As part of the +# special-scripts/ directory listed in .Rbuildignore, they are NOT part of the +# built package. See: https://r-pkgs.org/structure.html#sec-rbuildignore + +# These functions are taken from an MIT-licensed repo: +# https://github.com/lhdjung/lukas_jung_blog/blob/master/posts/replace-in-file/index.qmd + + +replace_in_file <- function(path, + pattern, + replacement, + whole_word = TRUE, + ignore_case = FALSE, + fixed = FALSE, + use_bytes = FALSE) { + # Read the content of the file into a variable: + file_content_old <- readLines(path) + + if (whole_word) { + pattern <- paste0("\\b", pattern, "\\b") + } + + # Replace the word in the content: + file_content_new <- gsub( + pattern = pattern, + replacement = replacement, + x = file_content_old, + ignore.case = ignore_case, + fixed = fixed, + useBytes = use_bytes + ) + + # Write the updated content back to the file: + writeLines(file_content_new, path) +} + +# Function that goes to the file at `path` and takes `df_names` as a lookup +# table to replace the values in the `col_pattern` column by those in +# `col_replacement`: +replace_from_df <- function(path, + df_names, + col_pattern, + col_replacement, + whole_word = TRUE, + ignore_case = FALSE, + fixed = FALSE, + use_bytes = FALSE) { + if (!all(c(col_pattern, col_replacement) %in% colnames(df_names))) { + stop("`col_pattern` and `col_replacement` must be column names of `df_names`.") + } + for (i in seq_len(nrow(df_names))) { + replace_in_file( + path = path, + pattern = df_names[[col_pattern]][i], + replacement = df_names[[col_replacement]][i], + whole_word = whole_word, + ignore_case = ignore_case, + fixed = fixed, + use_bytes = use_bytes + ) + } +} diff --git a/tests/testthat/test-grim-map-seq.R b/tests/testthat/test-grim-map-seq.R index a6909926..3f938324 100644 --- a/tests/testthat/test-grim-map-seq.R +++ b/tests/testthat/test-grim-map-seq.R @@ -25,40 +25,36 @@ pigs1_exp <- tibble::tibble( "0.24", "0.24", "0.24", "0.24", "0.24", "0.24", "0.24" ), n = c( - 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 29L, 29L, - 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 24L, 24L, 24L, 24L, 24L, - 24L, 24L, 24L, 24L, 24L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, - 27L, 27L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 27L, - 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 31L, 31L, 31L, 31L, - 31L, 31L, 31L, 31L, 31L, 31L, 28L, 28L, 28L, 28L, 28L, 28L, 28L, - 28L, 28L, 28L, 20L, 21L, 22L, 23L, 24L, 26L, 27L, 28L, 29L, 30L, - 
24L, 25L, 26L, 27L, 28L, 30L, 31L, 32L, 33L, 34L, 19L, 20L, 21L, - 22L, 23L, 25L, 26L, 27L, 28L, 29L, 22L, 23L, 24L, 25L, 26L, 28L, - 29L, 30L, 31L, 32L, 24L, 25L, 26L, 27L, 28L, 30L, 31L, 32L, 33L, - 34L, 22L, 23L, 24L, 25L, 26L, 28L, 29L, 30L, 31L, 32L, 26L, 27L, - 28L, 29L, 30L, 32L, 33L, 34L, 35L, 36L, 23L, 24L, 25L, 26L, 27L, - 29L, 30L, 31L, 32L, 33L + 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 29L, 29L, 29L, 29L, 29L, + 29L, 29L, 29L, 29L, 29L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, + 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 29L, 29L, 29L, 29L, 29L, + 29L, 29L, 29L, 29L, 29L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, + 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 28L, 28L, 28L, 28L, 28L, + 28L, 28L, 28L, 28L, 28L, 20L, 21L, 22L, 23L, 24L, 26L, 27L, 28L, 29L, 30L, + 24L, 25L, 26L, 27L, 28L, 30L, 31L, 32L, 33L, 34L, 19L, 20L, 21L, 22L, 23L, + 25L, 26L, 27L, 28L, 29L, 22L, 23L, 24L, 25L, 26L, 28L, 29L, 30L, 31L, 32L, + 24L, 25L, 26L, 27L, 28L, 30L, 31L, 32L, 33L, 34L, 22L, 23L, 24L, 25L, 26L, + 28L, 29L, 30L, 31L, 32L, 26L, 27L, 28L, 29L, 30L, 32L, 33L, 34L, 35L, 36L, + 23L, 24L, 25L, 26L, 27L, 29L, 30L, 31L, 32L, 33L ), consistency = c( - FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, - FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, - FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, - FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, - TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, - FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, - FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, - FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, - FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, - TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, - TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, - FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, - TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, - FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, - FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, - FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, - TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE + FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, + FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, + TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, + FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, + TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, + TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, + FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, + FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, + FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, + FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, + FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, + FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, + FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, + FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, + TRUE, FALSE, FALSE, FALSE, TRUE ), - ratio = c( + probability = c( 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71, 
0.71, 0.71, 0.76, 0.76, 0.76, 0.76, 0.76, 0.76, 0.76, 0.76, 0.76, 0.76, 0.73, 0.73, 0.73, 0.73, 0.73, 0.73, 0.73, 0.73, 0.73, @@ -102,29 +98,23 @@ pigs2_exp <- tibble::tibble( "0.554", "0.554", "0.554", "0.554" ), n = c( - 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, - 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, - 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, - 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, - 150L, 150L, 150L, 150L, 150L, 150L, 150L, 145L, 146L, 147L, 148L, - 149L, 151L, 152L, 153L, 154L, 155L, 145L, 146L, 147L, 148L, 149L, - 151L, 152L, 153L, 154L, 155L, 145L, 146L, 147L, 148L, 149L, 151L, - 152L, 153L, 154L, 155L, 145L, 146L, 147L, 148L, 149L, 151L, 152L, - 153L, 154L, 155L, 145L, 146L, 147L, 148L, 149L, 151L, 152L, 153L, - 154L, 155L + 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, + 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, + 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, + 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 150L, 145L, 146L, + 147L, 148L, 149L, 151L, 152L, 153L, 154L, 155L, 145L, 146L, 147L, 148L, 149L, + 151L, 152L, 153L, 154L, 155L, 145L, 146L, 147L, 148L, 149L, 151L, 152L, 153L, + 154L, 155L, 145L, 146L, 147L, 148L, 149L, 151L, 152L, 153L, 154L, 155L, 145L, + 146L, 147L, 148L, 149L, 151L, 152L, 153L, 154L, 155L ), consistency = rep( c( - FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, - TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, - TRUE, FALSE + FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, + FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE ), - c( - 4L, 1L, 8L, 1L, 5L, 2L, 5L, 1L, 7L, 1L, 9L, 1L, 22L, 1L, 1L, - 1L, 4L, 1L, 18L, 1L, 6L - ) + c(4L, 1L, 8L, 1L, 5L, 2L, 5L, 1L, 7L, 1L, 9L, 1L, 22L, 1L, 1L, 1L, 4L, 1L, 18L, 1L, 6L) ), - ratio = c( + probability = c( 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85, @@ -171,33 +161,30 @@ pigs1_include_reported_exp <- tibble::tibble( "0.24", "0.24", "0.24", "0.24", "0.24" ), n = c( - 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 29L, - 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 24L, 24L, 24L, - 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 27L, 27L, 27L, 27L, 27L, - 27L, 27L, 27L, 27L, 27L, 27L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, - 29L, 29L, 29L, 29L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, - 27L, 27L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, - 28L, 28L, 28L, 28L, 28L, 28L, 28L, 28L, 28L, 28L, 28L, 20L, 21L, - 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 24L, 25L, 26L, 27L, - 28L, 29L, 30L, 31L, 32L, 33L, 34L, 19L, 20L, 21L, 22L, 23L, 24L, - 25L, 26L, 27L, 28L, 29L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, - 30L, 31L, 32L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, - 34L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 26L, - 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 23L, 24L, 25L, - 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L + 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 29L, 29L, 29L, 29L, + 29L, 29L, 29L, 29L, 29L, 29L, 29L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, + 24L, 24L, 24L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 29L, + 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 27L, 27L, 27L, 27L, 27L, 
+ 27L, 27L, 27L, 27L, 27L, 27L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, 31L, + 31L, 31L, 28L, 28L, 28L, 28L, 28L, 28L, 28L, 28L, 28L, 28L, 28L, 20L, 21L, + 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 24L, 25L, 26L, 27L, 28L, 29L, + 30L, 31L, 32L, 33L, 34L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, + 29L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 24L, 25L, 26L, + 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, + 29L, 30L, 31L, 32L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, + 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L ), consistency = rep( rep(c(FALSE, TRUE), 40), c( - 3L, 1L, 3L, 1L, 6L, 1L, 2L, 1L, 3L, 1L, 2L, 1L, 3L, 1L, 3L, - 1L, 2L, 1L, 3L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 5L, 1L, 3L, - 1L, 3L, 2L, 2L, 1L, 2L, 1L, 3L, 1L, 2L, 1L, 3L, 1L, 3L, 1L, 3L, - 1L, 3L, 1L, 5L, 1L, 3L, 2L, 5L, 1L, 1L, 1L, 4L, 1L, 1L, 1L, 3L, - 1L, 3L, 2L, 12L, 1L, 5L, 2L, 4L, 2L, 6L, 2L, 2L, 1L, 3L, 1L, - 3L, 1L + 3L, 1L, 3L, 1L, 6L, 1L, 2L, 1L, 3L, 1L, 2L, 1L, 3L, 1L, 3L, 1L, 2L, 1L, 3L, + 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 5L, 1L, 3L, 1L, 3L, 2L, 2L, 1L, 2L, 1L, + 3L, 1L, 2L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 5L, 1L, 3L, 2L, 5L, 1L, 1L, + 1L, 4L, 1L, 1L, 1L, 3L, 1L, 3L, 2L, 12L, 1L, 5L, 2L, 4L, 2L, 6L, 2L, 2L, 1L, + 3L, 1L, 3L, 1L ) ), - ratio = c( + probability = c( 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.75, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71, 0.71, 0.76, 0.76, 0.76, 0.76, 0.76, 0.76, 0.76, 0.76, 0.76, 0.76, 0.76, 0.73, 0.73, 0.73, 0.73, 0.73, 0.73, diff --git a/tests/testthat/test-grim-map-total-n.R b/tests/testthat/test-grim-map-total-n.R index 4c6e530d..ec14e8d7 100644 --- a/tests/testthat/test-grim-map-total-n.R +++ b/tests/testthat/test-grim-map-total-n.R @@ -25,23 +25,25 @@ df2_rows_1_3_expected <- tibble::tibble( "2.97", "4.42", "2.97" ), n = rep( - c(45L, 45L, 44L, 46L, 43L, 47L, 42L, 48L, 41L, 49L, 40L, 50L, - 51L, 52L, 50L, 53L, 49L, 54L, 48L, 55L, 47L, 56L, 46L, 57L), + c( + 45L, 45L, 44L, 46L, 43L, 47L, 42L, 48L, 41L, 49L, 40L, 50L, 51L, 52L, 50L, + 53L, 49L, 54L, 48L, 55L, 47L, 56L, 46L, 57L + ), 2 ), n_change = rep(c(0L, 0L, -1L, 1L, -2L, 2L, -3L, 3L, -4L, 4L, -5L, 5L), 4), consistency = rep( - c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, - TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, - TRUE, FALSE, TRUE, FALSE), - c(2L, 2L, 1L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 3L, 1L, 3L, 1L, 3L, - 3L, 3L, 2L, 3L, 1L, 3L, 1L, 5L) + c( + FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, + FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE + ), + c(2L, 2L, 1L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 3L, 1L, 3L, 1L, 3L, 3L, 3L, 2L, 3L, 1L, 3L, 1L, 5L) ), both_consistent = rep( c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE), c(2L, 2L, 6L, 2L, 16L, 2L, 18L) ), - ratio = rep( + probability = rep( c( 0.55, 0.55, 0.56, 0.54, 0.57, 0.53, 0.58, 0.52, 0.59, 0.51, 0.6, 0.5, 0.49, 0.48, 0.5, 0.47, 0.51, 0.46, 0.52, 0.45, 0.53, 0.44, 0.54, 0.43 @@ -83,7 +85,7 @@ test_that("It has correct values", { colnames_exp <- c( - "x", "n", "n_change", "consistency", "both_consistent", "ratio", "case", "dir" + "x", "n", "n_change", "consistency", "both_consistent", "probability", "case", "dir" ) diff --git a/tests/testthat/test-grim-map.R b/tests/testthat/test-grim-map.R index 9d174aa0..5ddb4ad5 100644 --- a/tests/testthat/test-grim-map.R +++ b/tests/testthat/test-grim-map.R @@ -41,10 +41,10 @@ test_that("It has the correct rounding-specific 
class", { t <- TRUE f <- FALSE -cns_exp <- c(t, f, f, f, f, t, f, t, f, f, t, f) +consistency_exp <- c(t, f, f, f, f, t, f, t, f, f, t, f) test_that("`consistency` has the correct values", { - df1_grim$consistency %>% expect_equal(cns_exp) + df1_grim$consistency %>% expect_equal(consistency_exp) }) @@ -52,14 +52,11 @@ test_that("`consistency` has the correct values", { df2 <- df1 %>% dplyr::mutate(n = n * 100) -df2_grim <- grim_map(df2, show_prob = TRUE) +df2_grim <- grim_map(df2) -test_that("`prob` is zero if `ratio` is negative", { - (df2_grim$prob == 0) %>% all() %>% expect_true() -}) - -test_that("`prob` is greater than `ratio` if `ratio` is negative", { - (df2_grim$prob > df2_grim$ratio) %>% all() %>% expect_true() +# Comparison with what `grim_ratio()` would return -- it can be negative: +test_that("`probability` is zero if `ratio` would be negative", { + (df2_grim$probability == 0) %>% all() %>% expect_true() }) @@ -80,23 +77,18 @@ df3_percent_true <- grim_map(df3, percent = TRUE, show_rec = TRUE) %>% df3_percent_false <- grim_map(df3, show_rec = TRUE) -all_percent_ratios_greater <- - (df3_percent_true$ratio > df3_percent_false$ratio) %>% - all() +percent_probabilities_greater <- + df3_percent_true$probability > df3_percent_false$probability test_that( - "The GRIM ratio is always greater with `percent = TRUE` than without it", { - all_percent_ratios_greater %>% expect_true() + "The probability of GRIM inconsistency is always greater + with `percent = TRUE` than without it", { + percent_probabilities_greater %>% all() %>% expect_true() }) df3_true_accord <- df3_percent_true %>% - dplyr::select(1, 3, 7:10) %>% - # dplyr::select( - # x, consistency, - # rec_x_upper_rounded_up, rec_x_upper_rounded_down, - # rec_x_lower_rounded_up, rec_x_lower_rounded_down - # ) %>% + dplyr::select(1, 3, 7:11) %>% dplyr::mutate(accord = dplyr::if_else( consistency, any(dplyr::near( @@ -152,21 +144,10 @@ test_that("`x` and `n` make the specified columns take on these roles", { }) -df7 <- df1 %>% - grim_map(show_prob = TRUE) - - -test_that("`show_prob` adds a `prob` column", { - expect_contains(colnames(df7), "prob") -}) - - -prob <- df7$prob -ratio_censored <- dplyr::if_else(df7$ratio < 0, 0, df7$ratio) +# `df7` was omitted -test_that("the probability of GRIM-inconsisteny is equal to the - non-negative GRIM ratio", { - prob %>% expect_equal(ratio_censored) +test_that("a `probability` column is naturally present", { + df1_grim %>% colnames() %>% expect_contains("probability") }) diff --git a/tests/testthat/test-grim-stats.R b/tests/testthat/test-grim-stats.R index 823d495b..b5f788ff 100644 --- a/tests/testthat/test-grim-stats.R +++ b/tests/testthat/test-grim-stats.R @@ -12,28 +12,31 @@ test_that("`grim_total()` works with percentage conversion", { exp1 <- seq(from = 0.8, to = 0.7, by = -0.01) -exp2 <- seq(from = 0.9955, to = 0.9945, by = -0.0001) +exp2 <- c( + 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, + 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 +) +exp3 <- seq(from = 0.9955, to = 0.9945, by = -0.0001) test_that("`grim_ratio()` returns correct values", { grim_ratio("5.30", 20:30) %>% expect_equal(exp1) }) - -test_that("`grim_ratio()` works with percentage conversion", { - grim_ratio("87.50", 45:55, percent = TRUE) %>% expect_equal(exp2) +test_that("`grim_probability()` returns correct values", { + grim_probability("5.30", 90:110) %>% expect_equal(exp2) }) -test_that("`grim_ratio_upper()` returns correct values", { - grim_ratio_upper(0.1) %>% expect_equal(0.9) - grim_ratio_upper(0.11) %>% 
expect_equal(0.99)
-  grim_ratio_upper(0.111) %>% expect_equal(0.999)
+test_that("`grim_ratio()` works with percentage conversion", {
+  grim_ratio("87.50", 45:55, percent = TRUE) %>% expect_equal(exp3)
 })
 
-test_that("`grim_ratio_upper()` works with percentage conversion", {
-  grim_ratio_upper(0.1, percent = TRUE) %>% expect_equal(0.999)
-  grim_ratio_upper(0.11, percent = TRUE) %>% expect_equal(0.9999)
-  grim_ratio_upper(0.111, percent = TRUE) %>% expect_equal(0.99999)
+# Errors ------------------------------------------------------------------
+
+test_that("all functions error if `x` is not a string", {
+  grim_probability(5.30, 20) %>% expect_error()
+  grim_ratio(5.30, 20) %>% expect_error()
+  grim_total(5.30, 20:30) %>% expect_error()
 })
diff --git a/tests/testthat/test-grimmer-map.R b/tests/testthat/test-grimmer-map.R
index 647f394f..db1192cd 100644
--- a/tests/testthat/test-grimmer-map.R
+++ b/tests/testthat/test-grimmer-map.R
@@ -17,15 +17,11 @@ pigs5_exp <- tibble::tibble(
     "4.18", "2.18", "6.43"
   ),
   n = c(38, 31, 35, 30, 33, 34, 35, 32, 33, 37, 31, 34),
-  consistency = c(
-    FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE,
-    TRUE, TRUE, TRUE
-  ),
+  consistency = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE),
   reason = c(
     "GRIM inconsistent", "Passed all", "GRIMMER inconsistent (test 3)",
-    "GRIMMER inconsistent (test 3)", "GRIM inconsistent", "Passed all",
-    "GRIM inconsistent", "Passed all", "GRIM inconsistent", "Passed all",
-    "Passed all", "Passed all"
+    "Passed all", "GRIM inconsistent", "Passed all", "GRIM inconsistent",
+    "Passed all", "GRIM inconsistent", "Passed all", "Passed all", "Passed all"
   ),
 ) %>%
   structure(
diff --git a/tests/testthat/test-grimmer.R b/tests/testthat/test-grimmer.R
index d3b86bb2..79275675 100644
--- a/tests/testthat/test-grimmer.R
+++ b/tests/testthat/test-grimmer.R
@@ -89,6 +89,128 @@ aGrimmer <- function(n, mean, SD, decimals_mean = 2, decimals_SD = 2){
 }
 
 
+# Modified function from rsprite2 ------------------------------------------
+
+# (Note the few changes I made here, explained at the appropriate places within
+# the function in comments starting with "IN SCRUTINY".)
+
+GRIMMER_test <- function(mean, sd, n_obs, m_prec = NULL, sd_prec = NULL, n_items = 1, min_val = NULL, max_val = NULL) {
+  if (is.null(m_prec)) {
+    m_prec <- max(nchar(sub("^[0-9]*", "", mean)) - 1, 0)
+  }
+
+  if (is.null(sd_prec)) {
+    sd_prec <- max(nchar(sub("^[0-9]*", "", sd)) - 1, 0)
+  }
+
+  # IN SCRUTINY: removed calls to functions from the checkmate package --
+  # scrutiny shouldn't depend on it, and they only checked input formats that
+  # were a given here anyway.
+
+  effective_n <- n_obs * n_items
+
+  # Applies the GRIM test and computes the possible mean.
+  sum <- mean * effective_n
+  realsum <- round(sum)
+  realmean <- realsum / effective_n
+
+  # Checks whether the mean and SD are within the possible range
+  if (!is.null(min_val) & !is.null(max_val)) {
+    if (mean < min_val | mean > max_val) {
+      warning("The mean must be between the scale minimum and maximum")
+      return(FALSE)
+    }
+    sd_limits <- .sd_limits(n_obs, mean, min_val, max_val, sd_prec, n_items)
+    if (sd < sd_limits[1] | sd > sd_limits[2]) {
+      warning("Given the scale minimum and maximum, the standard deviation has to be between ", sd_limits[1], " and ", sd_limits[2], ".")
+      return(FALSE)
+    }
+  }
+  # Creates functions to round a number consistently up or down when the last digit is 5
+  round_down <- function(number, decimals = 2) {
+    to_round <- number * 10^(decimals + 1) - floor(number * 10^(decimals)) * 10
+    number_rounded <- ifelse(to_round == 5,
+                             floor(number * 10^decimals) / 10^decimals,
+                             round(number, digits = decimals))
+    return(number_rounded)
+  }
+
+  round_up <- function(number, decimals = 2) {
+    to_round <- number * 10^(decimals + 1) - floor(number * 10^(decimals)) * 10
+    number_rounded <- ifelse(to_round == 5,
+                             ceiling(number * 10^decimals) / 10^decimals,
+                             round(number, digits = decimals))
+    return(number_rounded)
+  }
+
+  # Applies the GRIM test to see whether the reconstituted mean is the same as the reported mean (with both down and up rounding)
+
+  consistent_down <- round_down(number = realmean, decimals = m_prec) == mean
+  consistent_up <- round_up(number = realmean, decimals = m_prec) == mean
+
+  if (!consistent_down & !consistent_up) {
+    # IN SCRUTINY: commented out the warning below (and turned it into a
+    # message) that was thrown when the inputs were GRIM-inconsistent. I think
+    # one inconsistency should (essentially) be treated like any other, and
+    # such a warning is not desirable when testing. It can be commented back
+    # in to check when this function thinks the inputs are GRIM-inconsistent!
+
+    # message("GRIM inconsistent - so GRIMMER test cannot be run. See ?GRIM_test")
+    return(FALSE)
+  }
+
+  # Computes the lower and upper bounds for the SD.
+
+  Lsigma <- ifelse(sd < 5 / (10^(sd_prec+1)), 0, sd - 5 / (10^(sd_prec+1)))
+  Usigma <- sd + 5 / (10^(sd_prec+1))
+
+  # Computes the lower and upper bounds for the sum of squares of items.
+
+  lower_bound <- ((n_obs - 1) * Lsigma^2 + n_obs * realmean^2)*n_items^2
+  upper_bound <- ((n_obs - 1) * Usigma^2 + n_obs * realmean^2)*n_items^2
+
+  # Checks that there is at least one integer between the lower and upper bounds
+
+  if (ceiling(lower_bound) > floor(upper_bound)) {
+    # # IN SCRUTINY: added message
+    # message("Failed test 1")
+    return(FALSE)
+  }
+
+  # Takes a vector of all the integers between the lower and upper bounds
+
+  possible_integers <- ceiling(lower_bound):floor(upper_bound)
+
+  # Creates the predicted variance and SD
+
+  Predicted_Variance <- (possible_integers/n_items^2 - n_obs * realmean^2) / (n_obs - 1)
+  Predicted_SD <- sqrt(Predicted_Variance)
+
+  # Computes whether one Predicted_SD matches the SD (trying to round both down and up)
+
+  Rounded_SD_down <- round_down(Predicted_SD, sd_prec)
+  Rounded_SD_up <- round_up(Predicted_SD, sd_prec)
+
+  Matches_SD <- Rounded_SD_down == sd | Rounded_SD_up == sd
+
+  if (!any(Matches_SD)) {
+    # # IN SCRUTINY: added message
+    # message("Failed test 2")
+    return(FALSE)
+  }
+
+  # Computes whether there is an integer of the correct oddness between the lower and upper bounds.
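+  # (Any integer x has the same parity as x^2, so with single-item integer
+  # data, the sum of squares must be even exactly when the sum is. This is
+  # GRIMMER's "test 3".)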
+ oddness <- realsum %% 2 + Matches_Oddness <- possible_integers %% 2 == oddness + + if (!any(Matches_SD & Matches_Oddness)) { + # # IN SCRUTINY: added message + # message("Failed test 3") + return(FALSE) + } + + return(TRUE) +} @@ -147,8 +269,18 @@ df1 <- df1 %>% # Apply both functions, modified and original, to data frames containing the # same data but (possibly) different column names: + +start1 <- Sys.time() out1 <- purrr::pmap_lgl(df1, grimmer_scalar) +end1 <- Sys.time() +diff1 <- difftime(end1, start1, units = "secs") +message("\nApplying `grimmer_scalar()` took:\n", round(diff1, 2), " seconds\n") + +start2 <- Sys.time() out2 <- purrr::pmap_chr(df2, aGrimmer) +end2 <- Sys.time() +diff2 <- difftime(end2, start2, units = "secs") +message("Applying `aGrimmer()` took:\n", round(diff2, 2), " seconds\n") # Convert the original function's string output to logical so that it will be # comparable to scrutiny-style Boolean output: @@ -171,11 +303,14 @@ disagree_rate <- nrow(df_disagree) / nrow(df_out) disagree_rate +message("The rate of disagreement between implementations is ", round(disagree_rate, 2)) + df_disagree_out1_true <- df_disagree %>% dplyr::filter(out1) - +# Proportion of cases within the disagreements where `grimmer_scalar()` thinks +# the inputs are consistent but `aGrimmer()` thinks they are not: disagree_new_impl_true_rate <- nrow(df_disagree_out1_true) / nrow(df_disagree) disagree_new_impl_true_rate @@ -195,3 +330,92 @@ test_that("the two functions disagree on less than 3 percent of cases", { }) + +# Resolve disagreements --------------------------------------------------- + +# TODO: resolve disagreements between the implementations! Here are the only +# disagreements that occurred in hundreds of thousands of simulated test cases: +# (Note that `out1` is the result of `grimmer_scalar()`, and `out2` of +# `aGrimmer()`. Most important is that `n` is always 40 or 80!) +c(n = "40", x = "16.03", sd = "6.41", out1 = "TRUE", out2 = "FALSE", digits_sd = "2") +c(n = "40", x = "64.73", sd = "25.89", out1 = "TRUE", out2 = "FALSE", digits_sd = "2") +c(n = "80", x = "64.73", sd = "25.89", out1 = "TRUE", out2 = "FALSE", digits_sd = "2") +c(n = "80", x = "32.68", sd = "13.07", out1 = "TRUE", out2 = "FALSE", digits_sd = "2") +c(n = "40", x = "64.27", sd = "25.71", out1 = "TRUE", out2 = "FALSE", digits_sd = "2") +c(n = "80", x = "16.22", sd = "6.49", out1 = "TRUE", out2 = "FALSE", digits_sd = "2") +c(n = "40", x = "256.03", sd = "102.41", out1 = "TRUE", out2 = "FALSE", digits_sd = "2") +c(n = "40", x = "519.93", sd = "207.97", out1 = "TRUE", out2 = "FALSE", digits_sd = "2") + +# # Use this to get a vector such as above: +# df_disagree %>% dplyr::slice(1) %>% unlist() %>% constructive::construct(one_liner = TRUE) + +# Here they are in tibble form. Run `GRIMMER_test()` on them and see whether +# this is all due to GRIM's 40/80 leniency! +df_disagree_all <- tibble::tibble( + n = c(40, 40, 80, 80, 40, 80, 40, 40, 40, 40, 80, 40, 40, 40), + x = c( + "16.03", "64.73", "64.73", "32.68", "64.27", "16.22", "256.03", "519.93", + "32.32", "256.03", "512.33", "512.93", "513.07", "518.93" + ), + sd = c( + "6.41", "25.89", "25.89", "13.07", "25.71", "6.49", "102.41", "207.97", + "12.93", "102.41", "204.93", "205.17", "205.23", "207.57" + ), +) %>% + dplyr::relocate(x, sd, n) %>% + dplyr::mutate(n = as.numeric(n)) + +# See if there are warnings about GRIM (!) 
when mapping `GRIMMER_test()`: +df_disagree_all %>% + dplyr::rename(mean = x, n_obs = n) %>% + dplyr::mutate(mean = as.numeric(mean), sd = as.numeric(sd)) %>% + purrr::pmap(GRIMMER_test) + + + +# New implementation from rsprite2 ---------------------------------------- + +test_that("GRIMMER works correctly by default", { + expect_true(grimmer_scalar("5.21", "1.6", 28)) + expect_false(grimmer_scalar("3.44", "2.47", 18)) +}) + +test_that("GRIMMER works correctly when compared to the rsprite2 implementation", { + grimmer_scalar("1.2", "0.3", 57) %>% expect_equal(GRIMMER_test(1.2, 0.3, 57)) + grimmer_scalar("8.3", "7.5", 103) %>% expect_equal(GRIMMER_test(8.3, 7.5, 103)) + + # Dealing with test-3 inconsistencies: + grimmer_scalar("5.23", "2.55", 35) %>% expect_equal(GRIMMER_test(5.23, 2.55, 35)) + grimmer_scalar("5.23", "2.55", 127) %>% expect_equal(GRIMMER_test(5.23, 2.55, 127)) + grimmer_scalar("5.2" , "2.5" , 35) %>% expect_equal(GRIMMER_test(5.2 , 2.5 , 35)) + + # This value set is from `pigs5`. It used to be flagged as a test-3 + # inconsistency by `grimmer_scalar()`, but it is consistent according to both + # the new version and rsprite2: + grimmer_scalar("2.57", "2.57", 30) %>% expect_equal(GRIMMER_test(2.57, 2.57, 30)) + + # Some finer variations: + grimmer_scalar("3.756", "4.485", 89) %>% expect_equal(GRIMMER_test(3.756, 4.485, 89)) + grimmer_scalar("3.756", "4.485", 12) %>% expect_equal(GRIMMER_test(3.756, 4.485, 12)) + grimmer_scalar("3.75", "4.48", 12) %>% expect_equal(GRIMMER_test(3.75, 4.48, 12)) + grimmer_scalar("3.75", "4.48", 89) %>% expect_equal(GRIMMER_test(3.75, 4.48, 89)) +}) + +test_that("GRIMMER works correctly with `items = 2`", { + grimmer_scalar("5.21", "1.60", 28, items = 2) %>% expect_equal(GRIMMER_test(5.21, 1.6 , 28, n_items = 2)) + grimmer_scalar("3.44", "2.47", 18, items = 2) %>% expect_equal(GRIMMER_test(3.44, 2.47, 18, n_items = 2)) +}) + +test_that("GRIMMER works correctly with `items = 3`", { + grimmer_scalar("5.21", "1.60", 28, items = 3) %>% expect_equal(GRIMMER_test(5.21, 1.6 , 28, n_items = 3)) + grimmer_scalar("3.44", "2.47", 18, items = 3) %>% expect_equal(GRIMMER_test(3.44, 2.47, 18, n_items = 3)) +}) + + + +# test_that("sd_bounds_measure works", { +# expect_equal(c(.45, 3.03), sd_bounds_measure(n = 5, x = 4.2, min_val = 1, max_val = 7, sd_prec = 2)) +# expect_equal(c(.27, 3.03), sd_bounds_measure(n = 5, x = 4.2, min_val = 1, max_val = 7, sd_prec = 2, items = 2)) +# expect_equal(c(0, 0), sd_bounds_measure(n = 100, x = 1, min_val = 1, max_val = 7)) +# }) + diff --git a/tests/testthat/test-method-audit-seq.R b/tests/testthat/test-method-audit-seq.R index 34ad2799..0e523716 100644 --- a/tests/testthat/test-method-audit-seq.R +++ b/tests/testthat/test-method-audit-seq.R @@ -20,21 +20,21 @@ grim_exp <- tibble::tibble( grimmer_exp <- tibble::tibble( x = c("7.22", "5.23"), sd = c("5.30", "2.55"), - n = c(38, 35), + n = c(38L, 35L), consistency = c(FALSE, FALSE), - hits_total = c(8L, 16L), + hits_total = c(8L, 15L), hits_x = c(4L, 2L), hits_sd = c(0L, 10L), - hits_n = c(4L, 4L), - diff_x = c(1, 3), - diff_x_up = c(2, 3), - diff_x_down = c(-1, -3), - diff_sd = c(NA, 1), - diff_sd_up = c(NA, 1), - diff_sd_down = c(NA, -1), - diff_n = c(1, 4), - diff_n_up = c(2, 4), - diff_n_down = c(-1, -4), + hits_n = 4:3, + diff_x = c(1L, 3L), + diff_x_up = 2:3, + diff_x_down = c(-1L, -3L), + diff_sd = c(NA, 1L), + diff_sd_up = c(NA, 1L), + diff_sd_down = c(NA, -1L), + diff_n = c(1L, 4L), + diff_n_up = c(2L, 4L), + diff_n_down = c(-1L, -4L), ) %>% structure(class = 
c("scr_audit_seq", "tbl_df", "tbl", "data.frame")) diff --git a/tests/testthat/test-method-audit-total-n.R b/tests/testthat/test-method-audit-total-n.R index 1f4ade45..ef3b761f 100644 --- a/tests/testthat/test-method-audit-total-n.R +++ b/tests/testthat/test-method-audit-total-n.R @@ -32,13 +32,13 @@ df1_grim_exp <- tibble::tibble( df1_grimmer_exp <- tibble::tibble( term = c("hits_total", "hits_forth", "hits_back", "scenarios_total", "hit_rate"), - mean = c(1, 0.5, 0.5, 12, 0.0833333333333333287074), - sd = c(0, 0.7071067811865475727373, 0.7071067811865475727373, 0, 0), - median = c(1, 0.5, 0.5, 12, 0.0833333333333333287074), - min = c(1, 0, 0, 12, 0.0833333333333333287074), - max = c(1, 1, 1, 12, 0.0833333333333333287074), + mean = c(0, 0, 0, 12, 0), + sd = numeric(5), + median = c(0, 0, 0, 12, 0), + min = c(0, 0, 0, 12, 0), + max = c(0, 0, 0, 12, 0), na_count = numeric(5), - na_rate = numeric(5) + na_rate = numeric(5), ) df2_debit_exp <- tibble::tibble( diff --git a/tests/testthat/test-seq-predicates.R b/tests/testthat/test-seq-predicates.R index 23fa3851..12cc6312 100644 --- a/tests/testthat/test-seq-predicates.R +++ b/tests/testthat/test-seq-predicates.R @@ -138,8 +138,9 @@ test_that("`is_seq_dispersed()` passes its special tests, returning `TRUE`", { test_that("`is_seq_dispersed()` passes its special tests, returning `NA`", { - c(NA, 3:7, NA) %>% is_seq_dispersed(from = 5, test_linear = f) %>% is.na() %>% expect_true() + c(NA, 3:7, NA) %>% is_seq_dispersed(from = 5, test_linear = f) %>% is.na() %>% expect_true() c(NA, NA, 3:7, NA, NA) %>% is_seq_dispersed(from = 5, test_linear = f) %>% is.na() %>% expect_true() + c(NA, 3:6, NA, NA) %>% is_seq_dispersed(from = 5, test_linear = f) %>% is.na() %>% expect_true() }) diff --git a/tests/testthat/test-utils.R b/tests/testthat/test-utils.R index 06d94d67..f207cdd6 100644 --- a/tests/testthat/test-utils.R +++ b/tests/testthat/test-utils.R @@ -91,20 +91,8 @@ test_that("`parcel_nth_elements()` returns correct values", { df_test <- tibble::tibble( a = c(1, 2, 3, 1, 2, 3), - b = c(1, 2, 3, 1, 2, 3) -) %>% - remove_equivalent_rows() - -df_expected_1 <- tibble::tibble( - a = 1:3, - b = 1:3 ) -test_that("`remove_equivalent_rows()` returns correct values", { - df_test %>% expect_equal(df_expected_1) -}) - - df_test_2 <- tibble::tibble( a = 1, diff --git a/vignettes/debit.Rmd b/vignettes/debit.Rmd index e808eb00..4464573c 100644 --- a/vignettes/debit.Rmd +++ b/vignettes/debit.Rmd @@ -23,7 +23,7 @@ knitr::opts_chunk$set( library(scrutiny) ``` -The descriptive binary test, or DEBIT, checks whether the reported mean, standard deviation (SD), and sample size of binary data are mutually consistent [@heathers_debit_2019]. Like GRIM, it tests if a given set of summary data can describe the same distribution. +The descriptive binary test, or DEBIT, checks whether the reported mean, sample standard deviation (SD), and sample size of binary data are mutually consistent [@heathers_debit_2019]. Like GRIM, it tests if a given set of summary data can describe the same distribution. This vignette covers scrutiny's implementation of DEBIT: @@ -42,6 +42,8 @@ debit(x = "0.35", sd = "0.18", n = 20) As in `grim()`, the mean needs to be a string. (The same is true for the SD.) That is because strings preserve trailing zeros, which can be crucial for DEBIT. Numeric values don't, and even converting them to strings won't help. A workaround for larger numbers of such values, `restore_zeros()`, is discussed in the *Data wrangling* vignette. 
+Note that the SD is always assumed to be the sample SD, not the population SD. In some rare cases, this might explain apparent inconsistencies.
+
 `debit()` has some further arguments, but all of them can be used from within `debit_map()`. Since `debit_map()` is the more useful function in practice, the other arguments will be discussed in that context.
 
 ## Testing multiple cases
diff --git a/vignettes/grim.Rmd b/vignettes/grim.Rmd
index 1393a728..1a5de188 100644
--- a/vignettes/grim.Rmd
+++ b/vignettes/grim.Rmd
@@ -20,7 +20,6 @@ knitr::opts_chunk$set(
 pkgload::load_all()
 ```
 
-
 ```{r setup, message=FALSE}
 library(scrutiny)
 ```
@@ -37,7 +36,7 @@ This vignette covers scrutiny's implementation of the GRIM test. It has the foll
 
 4. The visualization function `grim_plot()`.
 
-5. Statistical benchmarks, such as granularity and the GRIM ratio.
+5. Statistical benchmarks, such as granularity and the probability of GRIM inconsistency.
 
 ## Basic GRIM testing
 
@@ -81,7 +80,7 @@ Now, simply run `grim_map()` on that data frame:
 grim_map(flying_pigs1)
 ```
 
-The `x` and `n` columns are the same as in the input. By default, the number of `items` composing the mean is assumed to be 1. The main result, `consistency`, is the GRIM consistency of the former three columns. On the `ratio` column, see section *The GRIM ratio*.
+The `x` and `n` columns are the same as in the input. By default, the number of `items` composing the mean is assumed to be 1. The main result, `consistency`, is the GRIM consistency of the former three columns. On the `probability` column, see section *The probability of GRIM inconsistency*.
 
 ### Scale items
 
@@ -196,11 +195,11 @@ These columns are ---
 
 3. `incons_rate`: proportion of GRIM-inconsistent value sets.
 
-4. `mean_grim_ratio`: average of GRIM ratios.
+4. `mean_grim_prob`: average probability of GRIM inconsistency.
 
-5. `incons_to_ratio`: ratio of `incons_rate` to `mean_ratio`.
+5. `incons_to_prob`: ratio of `incons_rate` to `mean_grim_prob`.
 
-6. `testable_cases`: number of GRIM-testable value sets (i.e., those with a positive ratio).
+6. `testable_cases`: number of GRIM-testable value sets (i.e., those with a positive probability).
 
 7. `testable_rate`: proportion of GRIM-testable value sets.
 
@@ -239,7 +238,7 @@ With its unusual optics, this plot will probably not fit everyone's taste. The r
 
 The plot is strictly based on the laws governing GRIM. Its background raster shows all consistent (light) and inconsistent (dark) value pairs for two decimal places. Empirical values are shown in blue if consistent and red if inconsistent. Color settings and other ggplot2-typical options are available via arguments. Read about them at `grim_plot()`'s documentation.
 
-You might notice the light vertical lines at $N = 40$ and $N = 80$: Few values are flagged as inconsistent here. This reflects `grim_map()`'s charitable default of accepting values rounded either up *or* down from 5. If a different `rounding` specification is chosen in the `grim_map()` call, the plot raster will adjust automatically (although it will often be the same as before):
+You might notice the light vertical lines at $n = 40$ and $n = 80$: Few values are flagged as inconsistent here. This reflects `grim_map()`'s charitable default of accepting values rounded either up *or* down from 5. If a different `rounding` specification is chosen in the `grim_map()` call, the plot raster will adjust automatically (although it will often be the same as before):
 
 ```{r, fig.width=6, fig.height=5.5}
 jpap_5 %>%
@@ -342,63 +341,59 @@ This example only features one case --- the `df` tibble has just a single row. I
 
 ## GRIM statistics
 
-### The GRIM ratio
+### The probability of GRIM inconsistency
 
-#### Formula
+#### General description
 
-The `ratio` column in a tibble returned by `grim_map()` is the "GRIM ratio", i.e.:
+The `probability` column in a tibble returned by `grim_map()` is the probability of GRIM inconsistency, i.e.:
 
 $$
-\frac{10^D - NL}{10^D}
+P = \max\left(0, \frac{10^D - NL}{10^D}\right)
 $$
 
-where $D$ is the number of decimal places in `x` (the mean or proportion), $N$ is the sample size, and $L$ is the number of scale items. Because $N, L \geq 1$, the GRIM ratio ranges from $-\infty$ to $1 - \frac{1}{10^D}$, asymptotically approaching 1. Its upper bound will be 0.9 if $D = 1$ and 0.99 if $D = 2$, etc.
+where $D$ is the number of decimal places in $X$ (the mean or proportion), $N$ is the sample size, and $L$ is the number of scale items. The fraction can never exceed 1, and the $\max()$ function bounds it below at 0.
+
+Consider a mean $X$ that was ostensibly derived from integer data. It has $D$ decimal places, but is otherwise random: the integer part is irrelevant in any case, and the exact digits that occupy the $D$ decimal places are ignored. $P$, then, is the probability that $X$ is GRIM-inconsistent. Naturally, $P$ is also the proportion of inconsistent value sets with $D$ decimal places, a sample size of $N$, and $L$ scale items.
+
+In real-world scenarios, one would not usually assume $X$ to be random at the outset of an investigation, but this can be interesting as a contrasting assumption. If a study has many GRIM-inconsistent value sets with very high probabilities of inconsistency, it may suggest that the anomalies are not explained by small deviations from true values (although `grim_map_seq()` is more informative on this point). Rather, such a finding could raise concerns that these statistics did not come about as part of a regular research process.
 
 #### Functions
 
-`grim_ratio()` takes the arguments `x`, `n`, `items`, and `percent` as in `grim()` and `grim_map()`:
+`grim_probability()` takes the arguments `x`, `n`, `items`, and `percent` as in `grim()` and `grim_map()`. As before, `x` must be a string to capture any trailing zeros:
 
 ```{r}
-grim_ratio(x = 1.42, n = 72)
+grim_probability(x = "1.40", n = 72)
 
-grim_ratio(x = 5.93, n = 80, items = 3)
+grim_probability(x = "5.93", n = 80, items = 3)
 
-# Enter `x` as a string to preserve trailing zeros:
-grim_ratio(x = "84.20", n = 40, percent = TRUE)
-
-# Upper bounds:
-grim_ratio_upper(x = 1.42)
-grim_ratio_upper(x = "84.20", percent = TRUE)
+# Percentage conversion:
+grim_probability(x = "84.27", n = 40, percent = TRUE)
 ```
 
-In addition, `grim_total()` takes the same arguments but returns only the numerator of the above formula:
+`grim_map()` displays a `probability` column that shows the probability of GRIM inconsistency. It is derived using `grim_probability()`.
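+
+As a quick cross-check, the formula can be computed by hand and compared with `grim_probability()` (a sketch with arbitrary inputs):
+
+```{r}
+# By hand: D = 2 decimal places, N = 72, L = 1 item
+max(0, (10^2 - 72 * 1) / 10^2)
+
+# Via the function:
+grim_probability(x = "1.40", n = 72)
+```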
+
+In addition, `grim_total()` takes the same arguments but returns only the numerator of the fraction in the above formula:
 
 ```{r}
-grim_total(x = 1.42, n = 72)
+grim_total(x = "1.40", n = 72)
 
-grim_total(x = 5.93, n = 80, items = 3)
+grim_total(x = "5.93", n = 80, items = 3)
 
-grim_total(x = "84.20", n = 40, percent = TRUE) # Enter `x` as string to preserve trailing zero
+grim_total(x = "84.27", n = 40, percent = TRUE) # Percentage conversion
 ```
 
-If `grim_map()`'s `prob` argument is set to `TRUE`, it adds a `prob` column that shows the probability of GRIM inconsistency. `prob` is derived from left-censoring the `ratio` column at 0, so it is equal to `ratio` if and only if $0 \leq ratio$. If $ratio < 0$, then $prob = 0$. (The GRIM ratio cannot be 1 or greater.)
-
-#### Interpretation
-
-If the GRIM ratio is non-negative, it can be interpreted as the proportion of inconsistent value sets corresponding to a given set of parameters. This is also the probability that a randomly chosen mean is GRIM-inconsistent. If the ratio is negative, the probability is 0.
-
-Similarly, if the `grim_total()` value is non-negative, it can be interpreted as the total number of GRIM inconsistencies corresponding to a given set of parameters. If it is negative, that total is 0.
+The result is the total number of GRIM-inconsistent value sets with the given parameters. However, this is generally less useful than `grim_probability()`: the result is only comparable across different numbers of decimal places when normalized by $10^D$.
 
 #### Origins
 
-Although the term "GRIM ratio" is new, the formula is arguably implicit in Brown and Heathers' (2017) paper on GRIM. The numerator is a transformation of the formula presented on p. 364, and the authors discuss a common special case of the ratio (interpreted as a proportion) on p. 367:
+The formula for the probability of GRIM inconsistency is arguably implicit in Brown and Heathers' (2017) paper on GRIM. The numerator is a transformation of the formula presented on p. 364, and the authors discuss a common special case of the probability (interpreted as a proportion) on p. 367:
 
 > With reporting to two decimal places, for a sample size $N < 100$ [and a single item], a random mean value will be consistent in approximately $N$% of cases.
 
 Assuming $N = 70$ and inserting all of these values into the above formula returns
 
 $$
-\frac{10^2-70×1}{10^2} = 0.3
+\max\left(0, \frac{10^2 - 70 \times 1}{10^2}\right) = 0.3
 $$
 
 so that a random mean will be inconsistent in about 30% of cases and, conversely, consistent in about 70%.
@@ -406,11 +401,9 @@ so that a random mean will be inconsistent in about 30% of cases and, conversely,
 Here is the same in code (assuming an arbitrary mean with two decimal places):
 
 ```{r}
-grim_ratio(x = 0.99, n = 70)
+grim_probability(x = "0.99", n = 70)
 ```
 
-Thus, all I did regarding the GRIM ratio was to make the general formula explicit and give it a name. Researchers may judge for themselves how useful it is for further analyses.
-
 ### Granularity and scale items
 
 The granularity of a non-continuous distribution is the minimal amount by which two means or proportions of the distribution can differ. It is derived from the sample size and the number of scale items. The number of items, in turn, naturally follows from the distribution's sample size and granularity.
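+
+For a sample of size $N$ with $L$ items, the granularity is $\frac{1}{NL}$. Here is a brief sketch; it assumes `grim_granularity()` with its usual `n` and `items` arguments:
+
+```{r}
+grim_granularity(n = 80, items = 1)
+
+# The same value by hand:
+1 / (80 * 1)
+```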
diff --git a/vignettes/related.Rmd b/vignettes/related.Rmd index a7da063b..09de07a3 100644 --- a/vignettes/related.Rmd +++ b/vignettes/related.Rmd @@ -32,6 +32,8 @@ Please contact me if you know about relevant software that isn't listed here (em - [jfa](https://koenderks.github.io/jfa/index.html) by Koen Derks offers a full statistical auditing suite (including Benford analysis). +- The Rust crate [SeaCanal](https://github.com/saghm/sea-canal#how-does-seacanal-work) analyzes numeric sequences, uncovering patterns of operations that might have generated them. + - Emerging from the Pruitt investigations, there is now R software for analyzing sequences: - The package [twopointzerothree](https://github.com/Sorbus-torminalis/twopointzerothree/) (by an anonymous developer) checks data for sequences of perfectly correlated numbers. These numbers are either duplicates of each other or they are duplicates offset by some constant amount; hence the name. diff --git a/vignettes/rounding-in-depth.Rmd b/vignettes/rounding-in-depth.Rmd index b2f4b209..1abe0f17 100644 --- a/vignettes/rounding-in-depth.Rmd +++ b/vignettes/rounding-in-depth.Rmd @@ -142,7 +142,7 @@ up2 <- round_up(vec2) down2 <- round_down(vec2) even2 <- round(vec2) -vec1 +vec2 up2 down2 @@ -167,7 +167,7 @@ mean(round(vec3, 1)) Sometimes `round()` behaves just as it should, but at other times, results can be hard to explain. Martin Mächler, who wrote the present version of `round()`, [describes the issue](https://rpubs.com/maechler/Rounding) about as follows: -The reason for the above behavior is that most decimal fractions can't, in fact, be represented as double precision numbers. Even seemingly "clean" numbers with only a few decimal places come with a long invisible mantissa, and are therefore closer to one side or the other. +The reason for the above behavior is that most decimal fractions can't, in fact, be represented as double precision numbers. Even seemingly "clean" numbers with only a few decimal places come with a long invisible mantissa, and are therefore closer to one side or the other. (To see this clearly, try entering a few decimal numbers on [Float Exposed](https://float.exposed).) We usually think that rounding rules are all about breaking a tie that occurs at 5. Most floating-point numbers, however, are just somewhat less than or greater than 5. There is no tie! Consequently, Mächler says, rounding functions need to "*measure, not guess* which of the two possible decimals is closer to `x`" --- and therefore, which way to round. diff --git a/vignettes/wrangling.Rmd b/vignettes/wrangling.Rmd index 3867b791..a1f8da43 100644 --- a/vignettes/wrangling.Rmd +++ b/vignettes/wrangling.Rmd @@ -158,18 +158,18 @@ flights2 %>% debit_map() ``` -If your strings look like `"2.65 [0.27]"`, specify the `.sep` argument as `"brackets"`. Likewise for `"2.65 {0.27}"` and `.sep = "braces"`. What about other separators, as in `"2.65 <0.27>"`? Specify `.sep` as those two substrings, like `.sep = c("<", ">")`. In all of these cases, the output will be the same as the default would be if the strings were like `"2.65 (0.27)"`. +If your strings look like `"2.65 [0.27]"`, specify the `sep` argument as `"brackets"`. Likewise for `"2.65 {0.27}"` and `sep = "braces"`. What about other separators, as in `"2.65 <0.27>"`? Specify `sep` as those two substrings, like `sep = c("<", ">")`. In all of these cases, the output will be the same as the default would be if the strings were like `"2.65 (0.27)"`. 
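+
+For instance, a sketch with angle-bracket separators (the tibble here is made up for illustration):
+
+```{r}
+tibble::tibble(stats = c("2.65 <0.27>", "3.41 <0.32>")) %>%
+  split_by_parens(sep = c("<", ">"))
+```
+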
### Column name suffixes
 
-The defaults for column name suffixes are (1) `"x"` for the part before the parentheses and (2) `"sd"` for the part inside of them. However, this won't fit for all data presented like `5.22 (0.73)`. Override the defaults by specifying `.col1` and/or `.col2`:
+The defaults for column name suffixes are (1) `"x"` for the part before the parentheses and (2) `"sd"` for the part inside of them. However, this won't fit for all data presented like `5.22 (0.73)`. Override the defaults by specifying `end1` and/or `end2`, as in the call below:
 
 ```{r}
 flights2 %>%
   split_by_parens(end1 = "beta", end2 = "se")
 ```
 
-These suffixes become column names if `.transform` is set to `TRUE`:
+These suffixes become column names if `transform` is set to `TRUE`:
 
 ```{r}
 flights2 %>%