Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

om configure-product should be enhanced to support collections #207

Closed
sadvani opened this issue Jul 19, 2018 · 29 comments
Closed

om configure-product should be enhanced to support collections #207

sadvani opened this issue Jul 19, 2018 · 29 comments

Comments

@sadvani
Copy link

sadvani commented Jul 19, 2018

Configuring or Editing collections in OpsManager via the API behaves in a non deterministic manner. Some service tiles (MySQL v2, maybe others) use collections to define Service Plans in the tile. Currently, if a collection has multiple records, and a change is being made to one collection record by passing on only the GUID of the record in question along with the change, OpsManager will delete all other collection records. This means that only the record that the Operator was trying to edit will be saved. Additionally, if an operator tries to change a collection record without specifying a GUID, OpsManager will proceed to delete all records and create a new one with a new GUID. This issue has been seen a few times with the MySQL Tile [1]. We flagged this issue with the Pcf-Pipelines and PAS teams and they suggested we open an issue in this repo [2]

A workaround for this issue is for om configure-product to send all GUIDS and plans on every PUT.

[1] vmware-archive/pcf-pipelines#332
[2] https://pivotal.slack.com/archives/C0554UNSS/p1531941382000046

@cf-gitbot
Copy link
Member

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

@mcwumbly
Copy link
Contributor

Thanks for reporting this here!

Do you know how can I reproduce this issue currently? Simply by trying to run om configure-product with the MySQL v2 tile? Does that always fail now? Or is there a bit more to it than that?

/cc @jtarchie @ljfranklin

@sadvani
Copy link
Author

sadvani commented Jul 19, 2018

I believe it does always fail. You could either use the MySQL v2 tile or the OpsManager example product. In the MySQL case, you will have to deploy the entire once and then use configure-product and deploy again. This deploy should fail because the GUIDs associated with the deployed service plans will have changed.

You can use the example tile by just using configure-product and noticing the strange behavior of records getting deleted, GUIDs changing, etc.

@mcwumbly
Copy link
Contributor

Cool, I'll assume that's the case. We'll try to prioritize this ourselves at some point if no one else gets to it first, but I'm throwing a pr-welcome on it in the meantime.

@jtarchie
Copy link
Contributor

@sadvani and @mcwumbly, are you able to review the PR from @calebwashburn. I've been able to see that the code works, but I don't have a way (or knowledge) on how reproduce this issue.

@sadvani
Copy link
Author

sadvani commented Oct 1, 2018

@mcwumbly @jtarchie @calebwashburn is this issue fixed?

@calebwashburn
Copy link

@sadvani Believe so

@fredwangwang
Copy link
Member

yes, this pr #227 has been pulled in. We validated using @calebwashburn's gist. Do you still see a problem?

@sadvani
Copy link
Author

sadvani commented Oct 1, 2018

Perfect, thanks so much @calebwashburn @fredwangwang.

Is there a specific version of OM that I should tell the customer to upgrade to?

@fredwangwang
Copy link
Member

master, there is no release cut since the fix introduced.

@jtarchie
Copy link
Contributor

@sadvani: I think we might have to revert this change. An issue #274 shows that we cannot rely on a collection having a name property to match with the guid. We don't have any guarantees with a collection that there is uniqueness on a single property of a collection. This broke other collections.

We might have to be stronger on this an enforce the any element on a collection that has to be persisted and updated (not removed and recreated) should always have a guid.

@jtarchie jtarchie reopened this Oct 16, 2018
@anEXPer
Copy link
Contributor

anEXPer commented Oct 17, 2018

@dsboulder I understand you're currently backing @sadvani up as Ops Man PM. Any thoughts on this?

@dsboulder
Copy link

I validated that if collection records as passed with a "guid" field, then those existing records will be found and updated. Otherwise we'll consider them new records (and delete any unmatched-by-guid records at the end).

@fredwangwang
Copy link
Member

@sadvani @nbconklin @mcwumbly

Configure-product with collection types

Current approach in om:

There mixed strategies in om right now, below are the two behaviors:

  1. Collection with name key in the entry:
    We do a get first to grab all the collection entries in OpsManager.
    Compare the existing collection entries with the payload entries provided.
    Backfill the guid if the name of the payload entry matches the existing entry.
    Basically treat name as the primary key if possible.
  2. Other collections:
    Does not do any modification to the payload entries, the payload will be send to OpsManager as is.

Problems with the current approach:

  1. For strategy 1, infer the guid using name key prevent the ability to change the name for the collection. We currently do not document clearly the constraint, which would cause unintended behavior if the user "accidently" rename the collection.
  2. For other collections, not decorate the collection would almost definitely introduce a problem. As the entries without guid will be treated as new entries, and existing entries in OpsManager will be deleted. This is usually destructive.

To conclude, the workaround is based on an undocumented constraint (enforcing name as primary key), and it does not work for all collections. For the collections without a name, the effect of reconfigure could potentially be very bad.

Potential solution:

Enforcing guid.

Guid is the true primary key of the collection type, OpsManager uses guid of the payload entries to determine if the configration is:

  • updating existing records or
  • creating new records (and potentially remove exisitng records)

Since guid is the source of truth, always providing guid in the payload would guarentee the meaning of the configuration, instead of (wrongly) guessing intention of the configuration.

Steps to do to enforcing guid

om cli:

  1. return guid to the user in staged-config command
  2. remove guid decoration logic in configure-product command
  3. error out when configuring product if guid is not provided for the collection

OpsManager:

Even if the action is taken on the om cli side, the api itself still allows "wrong" behavior, potential solution:

  1. disallow the removal of existing collection entry whose definition is used by something else (breaking)

operator:

manually decorate the collection entry with a guid. The guid does not have a specific format, as long as it's unique.
Sample payload:

some-collection:
- name: entry-one
  guid: entry-one # for example, provide a "meaningful" string as guid for simplicity
  prop: value1
- name: entry-two
  guid: fdr123prc8 # random string also works, as long as it's unique
  prop: value2

@kcboyle
Copy link
Contributor

kcboyle commented Feb 4, 2019

a similar issue was reported here: #315 for the appdynamics tile

@gbark13
Copy link

gbark13 commented Sep 9, 2019

Any updates here? I recently encountered what I think is this issue when upgrading the Redis Enterprise tile. I had issues with the Broker Registrar errand where it was trying to create new plans with different GUIDs but the same plan names. I had to rename existing plans, let it create the new plans alongside the existing plans, and then migrate each service to the renamed plans. Then the broker registrar errand was run manually again to delete the original plans that it had trouble with.

@kcboyle
Copy link
Contributor

kcboyle commented Sep 9, 2019

We're still discussing a way to do this. the problem with enforcing the GUID as we described above is that you could not use the same config to re-configure a new env. This is not desired behavior.

We discussed potentially doing something like this (a proposal from the story in our tracker):
can om lookup the config for a nameless collection and do a comparison to see if there are any changes (other than guid) and, if there are no changes, grab the existing guid and inject it when configure-product is run. This would be an alternative to forcibly requiring a guid in the config file for configure-product, but might not be possible. This is worth exploring to allow OpsMan to keep its contract with BOSH (guid being guaranteed unique) but allow for a good user experience that will not "change" the guid when the configuration for a collection is the same.

PROS:

  • we can do this now
  • good customer experience
  • no breaking change
    CONS:
  • potentially difficult
  • requires multiple calls to the OpsManager API before actually configuring the product (if the name field does not exist) to validate that the configuration is the same. (additional complexity for configure-product)

@gbark13
Copy link

gbark13 commented Sep 9, 2019

OM is the thing that causes the GUID issue. Manually changing within OpsMan UI does not. Is that accurate?
Therefore any PCF Platform Automation that deals with a collection could trigger this issue?

@kcboyle
Copy link
Contributor

kcboyle commented Sep 25, 2019

it is not om causing the issue, but the tile authors.

You are correct in that the UI does not have this problem, but om is limited by what the API can return. Because each tile team on Pivnet is in control of what their collections look like and optionally define a name, we have a hard time using the API to handle collections because there is nothing else we can key off of. This is not an issue with Platform Automation, but with the tile authors themselves. The optimal fix for this issue is to require every tile team to name their collections, but this is a much larger undertaking.

@mboldt
Copy link

mboldt commented Nov 4, 2019

I find it quite unfair to claim that the cause of this issue tile authors not having a name property in every collection. The root cause seems to be this limitation in the Ops Manager API:

the problem with enforcing the GUID as we described above is that you could not use the same config to re-configure a new env. This is not desired behavior.

This issue existed before the name workaround was added.

If, going forward, the expectation is that every collection has a name field:

  1. This needs to be documented in the Tile Developers Guide (at a minimum in the collection field documentation, perhaps there are other important places to include this). Could we get that on the om backlog, since this is where the requirement is coming from?

  2. This needs to be widely communicated to our partner tile authors as something that needs to be updated in their tiles. My team (Platform Engineering) can help with this once the expectations are well-documented.

In the meantime, it sounds like the workaround for customers would be to manage the guid fields, specific to each Ops Manager environment. Not a great experience, but could get them by, right?

@jtarchie
Copy link
Contributor

jtarchie commented Nov 4, 2019

@mboldt, It is issue for the tile authors and the OpsManager API.

For tile authors, because there's an experience of their tile that doesn't work for users who have request being able to duplicate across multiple foundations.
For Ops Manager, because there there's an experience with trying to extract configurations.
The fact that this is an issue and all you have not talk to each other about this is systematic problem.

The above was a proposal as we were trying to solve this from a Platform Automation standpoint.

We have talk to Ops Manager about it.
We have talk to a tile team(s?) about this.
Here is a story that reflects some of those discussions.

We have made an effort to communicate this issue to the right people.
We've reprioritized based on other features our product needs.
If you would, please, as a tile team, like to work with the OpsManager team to fix the user experience.
They way your users will be happier about using your product.

@kcboyle
Copy link
Contributor

kcboyle commented Mar 9, 2020

For those still watching/following this issue, does anyone have any other ideas around what om can do to mitigate this issue? this comes up as a team question every now and then, and we can't really think of a good solution. Happy to reopen this discussion to get some fresh ideas/insight around this problem area

@calebwashburn @jtarchie @anEXPer @sadvani @mcwumbly @dsboulder @gregbarkerdcsg

@kcboyle
Copy link
Contributor

kcboyle commented May 1, 2020

If we can't think of any good viable solutions, I'm going to see about closing this issue for now. If anyone has any ideas for us to work off of, feel free to reopen the discussion

@mrdavidlaing
Copy link

Based on the following analysis - om configure-product should be enhanced to support collections - Analysis & Recommendations

I believe we can address the majority of remaining issues with the following changes:

  1. [om cli] Use collection item hash to find GUID rather than name - this covers all cases where no change is being made
  2. [om cli] Expand the list of “logical Id” fields from name to include key or any field ending in name - this covers nearly all changes except those where the logical key field itself is being changed

@mrdavidlaing
Copy link

mrdavidlaing commented Aug 12, 2020

Once PRs #502 and #503 have been merged; using om in a GitOps workflow will associate existing GUIDs to collection items as follows:

  1. When no changes have been made; no GUIDs will be regenerated and no changes will show in om bosh-diff. ie:
    • After exporting tile config: om staged-config -p example-product -c > example-product-config.yml
    • When configuring the tile using that config (having made no changes): om configure-product -c example-product-config.yml
    • om bosh-diff will show no changes
  2. When making changes to collection items that have a logical key field (named name, key or ending in name); existing GUIDs will be associated with the changed collection item causing an edit of the existing item rather than a deletion and recreation (with a new guid). ie:
    • When configuring the tile with a example-product-config.yml that contains the following changes:
      image
    • om bosh-diff will show the following edits to existing collection items:
      image
      Note that:
      • plan_collection.name: plan_1 shows the description field being edited - the name field is being used to associate the existing GUID
      • plan_collection.name: plan_2 doesn't show up; since it isn't changing
      • plan_collection.name: plan_3 shows up as an addition
      • regex_collection.regex: (one) doesn't show up; since it isn't changing - its value has been used to associate the existing GUID
      • regex_collection.regex: (two) shows up as a delete / recreate (with new GUID) - its value has changed AND since it has no logical key there is no way to associate it with the existing GUID

An example product showing this in action is at pivotal-cf-experimental/ops-manager-example#3

Known Issues

  • GUID association does not work when changing an item that doesn't have a logical key field
  • GUID association does not work when changing the value of the logical key field
  • GUID association does not work for collections items that have a secret field AND no logical key field

@mrdavidlaing mrdavidlaing reopened this Aug 12, 2020
@cf-gitbot
Copy link
Member

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

@kcboyle
Copy link
Contributor

kcboyle commented Aug 12, 2020

This was fixed by the associated PRs.

We're happy to see this long-running issue finally addressed!

@kcboyle
Copy link
Contributor

kcboyle commented Oct 7, 2020

Another bug was discovered after the merging fix was done.
The feedback:

Our work flow ...
Upload and manually configure HW2.
 2.  Apply changes.
 3.  Download config for running tile and convert to a template for use in our TF code.
We've had a number of upgrades recently and it appears the OM cli that is bundled in the Platform automation (OM cli v6.3.0) doesn't work with the scrape job collection in the HW2 tile.
Further investigation revealed that OM cli v6.1.1 (on my workstation) works, and newer versions, both opensource and supported, do not work with the healthwatch2 config when more than 1 scrape job is present.
There was of discussion in the OM cli repo regarding collections and impacts to tile creators.

Digging into it,

  • HW has a collection of scrape jobs used by prometheus to scrape metrics from different endpoints. Similar to how -p-mysql defines plans
#edited-healthwatch-config
.properties.scrape_configs:
    value:
    - ca: |
        -----BEGIN CERTIFICATE-----
        <REDACTED>
        -----END CERTIFICATE-----
      insecure_skip_verify: true
      scrape_job: |
        job_name: 'ec2'
        scheme: https
        ec2_sd_configs:
        - region: us-west-2
          port: 9100
        relabel_configs:
        # Only monitor instances with a MetricsKey tag matchine mine
        - source_labels: [__meta_ec2_tag_MetricsKey]
          regex: pws dark ci bwAvFbeC5X
          action: keep
        - source_labels: [__meta_ec2_tag_Name]
          target_label: name_tag
        - source_labels: [__meta_ec2_availability_zone]
          target_label: availability_zone
        - source_labels: [__meta_ec2_instance_id]
          target_label: instance_id
      server_name: null
    - ca: |
        -----BEGIN CERTIFICATE-----
        <REDACTED>
        -----END CERTIFICATE-----
      insecure_skip_verify: false
      scrape_job: |-
        job_name: 'fluentd'
        metrics_path: /aggregated_metrics
        static_configs:
        - targets:
          - fluentd.<REDACTED>.com:9200
      server_name: null

When uploading/configure HW using 6.1.1, both expected jobs show up in the HW config
when uploading/configure HW using 6.1.2, only the second job appears in the config
This was on OpsMan 2.9.11 using om 6.3.0

@mrdavidlaing
Copy link

I suspect the issue has to do with the nested collections (eg scrape_job.relabel_configs) - that wasn't a use case I expected / tested when working on the original PR(s)

Let me know if I can help.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests