Normalize Bottle Labels #210

Open
dcramer opened this issue Aug 3, 2024 · 16 comments

dcramer commented Aug 3, 2024

We're going to take a clean pass at bottling and solve this once and for all.

The plan is the following:

  • Expand bottle attributes to cover just about everything (see further down)
  • Add a 'parent' bottle concept, allowing us to group variants of bottles (see Kill batch information #209). This could be a parentId attribute, or simply a Variant Table (some complexity here).
  • Expose parents by default in search, with the ability to also expose children (e.g. show the parent, select also children of any matching parents that match via search, and show up to a few per parent in the results)
  • Automatically create the parent/children based on the bottling information added.
  • Expose an "Add Variant" (name TBD) flow, that takes the core bottle parent information, and allows you to focus on filling the specific edition details.
  • Focus bottle pages on the parents as the primary results, but have parents show all children, and have children clearly show that they're part of a larger series.
  • We will have situations where there is one parent/child variant, and we should just treat that like it's a normal bottle, but with the ability to add variants (this is going to matter in the tasting/collection flow probably).
  • Everything deterministic, when possible, should be pulled out of the bottle name and into attributes. If these attributes can vary within a series, that means they should be part of the variant's unique constraints AND not present on the parent bottle.
  • Bottles may have a "core" bottle, so we need to make sure that's represented well. For example Laphroaig 10 is the core, but there might be a variant that's the 225th Anniversary Edition.

In general, I think what we're trying to do here is implicitly create "series" but in a more structured manner than something like Whiskybase does.

For bottle attributes, one of the biggest things we have to determine is which attributes enforce that it's a variant. This is primarily going to be single cask focused.

Here's a list with the help of ChatGPT. Some of these are deterministic via the name, others are not whatsoever.

  • Brand Name: The Macallan
  • Bottle Name: 12-year-old
  • Edition: 225th Anniversary
  • Type: Single Malt, Single Grain, Blend, Bourbon, Rye, etc.
  • Cask Strength?: Cask Strength, Barrel Proof, Full Proof, Natural Strength, Original Strength, Undiluted
  • Single Cask?: Single Cask, Single Barrel, Cask No., Barrel No., Selected Cask
  • Stated Age: 12
  • ABV: 40.5
  • Proof: (can be different than abv, ffs)
  • Cask Type: ex-Bourbon, Sherry, Port
  • Distiller: (distilled at X)
  • Batch: Batch 5, Batch A, Batch No. 1
  • Barrel No.: 3 (only if single cask)
  • Release Year: 2024 Release
  • Vintage Year (aka Distillation Year): 2024 Vintage
  • Bottling Year: [sometimes the same as release year, otherwise vintage+age]
  • Finish: Oak, Sherry, Port
  • Total Bottles: 1,024
  • Natural Color?:
  • Non-Chill Filtered?:
  • Availability:

Things that show up in the name and are non deterministic:

  • Edition: 2025 Edition - could mean release, vintage, or bottling date.
  • Generic Flavor Text: "Whiskey", "Whisky", "Scotch" - generally regional information SOMETIMES part of the bottle name (e.g. 291 Colorado Bourbon)

Some unknowns:

  • "Straight Bourbon" / Rye - how does this get classified? feels too nitpicky to make them their own category
  • Bottled in Bond?: This is a yes/no and is similar to "is this scotch". It's often part of the bottle name. Kind of feels a little too nitpicky to pull out.

Other things we should consider:

  • Awards:

Realistically most of these are going to be focused on the variants. The parent probably only includes a few key items (which cannot change between children):

  • Brand Name
  • Bottle Name
  • Type
  • Cask Strength (TBD - this fits in ABV, but is that per edition? cask strength is drastically different though..)
  • Finish? (from a UX pov it makes sense these would be editions, but from a unique bottle they're wildly different characteristics)
  • Single Cask
  • Stated Age - some distillers have different ages for each release
  • Distiller (Should we make Distiller unique on the parent? what about bottlers series/etc?)
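
A minimal sketch of the parent/variant split described above, assuming a flat bottle record where the parent is derived from a subset of fields. The class names, the field names (including `category` standing in for "Type"), and the choice of which fields belong to the parent are all hypothetical, and the unresolved items (Finish, Distiller) are left out:

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ParentKey:
    """Attributes assumed to be constant across every child of a parent."""
    brand: str
    name: str
    category: Optional[str] = None      # "Type" in the list above
    stated_age: Optional[int] = None
    single_cask: Optional[bool] = None
    cask_strength: Optional[bool] = None


@dataclass(frozen=True)
class Bottle:
    """A fully described bottle; the extra fields are variant-level details."""
    brand: str
    name: str
    category: Optional[str] = None
    stated_age: Optional[int] = None
    single_cask: Optional[bool] = None
    cask_strength: Optional[bool] = None
    edition: Optional[str] = None
    batch: Optional[str] = None
    barrel_no: Optional[str] = None
    release_year: Optional[int] = None
    vintage_year: Optional[int] = None

    def parent_key(self) -> ParentKey:
        # Copy only the fields that ParentKey declares.
        return ParentKey(**{f.name: getattr(self, f.name) for f in fields(ParentKey)})


core = Bottle(brand="Laphroaig", name="10-year-old", stated_age=10)
anniversary = Bottle(
    brand="Laphroaig", name="10-year-old", stated_age=10,
    edition="225th Anniversary",
)
# Distinct bottles that group under the same parent.
print(core.parent_key() == anniversary.parent_key(), core == anniversary)
```

Deriving the parent from the attributes, rather than asking for it explicitly, is one way to read "automatically create the parent/children based on the bottling information added".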

dcramer commented Aug 3, 2024

For the variants, I think Ardbeg is a really strong litmus test, specifically this bottle:

https://peated.com/bottles/40876

Another important thing we must consider: casks are sometimes important, sometimes not. Every SMWS bottle is single cask, so we need to make sure the case of "a single variant" is really smooth and looks no different than a single non-variant bottle flow (e.g. Laphroaig 10).

One last thing that needs to be thought about now that I'm writing this out:

What about special releases of typical labelings? IMO they should probably go under a variant. e.g. the 225th Anniversary Edition.

dcramer commented Aug 3, 2024

a note about proof/abv/etc:

These are not primary variant factors, so we may also need to determine what makes a variant unique, and what data should be approximate.

For example, if the proof is "120-130", we don't actually care. That's not enough to create a new variant, it's just variable (as is expected, tbqh). So do we even store proof? ABV is an important thing to some degree, but how do we deal w/ the fact that it varies?
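
One possible way to handle this, sketched under assumptions: keep ABV out of the fields that decide whether two records are the same variant, and store it as an observed range instead. `IDENTITY_FIELDS`, `identity_key`, and `merge_abv` are hypothetical names, and the field list is only a guess at what might define a variant:

```python
from typing import Optional, Tuple

# Hypothetical set of fields that define a variant; abv and proof are
# deliberately absent so that they can vary without creating a new variant.
IDENTITY_FIELDS = ("brand", "name", "edition", "batch", "barrel_no", "vintage_year")


def identity_key(bottle: dict) -> Tuple:
    """Build the tuple used to decide whether two records are the same variant."""
    return tuple(bottle.get(field) for field in IDENTITY_FIELDS)


def merge_abv(abv_range: Optional[Tuple[float, float]], abv: float) -> Tuple[float, float]:
    """Widen the stored (low, high) ABV range to include a new observation."""
    if abv_range is None:
        return (abv, abv)
    low, high = abv_range
    return (min(low, abv), max(high, abv))


first = {"brand": "Example Brand", "name": "Cask Strength", "abv": 60.0}
second = {"brand": "Example Brand", "name": "Cask Strength", "abv": 65.0}

if identity_key(first) == identity_key(second):
    # Same variant: record the spread instead of adding a new row.
    abv_range = merge_abv(merge_abv(None, first["abv"]), second["abv"])
    print(abv_range)
```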

dcramer commented Aug 3, 2024

The now most critically important question: what do I call them?

Thinking 'edition' for now.

dcramer commented Aug 4, 2024

Working branch is feat/editions.

Going to keep the 'bottle' table for the aggregations, and use the new bottle_edition table - first to copy all the existing data into it - and eventually to be the canonical reference for full bottle details.

Every existing bottle will at minimum have one row in bottle_edition, and then we'll collapse a bunch of bottles into each other.

dcramer commented Aug 6, 2024

Rethinking this with fresh eyes this morning.

  1. We'll keep bottle as is, and expand its attributes.
  2. We'll add (likely) a parentId to bottle.

This should make it cleaner to actually get this change done, as right now, looking at renaming things and breaking up variants from bottles.. it's just too many changes and it's not completely objective.

For the bottle details page, this means you'll still be able to permalink every bottle, and we'll simply add an "Editions" (tbd) section on it that shows the other bottles. That will show both for the parent bottles as well as all other editions.

We'll also still need the 'edition' (nullable) string column on the bottles table.
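
A rough sketch of how the "Editions" section could be populated from a parentId column, assuming a single level of nesting (children never have children). The `Bottle` class and `related_editions` helper are illustrative names only, with `parent_id` and `full_name` standing in for parentId and fullName:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bottle:
    id: int
    full_name: str
    parent_id: Optional[int] = None  # None means the bottle is itself a parent
    edition: Optional[str] = None    # nullable, as described above


def related_editions(bottle: Bottle, all_bottles: List[Bottle]) -> List[Bottle]:
    """Return every other bottle in the same family (the parent plus its children)."""
    root_id = bottle.parent_id if bottle.parent_id is not None else bottle.id
    return [
        other for other in all_bottles
        if other.id != bottle.id
        and (other.id == root_id or other.parent_id == root_id)
    ]


bottles = [
    Bottle(1, "Laphroaig 10-year-old"),
    Bottle(2, "Laphroaig 10-year-old 225th Anniversary", parent_id=1, edition="225th Anniversary"),
    Bottle(3, "Laphroaig 10-year-old Sherry Oak Finish", parent_id=1, edition="Sherry Oak Finish"),
]
# The same helper works whether the page being viewed is the parent or a child.
for bottle in related_editions(bottles[1], bottles):
    print(bottle.full_name)
```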

dcramer commented Aug 6, 2024

Pushed singleCask and caskStrength flags (and name detections).

Working on getting edition in now, and migrating BottleAlias.name to be a mirror of Bottle.fullName, which means Bottle.fullName will become less used in the UI (e.g. when we want to break up "Laphroaig" "12-year-old" and "225th Anniversary" components).

dcramer commented Aug 7, 2024

I'm realizing my primary issue is likely from trying to generate a unique label as a string.

Let's take this random 40 year:

Tomatin 40-year-old

The vintage year matters, more so than it does with many others. Do you have to duplicate the vintage year into the edition now? That's silly. What you want to do is just fill out the bottle information in as much detail as you can, and have the system understand if it's a duplicate or not.

The problem is two things:

  1. Human-readable names
  2. Missing information that could exist in the future

The first issue I think can be addressed through generated names. We can look at all bottles in a series on a write, and generate a descriptive name (particularly with the subtext field). Or we can be dumb about it for now and just do some rule-based heuristics for the display name.

The second issue is likely just going to need dupe detection. There's various techniques we can use to identify duplicates, help merge them, and help avoid future duplicates. Mostly this comes down to making the bottle search and add bottle flows very easy to identify potential matches.
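
As one illustration of the dupe detection idea (not necessarily the technique that will be used), a fuzzy string comparison over normalized names can surface likely matches while someone is adding a bottle. This sketch uses Python's standard `difflib`; the `normalize` and `possible_duplicates` helpers and the 0.85 threshold are assumptions:

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", name.lower())
    return " ".join(cleaned.split())


def possible_duplicates(
    candidate: str, existing: List[str], threshold: float = 0.85
) -> List[Tuple[str, float]]:
    """Return existing names whose similarity to the candidate meets the threshold."""
    target = normalize(candidate)
    scored = [
        (name, SequenceMatcher(None, target, normalize(name)).ratio())
        for name in existing
    ]
    matches = [pair for pair in scored if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


existing_names = ["Tomatin 40-year-old", "Kilchoman Spring Release 2010"]
# "40 Year Old" and "40-year-old" normalize to the same string.
print(possible_duplicates("Tomatin 40 Year Old", existing_names))
```

String similarity alone cannot tell two vintages of the same label apart, so structured attributes would still need to be compared before merging anything.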

So I think the next step, after I clean up some data, is likely to figure out the unique constraint solution.

I'll try to keep edition one field for now, and continue to overload it with batch/series/etc information.

dcramer commented Aug 7, 2024

Fresh eyes this morning, I have a mental model for how to deemphasize editions in the database (thus removing a lot of the noise for beginners). The core concern that I still need to solve to pull this off, though, is the approach to naming editions.

Right now there are a lot of variables in play, but effectively we need the Bottle.name to become the bottling series, and the Bottle.edition to become the descriptor of the individual bottle.

I want to take a common scenario that poses the UX problem I'm having:

Angel's Envy Cask Strength 2020

  • Angel's Envy Cask Strength is the series
  • 2020 is the edition

However, in this case, 2020 is also the Release Year. I wanted to avoid filling in duplicate details - we already have some silliness with the name vs statedAge. Maybe we should just ignore the release year field as a goal right now though? Force filling in the edition for these variable details, try to push the user to enter the right information, and then build some tooling to improve over time.

dcramer commented Aug 7, 2024

Two open scenarios that are more tricky:

  1. Tomatin 12-year-old, and Tomatin 12-year-old Sherry Cask. Are these separate bottles or just separate editions? I lean towards the former, but where do we draw the line?

  2. Diageo Special Releases - often these are normal bottles with a limited release. They clearly seem to imply an edition of a bottle; is the edition "Diageo Special Releases 2023", as an example?

dcramer commented Aug 7, 2024

Here's a thought exercise:

  • Name becomes Series
  • Edition stays optional

Exercise:

  • [Brand: Laphroaig] [Series: 10-year-old] [Edition: null]
  • [Brand: Laphroaig] [Series: The Ink of Legends] [Edition: 2023]

Ok those two work, but we're still stuck here:

  • [Brand: Laphroaig] [Series: 10-year-old] [Edition: Sherry Oak Finish]

dcramer commented Aug 8, 2024

Some more challenges:

Kilchoman Spring Release 2010

What's the series name? Spring Release?

dcramer commented Aug 8, 2024

One obvious rule we can add:

  • Parse out everything (do not ever show age statement or category in the name by default)
  • Only if there is no name, insert [AgeStatement Category]
  • Do not allow only age statement, do not allow only category (assuming both are present)
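
The rule above can be sketched roughly as follows. This is an assumption-heavy illustration: the `display_name` function, the age regex, and the "N-year-old" fallback formatting are hypothetical, and real labels will need more patterns than this handles:

```python
import re
from typing import Optional

# Matches age statements such as "12-year-old", "12 Year Old", or "12 year".
AGE_PATTERN = re.compile(r"\b(\d{1,2})[- ]years?(?:[- ]old)?\b", re.IGNORECASE)


def display_name(raw_name: str, stated_age: Optional[int], category: Optional[str]) -> str:
    # Parse the age statement and the category text out of the name.
    name = AGE_PATTERN.sub(" ", raw_name)
    if category:
        name = re.sub(re.escape(category), " ", name, flags=re.IGNORECASE)
    name = " ".join(name.split())
    if name:
        return name
    # Nothing is left, so fall back to "[AgeStatement Category]",
    # including both parts whenever both are known.
    parts = []
    if stated_age is not None:
        parts.append(f"{stated_age}-year-old")
    if category:
        parts.append(category)
    return " ".join(parts)


print(display_name("12-year-old Single Malt", 12, "Single Malt"))
print(display_name("Sherry Cask 12-year-old", 12, "Single Malt"))
```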

This doesn't solve for "what about the release/batch/edition". We could continue to keep that as a separate field.

None of this helps us with deduping yet, or creating those series concepts.

dcramer commented Aug 9, 2024

Need to determine if finish is a worthwhile field to add. It's really equiv to edition for how we want to utilize it in a lot of ways, but I don't know that we'd want to aggregate different finishes together within the same series.

dcramer commented Aug 22, 2024

After living with this for a couple weeks, I'm not sold on this editions field. Sure it's hypothetically better to dedupe things, but it feels more tedious from a manual input point of view. You're sitting there, staring at a bottle label, and you just have to ask yourself "wtf is the name and wtf is the edition". That's not fun.

I may revert it and combine edition back into name. Doesn't mean we can't still pull out some of the things above.

dcramer changed the title from "Expand Bottles to Aggregate Variants" to "Normalize Bottle Labels" on Dec 3, 2024

dcramer commented Dec 3, 2024

I think the best approach here is going to be to do the following:

  • Adjust "name" to focus on "label" - the full bottle label (normalized)
  • Utilize an LLM prompt for normalization

Some thoughts:

  • Adjust "edition" to clarify its for edition, series, expression?
  • Provide context to the LLM of existing examples from a given brand, which will hopefully help normalize the data
  • Search will need to be drastically improved, TBD what to do here.

We'll need to be conscious of when we use the full label, as they're quite long and aren't going to render well in a LOT of scenarios. An example of a full label is The Macallan 12-Year 225th Anniversary Single Malt Scotch Whisky.

First passes at building a prompt have not been going super well, as the outputs are still unreliable (e.g. sometimes it'll remove flavor text, like the spirit type, but other times it'll leave part of it in, like "Islay").

I still think we want some kind of aggregate, particularly so you can record something like a tasting when you don't know which release you're trying. Take any of these editions where they release a stated year bottling that's effectively the same as other years.

dcramer commented Feb 21, 2025

Still struggling on this one, and probably the biggest blocker for progression right now.

I've done a bit of work on the moderation queue, and have been testing some prompts with OpenAI's LLMs. The two main things that have to be solved for are the overall normalization schema (which is going to require this ruleset), and then another secondary approach that takes an abstract bottle label and is able to plug it into a search scheme.

Example of prompt I was using to test some simple normalization technique to help me on the moderation queue:

I have a label of a whiskey bottle that I'm trying to match up against an existing database. To do this I need to do two things:

1. Separate the "Brand" of the whisky from the "Name" of the bottle. The brand has to be associated separately. For example "Jack Daniel's" is a brand.

2. Remove common words such as "Bourbon", "Kentucky", and "Single Malt", as they're not always included in the final bottle name.

3. Finally, if this is something like a specific cask (such as a Store Pick), identify that as we likely don't have it in the database.

Given the above, please process the following bottle into "brand", "query", "namedCask" fields:

3 Howls Backbeat Bourbon

(The main thing is being able to actually identify the brand vs the rest of the bottle label)
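
For comparison, here is a rule-based baseline that produces the same three fields without an LLM. This is not the prompt-based approach being tested above, just a sketch of the target output shape. It assumes a list of known brands is available (including "3 Howls" as the brand in the example), and the `parse_label` helper, the word list, and the cask markers are all hypothetical:

```python
import re
from typing import Dict, List, Optional

COMMON_WORDS = ["single malt", "bourbon", "kentucky"]          # from the prompt, not exhaustive
CASK_MARKERS = ["store pick", "single barrel", "single cask"]  # hypothetical markers


def parse_label(label: str, known_brands: List[str]) -> Dict[str, object]:
    lowered = label.lower()
    # Prefer the longest known brand that the label starts with.
    brand: Optional[str] = None
    for candidate in sorted(known_brands, key=len, reverse=True):
        if lowered.startswith(candidate.lower()):
            brand = candidate
            break
    rest = label[len(brand):] if brand else label
    named_cask = any(marker in lowered for marker in CASK_MARKERS)
    # Drop the common words so the query is closer to the stored bottle name.
    for word in COMMON_WORDS:
        rest = re.sub(rf"\b{re.escape(word)}\b", " ", rest, flags=re.IGNORECASE)
    return {"brand": brand, "query": " ".join(rest.split()), "namedCask": named_cask}


print(parse_label("3 Howls Backbeat Bourbon", ["3 Howls", "Jack Daniel's"]))
```

A baseline like this fails exactly where the comment above says the hard part is: when the brand is not already known, nothing in the label marks where the brand ends and the name begins.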
