- autoincrement primary key
    - autoincrement to prevent reuse of `rowids` when freed up ... which might be valuable for data integrity over the lifetime of the database
        - though it apparently takes up more CPU, addition of tags won't be common, so it shouldn't be a problem
- long form (tag + parent_id)
    - Use recursive queries to get full parent path
    - Ensure uniqueness across tag name and parent_id
        - See SQLite docs on conflict responses: https://www.sqlite.org/lang_conflict.html
        - Getting uniqueness for root tags is tricky, as NULL is always unique!
            - Need an additional check
    - Use a trigger to automatically create the full parent string for each tag
        - uses a large recursive query ... could join the full parent string onto the tags table
        - The main purpose of this is to have a good mapping between the unique tag and a textual representation of the tag
        - As adding and removing tags will be relatively infrequent, automatically deriving full parent paths should be fine.
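One way to get the missing root-level check is a partial unique index, since a plain `UNIQUE (tag, parent_id)` constraint never fires for rows where `parent_id` is NULL. A minimal sketch (column names are assumptions based on these notes):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tags (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        tag TEXT NOT NULL,
        parent_id INTEGER REFERENCES tags(id),
        UNIQUE (tag, parent_id)  -- does NOT catch duplicate root tags
    );
    -- NULL != NULL, so enforce root-tag uniqueness with a partial index
    CREATE UNIQUE INDEX root_tag_unique ON tags(tag) WHERE parent_id IS NULL;
""")
con.execute("INSERT INTO tags(tag, parent_id) VALUES ('python', NULL)")
try:
    con.execute("INSERT INTO tags(tag, parent_id) VALUES ('python', NULL)")
    duplicate_root_allowed = True
except sqlite3.IntegrityError:
    duplicate_root_allowed = False
print(duplicate_root_allowed)  # False
```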
- Architecture
- ?? Initialise tags table with a single tag: `root` with parent `NULL` ??
    - `root` is the parent of all top level tags (and so used as the filter for top level tags)
- Create trigger on `update`, `delete` and `insert` (on tags table)
    - trigger runs:
        - `delete` to clear the `full_tag_paths` table
        - `insert` on the `full_tag_paths` table using the recursive query
```sql
with recursive tags_parents(id, tag, parent_id, parent) as
(
    select a.id, a.tag, a.parent_id, b.tag as parent
    from tags a
    left join tags b
        on a.parent_id = b.id
),
pnts(id, tag, parent_id, id_path, tag_path) as
(
    select
        id, tag, parent_id,
        ifnull(parent_id, '-') as id_path, ifnull(parent, '-') as tag_path
    from tags_parents
    where parent_id is NULL
    union
    select
        m.id, m.tag, m.parent_id,
        ifnull(pnts.id_path, '-') || '/' || pnts.id as id_path,
        ifnull(pnts.tag_path, '-') || '/' || pnts.tag as tag_path
    from tags_parents m
    join pnts
        on pnts.id = m.parent_id
    order by m.parent_id desc
)
select
    id, tag, parent_id,
    id_path || '/' || id as full_path,
    tag_path || '/' || tag as full_tag_path
from pnts
```
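The delete-then-rebuild trigger pattern can be sketched as below, using a simplified form of the recursive query above (the schema details are assumptions; note too that `WITH` clauses inside trigger bodies need a reasonably recent SQLite, 3.34+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tags (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        tag TEXT NOT NULL,
        parent_id INTEGER REFERENCES tags(id)
    );
    CREATE TABLE full_tag_paths (id INTEGER, tag TEXT, full_tag_path TEXT);
    -- on insert (same idea for update/delete): clear and rebuild the derived table
    CREATE TRIGGER tags_path_rebuild AFTER INSERT ON tags BEGIN
        DELETE FROM full_tag_paths;
        INSERT INTO full_tag_paths
        WITH RECURSIVE pnts(id, tag, parent_id, tag_path) AS (
            SELECT id, tag, parent_id, tag FROM tags WHERE parent_id IS NULL
            UNION
            SELECT m.id, m.tag, m.parent_id, pnts.tag_path || '/' || m.tag
            FROM tags m JOIN pnts ON pnts.id = m.parent_id
        )
        SELECT id, tag, tag_path FROM pnts;
    END;
""")
con.execute("INSERT INTO tags(tag, parent_id) VALUES ('python', NULL)")
con.execute("INSERT INTO tags(tag, parent_id) VALUES ('std_lib', 1)")
paths = [row[2] for row in con.execute(
    "SELECT * FROM full_tag_paths ORDER BY id")]
print(paths)  # ['python', 'python/std_lib']
```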
- Tags will be added to notes through YAML frontmatter
    - Comma separated tags so that a note can have multiple tags
    - Written as a full path (eg `python/std_lib/datetime`)
- Note -> database
    - tag path matched against `full_tag_paths`, which provides the actual tag id that relates directly back to the `tags` table
    - `note_tags` table updated with note id and tag id
- database -> Note
    - Search for a substring in all tag paths, and select manually
    - eg: `[tag_path for tag_path in full_tag_paths if "query" in tag_path]`
- New tag -> database
    - Select parent (incl `null` for a root level tag) from full_tag_paths, then append the new child
    - Can do this by parsing the text directly:
        - presume full path!
        - find the longest already existing path within the new path, starting at position 0 of the new path
            - What if no match? Then every tag is new, and the first is at root level ... handle in the adding stage
        - get id of parent_path. If not present, then parent_path is null
        - Split the remainder of new_path into tags
        - For each new tag, add with the previous parent as parent
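The parsing steps above might be sketched roughly like this (function and variable names are hypothetical, not from the codebase):

```python
def split_new_path(new_path, existing_paths):
    """Split a full tag path into (longest existing prefix, remaining new tags).

    Returns (None, all_tags) when no prefix exists, ie every tag is new
    and the first one belongs at root level.
    """
    parts = new_path.split("/")
    # check prefixes from longest to shortest, anchored at the start of new_path
    for i in range(len(parts) - 1, 0, -1):
        prefix = "/".join(parts[:i])
        if prefix in existing_paths:
            return prefix, parts[i:]
    return None, parts

existing = {"python", "python/std_lib"}
print(split_new_path("python/std_lib/datetime", existing))
# ('python/std_lib', ['datetime'])
print(split_new_path("history/rome", existing))
# (None, ['history', 'rome'])
```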
- Files stored fully with FTS
    - FTS on top of actual storage in a separate source table: https://kimsereylam.com/sqlite/2020/03/06/full-text-search-with-sqlite.html, https://www.sqlite.org/fts5.html#external_content_tables
        - Use triggers to automatically update the FTS from the base table: https://sqlite.org/lang_createtrigger.html
    - Use YAML metadata for title, tags
        - store metadata in the database separately from the body
    - Store latest update date to compare with changes on disk
        - store mod time as a string of the timestamp ... to rely entirely on python datetimes
    - store id as int of timestamp ... only second resolution, but YAGNI for greater resolution (for now) ... maybe increase later, or if not unique, wait?
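The external-content pattern from those links might look like the following (an illustrative sketch: the column names are assumptions, and the delete trigger uses the special `'delete'` command from the fts5 docs):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE notes (id INTEGER PRIMARY KEY, title TEXT, body TEXT);
    -- fts5 index over notes, without duplicating the stored text
    CREATE VIRTUAL TABLE notes_fts USING fts5(
        title, body, content='notes', content_rowid='id');
    CREATE TRIGGER notes_ai AFTER INSERT ON notes BEGIN
        INSERT INTO notes_fts(rowid, title, body)
        VALUES (new.id, new.title, new.body);
    END;
    CREATE TRIGGER notes_ad AFTER DELETE ON notes BEGIN
        -- external-content tables are updated with special 'delete' commands
        INSERT INTO notes_fts(notes_fts, rowid, title, body)
        VALUES ('delete', old.id, old.title, old.body);
    END;
    -- an AFTER UPDATE trigger combines the two statements above
""")
con.execute(
    "INSERT INTO notes VALUES (20201016154437, 'Python', 'notes on the std lib')")
hits = con.execute(
    "SELECT rowid FROM notes_fts WHERE notes_fts MATCH 'python'").fetchall()
print(hits)  # [(20201016154437,)]
```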
- References
    - Bibliographical citations can be added freely to any note.
    - To be more structured ... each source can get its own note, possibly tagged as a "source" (bigger topic of what to use tags for). Citing a source is then a link to a source note, which can also contain a general summary of the source content if desired.
- Stay as close to markdown syntax as possible: `Here is a zettelkasten link: [My First Link](/20201016154437)`
    - Title of the note is the link title
        - This need not match the note's actual title, as that is not static.
        - The link title can be just for convenience within the context of the note
        - For making new notes from links, through some hotkey, whatever title is provided will have to be used for the actual title of the newly created note.
            - The newly created id will have to be inserted automatically
    - The URL is a forward slash (`/`) followed by the unique ID
    - This should have the following qualities:
        - Fits into ordinary markdown
        - Easily detectable as distinct from ordinary links
        - Expressive
    - Matching pattern: `\[[\w ]*\]\(\/\d{12,14}\)`
        - Or, alternatively, to allow for any characters in the title: `\[.*\]\(\/\d{12,14}\)`
    - If/when creating a new note with such syntax, the title will have to be checked to be only ASCII
        - pretty easy in python
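In python, the first pattern above picks out zettel links while ignoring ordinary markdown links:

```python
import re

# title of word chars/spaces, URL of "/" plus a 12-14 digit id (as above)
ZK_LINK = re.compile(r"\[([\w ]*)\]\(\/(\d{12,14})\)")

text = (
    "A zettel link [My First Link](/20201016154437) "
    "and an ordinary link [docs](https://www.sqlite.org)."
)
print(ZK_LINK.findall(text))  # [('My First Link', '20201016154437')]
```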
- While writing a note, two hurdles might occur:
    - Inserting a link by id is difficult, as the ID is hard to remember
    - Inserting a link to a note you haven't created yet but want to.
- For the first problem, the only two alternatives are:
    - Parse links with titles only and match the written title against the database. Sounds good and easy, as titles are easier to remember, but in practice I imagine it will go wrong frequently
    - Use an automated process (through a text editor for instance) for searching for notes and inserting appropriate links
        - Would be good to ensure that this facility is not bound to any text editor and can be run through a CLI or web API
- For the second problem, the two solutions are similar in nature to those above:
    - Parse for links that lack an ID (but have a title!), and create those new notes when the note is parsed (perhaps optionally so)
        - I actually like this. It can be a clean fallback which can be run at all times in the process of parsing notes. Parse for `new_links`, create new notes, go back and replace new links with actual links (with ids). Kind of a nice-to-have at this point though.
    - A quick automated process for creating a new note from some easily selectable text (again, inside a text editor, but it should not be bound to the text editor's API)
- Written in YAML (ie, YAML frontmatter)
    - Pretty easy to manually parse
    - Fields:
        - title
        - tags
        - extendable in the future? ... really depends on the parsing

```markdown
---
title: Python
tags: tech
---

## Content Body

This is a link: [Link](/20201016154437)
```

- Frontmatter can be extracted with the pattern `(?s)^---.*---` (probably better non-greedy, `(?s)^---.*?---`, so that a later `---` in the body isn't swallowed)
    - `(?s)` sets the dot-matches-newlines flag (equivalent to `re.S`).
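A minimal parse of this frontmatter, using the non-greedy form of the pattern (the simple `key: value` splitting is an assumption that holds for the fields above):

```python
import re

NOTE = """---
title: Python
tags: tech
---

## Content Body

This is a link: [Link](/20201016154437)
"""

# non-greedy, so a later "---" in the body is not swallowed
FRONTMATTER = re.compile(r"(?s)^---\n(.*?)\n---")

block = FRONTMATTER.match(NOTE).group(1)
metadata = dict(line.split(": ", 1) for line in block.splitlines())
print(metadata)  # {'title': 'Python', 'tags': 'tech'}
```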
- FTS on title
- FTS on text
- shallow and deep children of a note
- shallow parents of a note
- shallow and deep notes of a particular tag
    - deep means including all tags that are children of the target
    - unsure exactly how to do this with the current scheme
- Combine tag queries with the rest of the above
    - This ... could be non-trivial ... especially with the FTS ...
        - How to combine complex queries with FTS queries? Sub queries and joins?
        - How to use tags with other queries? may want to rethink the tags model ... depends on performance at scale really
            - instead, could leverage FTS and have a tags column with comma separated tags (including tag paths with slashes: `parent_tag/child_tag`), using either FTS or `LIKE` matching: https://sqlite.org/lang_expr.html#the_like_glob_regexp_and_match_operators
                - insertion of tag strings could be managed by python, checking the existing tag tables and parent joins, and inserting as appropriate. But then updating and deleting notes and/or tags could get very tricky.
            - Instead ... could flip the process:
                - Keep the current model, but add ...
                    - first, a table of all unique combinations of tags
                    - second, assign to each note a single tag id in a single column in the notes table
                - Adding a tag to a note adds the unique combination if it does not exist, then adds this id to the note
                - But ... how to have unique combinations of arbitrary length?
- Use python files (why not!)
    - Actually, probably better to use `INI` files as supported by the standard library
- Use python import machinery to load an arbitrary file (see code below)
- Have up to three locations for a config file:
    - Package default file loaded as the fallback
    - Home directory
    - Current working directory (??)
        - not sure it's worth it
        - Idea would be to have locally independent working spaces defined by local config files
- Options
    - zk_path: path of database/filebase (`zekellbase`)
    - alt_paths: support multiple zekellbases (nice to have really!)
        - paths and names
        - currently selected zekellbase for all commands to run against
```python
import importlib
import importlib.machinery
import importlib.util

# provide absolute path of config
loader = importlib.machinery.SourceFileLoader(
    'test', '/Users/errollloyd/Developer/zekell/.zekell_config')
spec = importlib.util.spec_from_loader(loader.name, loader)
mod = importlib.util.module_from_spec(spec)
loader.exec_module(mod)

# mod.VARIABLE
```
- To ensure consistency between the file-base and the database, a few batch functions will probably be necessary at some point
    - Check that all notes in the database correspond to all files in the file-base
    - Add all files not in the database
    - Remove all files not in the database
- To update links:
    - driven by notes, and therefore from files (for now).
    - do a full refresh of all links from a note:
        - delete all with the current note as parent
        - add links in the current note
        - update the `db.ex()` function to allow batch statements so that the initial removal is rolled back should anything go wrong with the insert
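A sketch of what such a batch mode could look like (`ex_batch` is a hypothetical name, not the actual `db.ex()` implementation), relying on the sqlite3 connection context manager to make the delete-plus-insert atomic:

```python
import sqlite3

def ex_batch(con, statements):
    """Run several (sql, params) statements in one transaction.

    If any statement fails, all of them (eg the initial delete)
    are rolled back together.
    """
    with con:  # commits on success, rolls back on any exception
        return [con.execute(sql, params) for sql, params in statements]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE note_links (parent_note_id, child_note_id)")
con.execute("INSERT INTO note_links VALUES (1, 2)")
con.commit()
try:
    ex_batch(con, [
        ("DELETE FROM note_links WHERE parent_note_id = ?", (1,)),
        ("INSERT INTO no_such_table VALUES (1)", ()),  # fails on purpose
    ])
except sqlite3.Error:
    pass
count = con.execute("SELECT count(*) FROM note_links").fetchone()[0]
print(count)  # 1: the delete was rolled back
```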
- To update note_tags:
    - same as for links
- To update tags:
    - Tricky!
    - this is to update the names or even the parent(s) of tags which are already assigned to a note.
        - For tags not assigned to a note, essentially a simple update
    - to change the name of a tag would require updating the text of every note, reliably.
        - update the tag column in tags
        - using regular expressions (`re.sub()`), iterate through every note with the specific tag in note_tags, find the tag in the frontmatter, and replace it with the new tag.
        - As the ids of the tag and notes stay the same, nothing else needs to be changed
    - to change the parent tag of a tag (ie, to move it to a new parent)
        - it means the whole path will change
        - the old parent will stay intact (removal of that would be another process)
        - create the new tag under the desired parent
        - as above, go through all necessary notes to change the path of the original tag text in the frontmatter
    - To change any/all components of a tag's path
        - chain the processes above together essentially
        - create the new tag as one would create a new tag path
        - iterate through all old notes and substitute the old text with the new
    - The central challenge is reliably substituting the old text in the frontmatter with the new tag path
        - Engineering the regular expression will be important but doable
        - enabling a note backup and rollback process is probably a good idea.
- Simple remove, conditional on whether any rows in `note_tags` contain the tag
    - Perhaps a fancy function at some point that will automatically remove such tags from the text of notes
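The frontmatter substitution might look like the following (a hypothetical helper, assuming the single comma separated `tags:` line described earlier; anchoring on `tags:` keeps the substitution out of the note body):

```python
import re

def rename_tag(note_text, old_path, new_path):
    """Replace old_path with new_path on the frontmatter tags line only."""
    def fix(match):
        tags = [t.strip() for t in match.group(1).split(",")]
        return "tags: " + ", ".join(
            new_path if t == old_path else t for t in tags)
    # count=1: only the first (frontmatter) tags line is touched
    return re.sub(r"tags: (.*)", fix, note_text, count=1)

note = "---\ntitle: Python\ntags: tech, python/std_lib\n---\nbody"
print(rename_tag(note, "python/std_lib", "python/stdlib"))
# ---
# title: Python
# tags: tech, python/stdlib
# ---
# body
```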
- Move all creation SQL code to `schema.sql`
    - Adjust tests accordingly
- Create files with FTS
- Create references table
- create note-tags table
- create tags hierarchy auto-make trigger
- make trigger occur on insert, update and delete
- write tests for full_path_tags triggers
- Parse files for metadata
- Add to table
- Include update and delete functionality
- Add to files/notes table
- Sort out SQL commit and close workflow!!
    - Just commit on all calls to `ex()`?
        - Yea ... this ... already done with context manager
    - Or, something more sophisticated?
- add to note_tags table
- Use proper updating code for tags and links
- Add batch statements in db.ex()
- Update links instead of merely adding
- update note_tags instead of merely adding
- Enable adding whole note from scratch
- Enable configuration for locating database etc
- Utility function for batch update of database from notes
- add file to database
- batch add multiple files to database (taking care of linking issues)
- find all files ahead of database (using mod time and os modtime?)
- batch add all files ahead of database
- Clean up tag paths (especially in `add_tag_path`)
    - probably best to use `pathlib.PurePath` as it's got clean interfaces to parents etc
- Delete notes
    - Remove links
- Managing duplicate notes (created by accident, let's say)
    - Merge notes into one
        - Pick one as survivor
        - put contents of the deprecated note into the survivor
        - add tags of the deprecated note
        - find all links to the deprecated note and change them to the survivor (will require overwriting contents of notes!)
        - Delete the deprecated note
- Check for duplicate title when making a new note!
- Enable quickly creating an empty note as a link in the current note, for when you know a topic/idea deserves a note that will be filled up over time or in the future: a link is laid in the current note and the new note is established for the future.
    - Works together with checking for duplicate notes (see above)!
- Change note title
    - Because of the binding between note files and notes in the database, this may not be trivial
        - `update_note()` works by updating all fields for the record with the provided note id. So, for this function, relying on a constant note_id is viable, so long as a valid file path is provided with the current version of the note, even with a changed title.
        - The CLI `update` relies on the note_id to retrieve the title of the note, from which the path of the file is inferred. This is quick and convenient. But to change the title of a note, a distinct function will be required:
            - takes note_id (fuzzy too)
            - obtains the full note_id from the database along with the old title
            - searches for the note file with matching note_id
                - if it can't be found, error
            - performs the update using the path of the actual note file found in the search
        - Update `cli_update` to search for matching notes (by note_id) when the file_path title has changed
        - When using the package and not the CLI (for the sublime plugin), clean the process up by updating directly with the file's current path and name using the `update_note()` function, checking if the id exists etc.
        - Updating a note, which is done by starting with the file, will hit an error if the title has been changed (even though the id is unchanged).
- Delete Tags
    - Only if not used in any notes!
- add a description column to the `tags` table as a helpful reminder/reinforcer of the meaning/purpose of tags
- add to references table
- create assets and asset links tables
- Look into adding indices for columns likely to be used heavily in queries
    - Eg, parent and child note ids in `note_links`.
    - See sqlite docs on query planning and indices
    - Wherever `UNIQUE` constraints are applied, an index is effectively created (see docs on create table).
        - So ... only the `note_tags` table would need an index ... or a unique constraint on both columns
- Add a unique constraint or an index on `note_tags`
    - Makes sense to ensure uniqueness here
    - Should create an index that should make queries faster
    - Perhaps unnecessary
        - Using `explain query plan` shows lines like `SEARCH TABLE note_tags AS c USING AUTOMATIC COVERING INDEX (tag_id=? AND note_id=?)`, indicating that an index is created automatically already. This is probably through the foreign key constraint that references the id columns of the `notes` and `tags` tables, both of which are primary key columns (and so should be indexed)
            - (Though note: per the sqlite query optimizer docs, an "automatic" index is a transient one built for a single query precisely because no suitable persistent index exists, which arguably strengthens the case for creating one.)
        - Though see the sqlite docs on foreign keys, where it is suggested an index on the child columns is still useful for when certain actions such as delete occur.
- Create CLI
    - Generally a CRUD interface on notes and tags + querying
    - Add sql command for straight sql
    - add custom query command for custom shorthand queries
    - Add update-specific-note/file command
    - Add command that starts a python REPL and imports the zekell module ready for interactive usage of the module itself
        - Can maybe just run `python -i` on the module itself?
        - Or maybe some code that also sets up a connection to the database
        - Same for the sqlite shell?
- Wrap functionality around note staging (mostly CLI)
    - Stage a note when opened through the CLI so that it can be readily updated once edited by an external app
    - List all staged notes
    - Batch update all staged notes
        - Check which ones have actually been updated?? (too much effort really)
    - mainly for editing notes purely through the CLI
- Can update the `full tag path` triggers to use a single CTE?
    - write the CTE once, and refer back to it in each trigger!!?
- Create query functions/API
    - Create a super master query that automatically joins multiple optional queries together
        - Maybe use multiple CTEs for each component, then join them all together at the end on a commonly named `note_ids` column (??)
        - In combining the components of this query, there will be active and passive filtering
            - Eg: all notes that are children of note X, or all notes that are children of the notes from the previous component ... how to string all of this together? The best approach is probably to put together various specific queries that are obviously useful, to get a feel for how it can all be put together
            - Well, passive filtering is done just by joining (kinda easy). Ordinarily, active filtering would involve a `WHERE` within a subquery or CTE.
- Allow for simple children-of and parents-of passive components
- Allow for more flexible and powerful text search in `title` and `body` queries:
    - Use `like` and `glob` on plain columns (not FTS) (see documentation)
        - Would have to decide on how this is conveyed in the query text and then combined with other queries
            - Probably just require the user to provide 'like' or 'glob' and understand what that means
                - `like`: `%` and `_` allowed
                - `glob`: `*` (instead of `%`), and other `unix` symbols? (`?`, `[]`)?
            - script can scan for these keywords, and if present, use them instead of the FTS table.
    - Provide an interface to use the `bm25()` function to rank search matches (see documentation)
        - Maybe have it on by default? Use a keyword at the end of the search?
        - Maybe snippet and highlight are useful too? ... to show the reason why a particular hit shows as a match ... provide some context etc.
            - Does this get into fuzzy search territory where a bad match is returned nonetheless?
    - Use (?) the experimental trigram tokeniser (see documentation, released `2020-12`: release notes)
        - enables essentially close-to-fuzzy search that should be performant for any substring of three characters or more, with `glob`/`like` functionality too!
- Relatedly, modify the `title` and `body` components to always be one CTE
    - At the moment, they are separate and are joined.
        - for some reason, this is a sub-optimal join, performing worse than other queries by orders of magnitude.
        - But if both `where` statements are put into a single CTE, performance is excellent again.
    - Have some code that looks for both `title` and `body` and constructs a single CTE
        - Combining them requires only a single `where` statement: `where title match "alice" and body match "queen"`
        - more fully ... see below ...
```sql
with
title_notes(note_id) as (
    select rowid from notes_fts
    where title match "alice" and body match "queen"
)
select z.id, z.title
from title_notes a
inner join notes z on a.note_id = z.id
```
- Allow for OR between components (using `outer join` rather than `inner join`)
    - sqlite doesn't support full outer joins. They can be hacked together with a union of mirroring left joins ... but would it be worth it?
- Allow for complex tag queries
    - boolean operations on multiple tags
    - Perhaps the toxi scheme is not appropriate for this? It's possible, but perhaps too cumbersome, especially if trying to integrate it with other query components.
    - Maybe just using FTS, with a single `tags` column, with space separated tags and slashes (`/`) for hierarchy, would work better for querying?
        - Harder to query what all the tags in the database are, then ... so maybe both?
        - To remove the slash (`/`) from token separators, see docs on unicode tokenizer options, where characters can be added to the set of token and separator characters. With this, whitespace would be used to separate tags and slashes would be retained as part of a single token.
        - Boolean queries would become simpler, using `sql` booleans on `like "%TAG%"` statements (see examples in this blog post)
- Including tag children in a query (ie, tag `software` includes `software/theory`)
    - Probably best as a separate action from direct tag search ... simple recursive?
        - first step is to find all tags that are children of the specified tag
        - Then, a simple `where tag_id in (...)` should do it.
```sql
-- All children of tag with parent_id ?
with recursive all_tag_children(id, tag, parent_id) as (
    -- Optionally including the initial tag too
    select id, tag, parent_id from tags
    where id = ?
    union
    select id, tag, parent_id from tags
    where parent_id = ?
    union
    select a.id id, a.tag tag, a.parent_id parent_id
    from tags a
    inner join all_tag_children b on a.parent_id = b.id
),
tag_children_notes(note_id) as (
    select note_id from note_tags
    where tag_id in (
        select id from all_tag_children
    )
)
-- End of CTE
-- selection in the course of full query
select note_id from tag_children_notes
```
- Add follow-through children (and parent) query commands that select all immediate children of the previously selected notes
    - Same for parents (?)
    - Add a distinct or group by to eliminate duplicates in children or parents
        - from inspection of query plans (with `explain query plan`), distinct and group by make no difference to the plan. Quick timing tests show no observable difference in timing either.
```sql
-- For example
-- Previous selection ... simple title FTS
with title_notes(note_id) as (
    select rowid from notes_fts where title match "alice"
),
-- Actual children notes CTE
-- note that both parent and child note ids are required
-- here child_note_id is aliased to note_id as it will continue along in the chain of joins
children_notes(parent_note_id, note_id) as (
    select parent_note_id, child_note_id
    from note_links
)
-- selecting and joining
select z.id
from title_notes a
-- This is different from the usual pattern, where the join MUST be on the parent_note_id column
-- whilst the child_note_id continues along as note_id
-- to implement programmatically will require some logic to alter the usual pattern for
-- these "passive" "follow-through" CTEs
inner join children_notes b
on a.note_id = b.parent_note_id
inner join notes z
on b.note_id = z.id
```
- Querying all children (at all levels) of a note (or even a set of notes)
    - Need a recursive CTE
        - Can limit the number of levels? Could be useful to limit to 5 levels or so, so that the number of notes doesn't explode
        - How to handle cycles?
            - Not sure the number of cycles can be limited, only the total number of rows added, through an ordinary `LIMIT` statement
```sql
-- Note the limit of 100 in the final line
-- This is arbitrary, but a good safety limit makes sense (maybe 1000?)
-- Also note that this requires the seed parent_note_id
-- Could use a like %XXXX fuzzy search for the note_id ?
with recursive all_children(parent_id, note_id) as (
    select parent_note_id, child_note_id from note_links
    where parent_note_id = ?
    union
    select parent_note_id, child_note_id
    from note_links a
    inner join all_children b
    on b.note_id = a.parent_note_id
    limit 100
)
select note_id from all_children
```
- ensure the custom query interface is protected from sql injection attacks
- Run performance tests for various numbers of notes
    - 1000, 5000, 10_000, 50_000, 100_000
    - Use an instance on AWS
    - need code for:
        - making dummy notes (like from alice in wonderland)
        - running and timing queries
            - Probably just use the `zekell` CLI with the `q` search machinery.
        - Setting up and tearing down notes
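A rough sketch of the dummy-note and timing code (filler text stands in for the Alice in Wonderland source; names are illustrative only):

```python
import sqlite3
import time

def time_query(con, sql, params=(), n=100):
    """Average wall-clock time of a query over n runs."""
    start = time.perf_counter()
    for _ in range(n):
        con.execute(sql, params).fetchall()
    return (time.perf_counter() - start) / n

# set up dummy notes in a throwaway database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
con.executemany(
    "INSERT INTO notes(title, body) VALUES (?, ?)",
    ((f"note {i}", "down the rabbit hole " * 20) for i in range(1000)))

avg = time_query(con, "SELECT id FROM notes WHERE body LIKE ?", ("%rabbit%",))
print(avg > 0)  # True
```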
- Run performance tests on a subset of wikipedia after scraping (maybe scrape n notes from philosophy) ... maybe just take the introductions of each page only.
    - Wikipedia's database is available for download here
    - Simple English wikipedia, available from here, might be a good bet ... the whole wiki database is approx ~200MB. This also appears to be `XML` format, so no need for `mysql`
- Allow for fuzzy search over note ids??
    - Idea is to be like git SHAs ... allow using the first 6 digits of a note id to search for a note
        - 6 digits is pretty easy for short term memory ... essentially Hour:Minute:Second, and will most likely be unique or close to it
            - 12 hours (working time) * 60 minutes * 60 seconds ~ 40_000
    - Probably create another FTS column for the row id?
        - Actually, this works already with basic `like %TOKEN%` syntax!!:

```sql
select id from notes where id like "%83006%" -- returns (20211031083006)
```
- Add a run_log for all tables for version control and viable sync options
- Have a running `last_mod_time` for the whole database
    - update with triggers or use python? (probably triggers!)
    - Could maybe just use the run_log? Probably not efficient!
    - The general philosophy with this database is to make reading fast and writing slow (indices, unique constraints, triggers, etc).
- Create a web API interface
    - could become the primary means of interaction
        - run always in the background
        - allow for WAL and multiple connections, which requires another pragma (see documentation)
    - requires either hacking the stdlib (too much) or using bottle (minimal)
- Create a frontend (browser based)!!
    - Need a markdown system
        - eg, some markdown editors:
- Change modified time column to float (?)
    - Idea is to enable easier sorting in the database?
    - but sqlite seems quite happy to sort text. In fact, it seems faster compared to sorting floats, which makes some sense as floating point might introduce some problems. ... Not done, left as text!!
- sublime plugin?
    - General query interface:
        - Use plain text input (simple, but prone to errors) ... or ...
        - Use List Inputs in combination with `next_inputs`
            - Initial selection is the available keys (`tag`, `title` etc)
            - On selection, another list input (for `tag` for instance, where there are limited options), or a general text input.
                - For `id`, a follow-up list input might make sense, which returns the list of possible options if there is more than one (or even when there is only one, just to confirm)
- Track web-links separately
    - Separate table: `ID | title | URL | note_id (foreign key)`
    - Parse notes for `markdown` links, taking the `[this is a link]` text as the title.
    - Maybe ... full text search on the title and URL
        - where the URL would probably require a custom tokeniser (separating by slashes and/or dots)
    - WHY?
        - Search for notes based on containing a specific URL (`stackoverflow`, `PLOSone`, eg)
        - Retrieve all links that are children of certain notes
            - more complex, as have to scan all leaves
            - EG, all links under `python`.
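Parsing for web links can reuse the markdown link pattern while requiring an `http(s)` URL, so zettel links (`/<id>`) are excluded (a sketch; the exact pattern would need hardening against edge cases):

```python
import re

# markdown links whose target is an http(s) URL, ie external web links
WEB_LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)]+)\)")

body = (
    "See [SQLite FTS5](https://www.sqlite.org/fts5.html) "
    "and [a note](/20201016154437)."
)
print(WEB_LINK.findall(body))
# [('SQLite FTS5', 'https://www.sqlite.org/fts5.html')]
```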
- Create a terminal browser/viewer with `Rich`/`Textual`
- Web server, multi-user, one `SQLite` db file per user, backed up to `S3` with `Litestream`?