Adding durability for ClojureScript #14
logseq-cldwalker wants to merge 28 commits into tonsky:master from
Conversation
fix unit tests
Also remove unnecessary changes to project.clj and .gitignore
Yes! I initially skipped the CLJS implementation because I thought localStorage was too small to be useful, and IndexedDB is async. What are you guys using for storage?
Cool. We're using SQLite via OPFS. Of the OPFS approaches that SQLite offers, we chose the OPFS pool approach because it offers a sync handle. The async approach proved too difficult, and we're unsure whether it's doable.
Please let me know when it’s ready for review
Sure. This is ready for review
@tonsky Sorry. There was actually still another bug with deletion. Since I'm going to be away on holiday vacation, I'm putting this back into draft until I get back the first week of next year. Any low-effort feedback would be welcome. Cheers
Sure. Will give it a look |
tonsky left a comment:
Overall looks great, but I have a couple of notes
(defprotocol IStorage
  (restore [this address])
  (accessed [this address])
  (store [this node address])
I’m not sure what address does here? The Clojure version only takes node, and the storage chooses and returns the address, so that it can be compatible with e.g. auto-increment. I think we should keep it consistent between versions
The main idea here is to reuse the existing address, to reduce storage usage and get rid of GC if possible.
We observed a 0.1 MB increase in db.sqlite when editing one block in Logseq.
I agree that we should keep it consistent between versions; maybe someone can implement the same idea in the Clojure version, or I can remove both address and _dirty.
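For illustration, the two interface shapes under discussion can be sketched in Java. These are hypothetical names for the sake of the example, not the actual library API: the point is only the difference between the storage allocating addresses versus the caller supplying (and reusing) them.

```java
import java.util.concurrent.atomic.AtomicLong;

// Clojure-version shape: the storage chooses and returns the address,
// which keeps it compatible with e.g. auto-increment backends.
interface StorageAllocates<Node> {
    long store(Node node);       // storage picks the address
    Node restore(long address);
}

// This PR's CLJS shape: the caller passes the address in, so an existing
// address can be reused instead of accumulating garbage for GC.
interface CallerSuppliesAddress<Node> {
    void store(Node node, long address);
    Node restore(long address);
}

// A minimal auto-increment allocator a StorageAllocates impl might use.
class AddressCounter {
    private final AtomicLong next = new AtomicLong(0);

    long allocate() {
        return next.getAndIncrement();
    }
}
```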
Wait, you can’t reuse addresses, though? Because a user can keep a reference to an old version of the database that references the old version of the block. If you rewrite it, the old db ref will stop working.
There’s actually a whole deal about this in the Java version: you kind of need to know all alive references to run GC without breaking any of them. That was the initial promise of Datomic and DataScript: dbs are immutable, so if you keep a reference to them they’ll keep working
I'm sorry for not getting back to you sooner.
What I did is keep using any existing address instead of generating a new address for each branch node and leaf. It seems that the Clojure version of DataScript has to remove all the unused addresses during GC here; those unused addresses increase storage usage unless they're garbage collected. We aim to delete unused addresses immediately instead of deferring to GC, to reduce the storage overhead.
(deftype Leaf [keys ^:mutable _address ^:mutable _dirty]
  IStore
  (store-aux [this storage]
    (if (or _dirty (nil? _address))
What is the idea behind the _dirty flag here? In the Clojure version, if a node has an address, it is persisted; if the address is nil, it is not persisted (== dirty?). Do we really need two separate flags for this?
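The two conventions being compared can be sketched as follows. This is an illustrative Java sketch with hypothetical class names, not the actual node implementations: in the Clojure version the address doubles as the persistence flag, while this PR keeps the old address around for reuse and so needs a separate dirty bit.

```java
// Clojure-version convention: address == null is the only "dirty" signal.
class NodeA {
    Long address; // null => not yet persisted

    boolean needsStore() {
        return address == null;
    }
}

// This PR's convention: a node can keep its old address (so it can be
// reused on the next store) while still being marked dirty after an
// in-place modification, hence two separate fields.
class NodeB {
    Long address;
    boolean dirty;

    boolean needsStore() {
        return dirty || address == null;
    }
}
```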
@logseq-cldwalker What are your main insights into asynchronous support? @pkpkpk and I have also started working on an async version of the persistent-sorted-set. I think both async and sync execution models have benefits and drawbacks.
@tonsky Thanks for looking into this PR and all the beautiful work for the Clojure community!
@whilo Hey! It'll be nice to have async support so that people can choose to store the data in IndexedDB.
@tiensonqin thanks to you for your consistent support and for such a significant contribution in such a tricky part of the system!
@tiensonqin Thank you for laying that out; that makes sense. A solution I originally developed for replikativ with the hitchhiker-tree, and which we are currently revisiting (replikativ/datahike#429), is to stream tree fragment deltas incrementally and then update the db root after everything is in store/storage. That way you can realize your synchronous DataScript scenario: you just need to transact first into a storage system somewhere and then react to its confirmation/updates. I think this would be a simple and nice model for synchronizing Logseq, but I don't know whether you want to treat the markdown files or the DataScript db as the primary source of truth.
@tonsky Are there things you're waiting for from us? I think Tienson addressed most of the feedback
Oops, sorry, no. I’ll take a look
Okay, I think I finally understand what you are doing. Sorry it took so long to catch up. I like the idea! It can’t be the default mode, though; the default should be a normal persistent data structure where you can keep references to old copies. But as an option, I’d like to have it too. I can imagine an app that only ever keeps one reference to the latest DB. If that eliminates GC, I see how it is beneficial. So let’s say we want to get rid of GC entirely. Right now you have two behaviours: some addresses get erased immediately, some on the next store. I propose we move it all to the next store.
So it’s not exactly address reuse; it’s more like freeing addresses as we go and allocating new ones. If a storage doesn’t want to clean up freed addresses immediately, it can make the implementation a no-op. Would that work for you?
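This proposal can be sketched as an interface with no-op defaults. The method names markFreed/deleteFreed/delete follow the Java commits later in this thread; everything else (generics, long addresses, the store/restore signatures) is illustrative, not the library's actual API.

```java
import java.util.Collection;

// Sketch of the proposed contract: the set reports freed addresses to the
// storage, and a storage that does not want immediate cleanup simply keeps
// the default no-op implementations.
interface IStorage<Node> {
    Node restore(long address);

    long store(Node node);

    // Called when a node's old address becomes unreachable.
    default void markFreed(long address) {}

    // Called at the end of a batch operation to reclaim marked addresses.
    default void deleteFreed() {}

    // Batch deletion hook; also a no-op by default.
    default void delete(Collection<Long> addresses) {}
}
```

A storage that implements only restore/store keeps the old behaviour unchanged, since all three cleanup hooks default to doing nothing.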
Tienson is out right now for Chinese New Year. Hopefully he can respond soon after he gets back. In the meantime, we've been able to use this PR successfully on databases of up to 9.3M datoms, which translates to ~1.4 GB on disk
Awesome!
* Add auto-removal interface for obsolete addresses

  Add three new methods to IStorage interface (all with default no-op):
  - delete(Collection<Address>) - batch delete addresses from storage
  - markFreed(Address) - mark an address as obsolete during modifications
  - deleteFreed() - delete all marked addresses at end of batch operation

  This enables storage implementations to track and reclaim storage space during large batch operations, preventing garbage accumulation. Inspired by tonsky#14 (CLJS durability PR).

* Mark obsolete addresses during tree modifications

  Call storage.markFreed() when nodes are replaced during cons, disjoin, and replace operations. This allows storage implementations to track obsolete addresses for later deletion, preventing garbage accumulation during batch operations.
  - Branch.java: mark child addresses when replaced in add/remove/replace
  - PersistentSortedSet.java: mark root address when tree structure changes
  - Tests: add automatic marking tests for conj and disj operations

* Fix auto-removal in editable (transient) mode

  Two bugs were causing addresses to not be marked as freed in transient mode:
  1. add() editable case: no markFreed call at all when replacing a child
  2. replace() editable case: markFreed was called AFTER child(), which had already nullified the address, so the check always failed

  Both fixes: mark the old address BEFORE calling child(idx, node), since child() internally sets _addresses[idx] = null.

  This fixes "Node not found" errors during batch indexing, where addresses were incorrectly deleted because they weren't marked as freed during transient tree modifications.

* Track freed addresses for optional immediate gc

* Fix markFreed to work in both persistent and transient modes

  This completes the auto-removal implementation for both CLJ and CLJS.

  Java (PersistentSortedSet.java):
  - Move markFreed calls BEFORE editable checks in cons/disjoin/replace; this enables marking in both persistent and transient modes
  - Pattern: check that storage and address exist, then call markFreed

  CLJS (btset.cljs):
  - Add markFreed calls to $conjoin, $disjoin, $replace, mirroring the Java implementation
  - Add markFreed to IStorage protocol (storage.cljs)

  Tests:
  - Fix auto_removal.clj: convert delete/deleteFreed from defrecord methods to standalone functions (Java interfaces don't allow extras)
  - Add auto_removal.cljs with matching test coverage
  - Add debug output to both CLJ and CLJS tests
  - All 8 core tests pass in CLJ (26 assertions)

  Test infrastructure:
  - Add generative.cljc: cross-platform generative tests
  - Add structural_invariants.cljc: tree property verification
  - Add ref_stress.clj: SoftReference/WeakReference eviction tests

  Ignore .cljs_node_repl/ build artifacts

* Add test.check
* Factor out conjAll tests, unify marking protocols
* Factor out ref test
* Complete cljs mark free implementation
* Add missing protocols
* Remove debug instrumentation
* Update README
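As a minimal sketch of the markFreed/deleteFreed lifecycle described in the commit messages above, here is a hypothetical in-memory storage in Java. Class and field names are illustrative, not the library's actual implementation: the point is that freed addresses are only collected, then reclaimed in one batch at deleteFreed().

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory storage that tracks freed addresses and reclaims them at the
// end of a batch operation, instead of deferring to a later GC pass.
class InMemoryStorage {
    private final Map<Long, String> blobs = new HashMap<>();
    private final Set<Long> freed = new HashSet<>();
    private long nextAddress = 0;

    long store(String node) {
        long address = nextAddress++;
        blobs.put(address, node);
        return address;
    }

    String restore(long address) {
        return blobs.get(address);
    }

    // Mark an address as obsolete during a tree modification.
    void markFreed(long address) {
        freed.add(address);
    }

    // Delete all marked addresses at the end of a batch operation.
    void deleteFreed() {
        blobs.keySet().removeAll(freed); // removes the backing map entries
        freed.clear();
    }

    int size() {
        return blobs.size();
    }
}
```

Until deleteFreed() runs, a freed node is still restorable, which is what makes it safe to mark addresses eagerly during a batch and reclaim them only once the new root is in place.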
Hi @tonsky. Thanks for this handy library! This is a conversational PR to see if you'd be interested in the ClojureScript durability that @tiensonqin added as part of our datascript.storage cljs implementation. We are using this forked datascript in our product's feature branch, and it is working well for our needs. The existing cljs tests in this repo are passing, as are the relevant upstream datascript.storage cljs tests. I'm aware there is some minor whitespace and commenting that needs to be cleaned up here, as well as cljs storage tests that need to be ported. I could help with those, but for any design questions I'd defer to @tiensonqin. Would you be interested in a contribution like this?