Adding durability for ClojureScript #14
logseq-cldwalker wants to merge 28 commits into tonsky:master from
Conversation
fix unit tests
Also remove unnecessary changes to project.clj and .gitignore
Yes! I initially skipped the CLJS implementation because I thought localStorage was too small to be useful, and IndexedDB is async. What are you guys using for storage?
Cool. We're using SQLite via OPFS. Of the OPFS approaches that SQLite offers, we chose the OPFS pool approach because it offers a sync handle. The async approach proved too difficult, and we're unsure whether it's doable.
Please let me know when it’s ready for review
Sure. This is ready for review
@tonsky Sorry. There was actually still another bug with deletion. Since I'm going to be away on holiday vacation, I'm putting this back into draft until I get back the first week of next year. Any low-effort feedback would be welcome. Cheers
Sure. Will give it a look |
tonsky left a comment:
Overall looks great, but I have a couple of notes
(defprotocol IStorage
  (restore [this address])
  (accessed [this address])
  (store [this node address])
I’m not sure what address does here? The Clojure version only takes node, and the storage chooses and returns the address, so that it can be compatible with e.g. auto-increment. I think we should keep it consistent between versions
The main idea here is to reuse the existing address, to reduce storage usage and get rid of GC if possible.
We observed a 0.1 MB increase in db.sqlite when editing one block in Logseq.
I agree that we should keep it consistent between versions; maybe someone can implement the same idea in the Clojure version, or I can remove both address and _dirty.
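For illustration, the two interface shapes under discussion can be sketched in Java. These are hypothetical names for the sake of the example, not the actual library API: the point is only the difference between the storage allocating addresses versus the caller supplying (and reusing) them.

```java
import java.util.concurrent.atomic.AtomicLong;

// Clojure-version shape: the storage chooses and returns the address,
// which keeps it compatible with e.g. auto-increment backends.
interface StorageAllocates<Node> {
    long store(Node node);       // storage picks the address
    Node restore(long address);
}

// This PR's CLJS shape: the caller passes the address in, so an existing
// address can be reused instead of accumulating garbage for GC.
interface CallerSuppliesAddress<Node> {
    void store(Node node, long address);
    Node restore(long address);
}

// A minimal auto-increment allocator a StorageAllocates impl might use.
class AddressCounter {
    private final AtomicLong next = new AtomicLong(0);

    long allocate() {
        return next.getAndIncrement();
    }
}
```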
Wait, you can’t reuse addresses, though? Because a user can keep a reference to an old version of the database that references the old version of the block. If you rewrite it, the old db ref will stop working.
There’s actually a whole deal about this in the Java version: you kind of need to know all alive references to run GC without breaking any of them. That was the initial promise of Datomic and DataScript: dbs are immutable, so if you keep a reference to them they’ll keep working
I'm sorry for not getting back to you sooner.
What I did is keep using any existing address instead of generating a new address for each branch node and leaf. It seems that the Clojure version of DataScript has to remove all the unused addresses during GC here; those unused addresses increase storage usage unless they're garbage collected. We aim to delete unused addresses immediately instead of deferring to GC, to reduce the storage overhead.
(deftype Leaf [keys ^:mutable _address ^:mutable _dirty]
  IStore
  (store-aux [this storage]
    (if (or _dirty (nil? _address))
What is the idea behind the _dirty flag here? In the Clojure version, if a node has an address, it is persisted; if the address is nil, it is not persisted (== dirty?). Do we really need two separate flags for this?
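The two conventions being compared can be sketched as follows. This is an illustrative Java sketch with hypothetical class names, not the actual node implementations: in the Clojure version the address doubles as the persistence flag, while this PR keeps the old address around for reuse and so needs a separate dirty bit.

```java
// Clojure-version convention: address == null is the only "dirty" signal.
class NodeA {
    Long address; // null => not yet persisted

    boolean needsStore() {
        return address == null;
    }
}

// This PR's convention: a node can keep its old address (so it can be
// reused on the next store) while still being marked dirty after an
// in-place modification, hence two separate fields.
class NodeB {
    Long address;
    boolean dirty;

    boolean needsStore() {
        return dirty || address == null;
    }
}
```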
@logseq-cldwalker What are your main insights into asynchronous support? @pkpkpk and I have also started working on an async version of the persistent-sorted-set. I think both async and sync execution models have benefits and drawbacks.
@tonsky Thanks for looking into this PR and all the beautiful work for the Clojure community!
@whilo Hey! It'll be nice to have async support so that people can choose to store the data in IndexedDB.
@tiensonqin thanks to you for your consistent support and for such a significant contribution in such a tricky part of the system!
@tiensonqin Thank you for laying that out; that makes sense. A solution I originally developed for replikativ with the hitchhiker-tree, and which we are currently revisiting (replikativ/datahike#429), is to stream tree fragment deltas incrementally and then update the db root after everything is in store/storage. That way you can realize your synchronous DataScript scenario: you just need to transact first into a storage system somewhere and then react to its confirmation/updates. I think this would be a simple and nice model for synchronizing Logseq, but I don't know whether you want to treat the markdown files or the DataScript db as the primary source of truth.
@tonsky Are there things you're waiting for from us? I think Tienson addressed most of the feedback
Oops, sorry, no. I’ll take a look
Okay, I think I finally understand what you are doing. Sorry it took so long to catch up. I like the idea! It can’t be the default mode, though; the default should be a normal persistent data structure where you can keep references to old copies. But as an option, I’d like to have it too. I can imagine an app that only ever keeps one reference to the latest DB. If that eliminates GC, I see how it is beneficial. So let’s say we want to get rid of GC entirely. Right now you have two behaviours: some addresses get erased immediately, some on the next store. I propose we move it all to the next store.
So it’s not exactly address reuse; it’s more like freeing addresses as we go and allocating new ones. If a storage doesn’t want to clean up freed addresses immediately, it can make the implementation a no-op. Would that work for you?
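This proposal can be sketched as an interface with no-op defaults. The method names markFreed/deleteFreed/delete follow the Java commits later in this thread; everything else (generics, long addresses, the store/restore signatures) is illustrative, not the library's actual API.

```java
import java.util.Collection;

// Sketch of the proposed contract: the set reports freed addresses to the
// storage, and a storage that does not want immediate cleanup simply keeps
// the default no-op implementations.
interface IStorage<Node> {
    Node restore(long address);

    long store(Node node);

    // Called when a node's old address becomes unreachable.
    default void markFreed(long address) {}

    // Called at the end of a batch operation to reclaim marked addresses.
    default void deleteFreed() {}

    // Batch deletion hook; also a no-op by default.
    default void delete(Collection<Long> addresses) {}
}
```

A storage that implements only restore/store keeps the old behaviour unchanged, since all three cleanup hooks default to doing nothing.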
Tienson is out right now for Chinese New Year. Hopefully he can respond soon after he gets back. In the meantime, we've been able to use this PR successfully on databases of up to 9.3M datoms, which translates to ~1.4 GB on disk
Awesome!
* Add auto-removal interface for obsolete addresses

  Add three new methods to IStorage interface (all with default no-op):
  - delete(Collection<Address>) - batch delete addresses from storage
  - markFreed(Address) - mark an address as obsolete during modifications
  - deleteFreed() - delete all marked addresses at end of batch operation

  This enables storage implementations to track and reclaim storage space during large batch operations, preventing garbage accumulation. Inspired by tonsky#14 (CLJS durability PR).

* Mark obsolete addresses during tree modifications

  Call storage.markFreed() when nodes are replaced during cons, disjoin, and replace operations. This allows storage implementations to track obsolete addresses for later deletion, preventing garbage accumulation during batch operations.
  - Branch.java: mark child addresses when replaced in add/remove/replace
  - PersistentSortedSet.java: mark root address when tree structure changes
  - Tests: add automatic marking tests for conj and disj operations

* Fix auto-removal in editable (transient) mode

  Two bugs were causing addresses to not be marked as freed in transient mode:
  1. add() editable case: no markFreed call at all when replacing a child
  2. replace() editable case: markFreed was called AFTER child(), which had already nullified the address, so the check always failed

  Both fixes: mark the old address BEFORE calling child(idx, node), since child() internally sets _addresses[idx] = null.

  This fixes "Node not found" errors during batch indexing, where addresses were incorrectly deleted because they weren't marked as freed during transient tree modifications.

* Track freed addresses for optional immediate gc

* Fix markFreed to work in both persistent and transient modes

  This completes the auto-removal implementation for both CLJ and CLJS.

  Java (PersistentSortedSet.java):
  - Move markFreed calls BEFORE editable checks in cons/disjoin/replace; this enables marking in both persistent and transient modes
  - Pattern: check that storage and address exist, then call markFreed

  CLJS (btset.cljs):
  - Add markFreed calls to $conjoin, $disjoin, $replace, mirroring the Java implementation
  - Add markFreed to IStorage protocol (storage.cljs)

  Tests:
  - Fix auto_removal.clj: convert delete/deleteFreed from defrecord methods to standalone functions (Java interfaces don't allow extras)
  - Add auto_removal.cljs with matching test coverage
  - Add debug output to both CLJ and CLJS tests
  - All 8 core tests pass in CLJ (26 assertions)

  Test infrastructure:
  - Add generative.cljc: cross-platform generative tests
  - Add structural_invariants.cljc: tree property verification
  - Add ref_stress.clj: SoftReference/WeakReference eviction tests

  Ignore .cljs_node_repl/ build artifacts

* Add test.check
* Factor out conjAll tests, unify marking protocols
* Factor out ref test
* Complete cljs mark free implementation
* Add missing protocols
* Remove debug instrumentation
* Update README
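As a minimal sketch of the markFreed/deleteFreed lifecycle described in the commit messages above, here is a hypothetical in-memory storage in Java. Class and field names are illustrative, not the library's actual implementation: the point is that freed addresses are only collected, then reclaimed in one batch at deleteFreed().

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// In-memory storage that tracks freed addresses and reclaims them at the
// end of a batch operation, instead of deferring to a later GC pass.
class InMemoryStorage {
    private final Map<Long, String> blobs = new HashMap<>();
    private final Set<Long> freed = new HashSet<>();
    private long nextAddress = 0;

    long store(String node) {
        long address = nextAddress++;
        blobs.put(address, node);
        return address;
    }

    String restore(long address) {
        return blobs.get(address);
    }

    // Mark an address as obsolete during a tree modification.
    void markFreed(long address) {
        freed.add(address);
    }

    // Delete all marked addresses at the end of a batch operation.
    void deleteFreed() {
        blobs.keySet().removeAll(freed); // removes the backing map entries
        freed.clear();
    }

    int size() {
        return blobs.size();
    }
}
```

Until deleteFreed() runs, a freed node is still restorable, which is what makes it safe to mark addresses eagerly during a batch and reclaim them only once the new root is in place.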
Hi @tonsky. Thanks for this handy library! This is a conversational PR to see if you'd be interested in the ClojureScript durability that @tiensonqin added as part of our datascript.storage cljs implementation. We are using this forked datascript in our product's feature branch, and it is working well for our needs. The existing cljs tests in this repo are passing, as are the relevant upstream datascript.storage cljs tests. I'm aware there is some minor whitespace and commenting that needs to be cleaned up here, as well as cljs storage tests that need to be ported. I could help with those, but for any design questions I'd defer to @tiensonqin. Would you be interested in a contribution like this?