Skip to content

Conversation

@tompng
Copy link
Member

@tompng tompng commented Jan 10, 2026

RDoc::Markup::AttributeManager

RDoc parses block-level structure in RDoc::Markup::Parser and parses inline styling in RDoc::Markup::AttributeManager.
RDoc::Markup::AttributeManager is a string-replacing/macro based parser.
It's very complicated, and its mechanism are the root cause of many bugs. We need to remove and replace it.

It converts to a flow/stream of string and styling changing operation.
Unfortunately, inline styles are structured data, output format (HTML) is also a structured data. There's an architecture mismatch.

Tidylink {label}[url] should be a syntax rule, but it is currently handled in a
regexp-based macro called regexp_handling.
Of course parsing nested structure such as styled tidylink label frequently fails. It's uncontrollable.

Solution

  • Eliminate RDoc::Markup::AttributeManager
  • Create a parser that generates structured data
  • Traverse structured data to generate output instead of string replacing
  • Use controllable regexp-handling macro: only apply to text nodes

New inline styling syntax

Tokens

  • Word pairs +word+ *word* _word_ `word` and so on
  • Standalone tags <br>, <tt>code_text</tt>, <code>code_text</code>
  • Open and close tags <i> </i>, <b> </b>, <em> </em>, <s> </s>, <del> </del>
  • Tidy link opening { and closing }[url_part]
  • Simplified tidylink word[url_part]
  • Text nodes

Regexp handling macro

Matching with CROSSREF, RDOCREF, HYPERLINK regexp should be applied only to text nodes after parsing phase.
Parsed tree modication instead of text-node gsub is also another option to implement this.

Error recovery

RDoc format doesn't have syntax error. We need to make the behavior as similar before.

  • Closing tags and braces will invalidate unclosed tags, and unclosed tags are treated as plain text
    • <a><b><c></a> will be <a>&lt;b&gt;&lt;c&gt;</a>
    • {<a><b>}[url] will be a tidylink with label &lt;a&gt;&lt;b&gt;
  • Unmatched closing tags and braces will be treated as plain text
    • <a></b>}</a> will be <a>&lt;/b&gt;}</a>
  • Tidylink inside tidylink will invalidate outside tidylinks
    • {{inner}[url]}[<b>]</b> will be "{" + inner_tidylink + "}[" + bold("]")

Simplified tidylink

RDoc was converting a*_`+<b>c[foo] to <a href="foo">a*_`+&lt;b&gt;c</a>. This is terrible, it can't coexist with other syntaxes like *word* _word_ +word+.
We should restrict characters and recommend {label}[url].
In this pull request, only Alphanumeric[url] (should start with alphabet) is supported .

Diff

Running rdoc in ruby/ruby, there were 48 HTML files that had changes (with commit hash:763f379).
There are too many diffs, but I think most of them are acceptable.

Some diff

List of all files with diffs:

Array.html	ERB.html	Encoding/Converter.html	Hash.html
IO.html	Kernel.html	Net/HTTP.html	Net/HTTPHeader.html
OpenSSL/Cipher.html	OpenSSL/SSL/SSLSocket.html	OpenSSL/X509/Name.html	OpenSSL.html
PrettyPrint.html	Random/Formatter.html	Regexp.html	Socket.html
String.html	StringScanner.html	Struct.html	SyntaxSuggest/CodeLine.html
Time.html	TracePoint.html	__/ruby/LEGAL.html	__/ruby/doc/NEWS/NEWS-1_9_2.html
__/ruby/doc/NEWS/NEWS-2_2_0.html	__/ruby/doc/NEWS/NEWS-2_4_0.html	__/ruby/doc/NEWS/NEWS-3_0_0_md.html	__/ruby/doc/NEWS/NEWS-3_1_0_md.html
__/ruby/doc/NEWS/NEWS-3_2_0_md.html	__/ruby/doc/NEWS/NEWS-3_3_0_md.html	__/ruby/doc/NEWS/NEWS-3_4_0_md.html	__/ruby/doc/NEWS/NEWS-4_0_0_md.html
__/ruby/doc/contributing/building_ruby_md.html	__/ruby/doc/contributing/concurrency_guide_md.html	__/ruby/doc/contributing/documentation_guide_md.html	__/ruby/doc/contributing/dtrace_probes_rdoc.html
__/ruby/doc/contributing/reporting_issues_md.html	__/ruby/doc/distribution/windows_md.html	__/ruby/doc/extension_ja_rdoc.html	__/ruby/doc/extension_rdoc.html
__/ruby/doc/language/encodings_rdoc.html	__/ruby/doc/language/globals_md.html	__/ruby/doc/language/marshal_rdoc.html	__/ruby/doc/language/strftime_formatting_rdoc.html
__/ruby/doc/security/command_injection_rdoc.html	__/ruby/doc/security/security_rdoc.html	__/ruby/doc/syntax/keywords_rdoc.html	__/ruby/doc/syntax/pattern_matching_rdoc.html

Array.html

expression memo
source A {%w or %W string-array Literal}[rdoc-ref:...]
master <p>A array literal:</p> bug
this PR <p>A %w or %W string-array Literal:</p> fixed

ERB.html

expression memo
source **Embedded Ruby**
master <strong>Embedded Ruby</strong> crossref inside em is disabled
this PR <strong>Embedded <a href="Ruby.html"><code>Ruby</code></a></strong> crossref enabled

Same diff in IO.html, Net/HTTPHeader.html, OpenSSL/Cipher.html and Socket.html

Encoding/Converter.html

expression memo
source "\uFFFD"
master “uFFFD” \u is unescaped
this PR “\uFFFD” backslash remains

Same diff in PrettyPrint.html, Random/Formatter.html

Hash.html

expression memo
source hash same \hash key
master same hash key \h is unescaped
this PR same \hash key backslash remains

Kernel.html

expression memo
source {\`command`}[rdoc-ref:Kernel#`] backquote escaped
master <a href="Kernel.html#method-i-60"><code>command</code></a> Escape ignored(bug)
this PR <a href="Kernel.html#method-i-60">‘command`</a> fixed
expression memo
source Whether the entity's setgid bit is set.
master <td align="left">Whether the entity's setgid bit is set.</td> quote is '
this PR <td align="left">Whether the entity’s setgid bit is set.</td> quote is

Net/HTTP.html

expression memo
source {Proxy Using ENV['http_proxy']}[rdoc-ref:...] Tidylink inside tidylink (invalid)
master <a href="...">Proxy Using <a href="&#39;http_proxy&#39;">ENV</a></a> Bug (</a></a>)
this PR {Proxy Using <a href="&#39;http_proxy&#39;">ENV</a>}[<a href="...">...</a>]

OpenSSL/SSL/SSLSocket.html

expression memo
source _remote\_host_ and _remote\_port_ Wrong escaping
master <em>remote</em>host_ and <em>remote</em>port_ messed
this PR _remote_host_ and _remote_port_ unescaped

OpenSSL/X509/Name.html

expression memo
source or ++1+ if Should be considered closing ++ missing
master <code>+1</code> bug
this PR ++1+ fixed

OpenSSL.html

expression memo
source v2.0[http://www.rsa.com/rsalabs/node.asp?id=2127]
master <a href="http://www.rsa.com/rsalabs/node.asp?id=2127">v2.0</a> Simplified tidylink
this PR v2.0[<a href="http://www.rsa.com/rsalabs/node.asp?id=2127">www.rsa.com/rsalabs/node.asp?id=2127</a>] Simplified tidylink changed

Regexp.html

expression memo
source (see {%r: Regexp Literals}[...])
master (see Regexp Literals): bug
this PR (see %r: Regexp Literals): fixed
expression memo
source {rdoc-ref:Regexp global variables}[rdoc-ref:Regexp@Global+Variables] redundant label
master <a href="Regexp.html#class-regexp-global-variables">Regexp global variables</a>
this PR <a href="Regexp.html#class-regexp-global-variables">rdoc-ref:Regexp global variables</a>

String.html

expression memo
source <b>+String+ +replacement+</b>
master <a href="String.html"><code>String</code></a> <code>replacement</code></strong> </strong> without opening <strong> (bug)
this PR <strong><a href="String.html"><code>String</code></a> <code>replacement</code></strong> fixed
expression memo
source {string literal}[rdoc-ref:syntax/literals.rdoc@String+Literals] or a {here document literal}[rdoc-ref:syntax/literals.rdoc@Here+Document+Literals]
master <p>string literal or a string literal</p> bug in cross_reference cache
this PR <p>string literal or a here document literal</p> bug is somehow avoided

Same diff in Regexp.html

StringScanner.html

expression memo
source [stored string][1]
master [stored <a href="1">string]</a> Considered simplified tidylink
this PR [stored string][1] Considered plain text

(and many more. give up listing)

@tompng tompng temporarily deployed to fork-preview-protection January 10, 2026 21:37 — with GitHub Actions Inactive
@matzbot
Copy link
Collaborator

matzbot commented Jan 10, 2026

🚀 Preview deployment available at: https://0eb62010.rdoc-6cd.pages.dev (commit: 763f379)

Remove complicated string-replacing/macro based parser and change to a normal parser that returns a structured data
@tompng tompng force-pushed the markup_inline_parser branch from 1bb81c8 to b8e0259 Compare January 10, 2026 22:30
@tompng tompng temporarily deployed to fork-preview-protection January 10, 2026 22:30 — with GitHub Actions Inactive
Copy link
Member

@st0012 st0012 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the big rewrite 🙏 I couldn't finish review in one pass. I'll give it another look in the next few days.

next handle_TEXT(node) if String === node
case node[:type]
when :TIDYLINK
@in_tidylink = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be moved into handle_TIDYLINK?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!
I've move this to ToHtml#handle_TIDYLINK and renamed ivar to @in_tidylink_label.

@in_tidylink_label is true while processing tidylink label-part, but false while processing tidylink url-part.
Because rdoc-ref:... in tidylink label should be ignored (to avoid generating nested <a> tag) but rdoc-ref:... inside tidylink url should be handled.

attrs = @attributes.keys
if @current_attributes != attrs && @current_attributes != attrs.sort!
change_attribute(@current_attributes, attrs)
@current_attributes = attrs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be part of change_attribute?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The method and ivar name was misleading.

@attributes is the current attributes to be applied to text
@current_attributes (renamed to @applied_attributes) is the styling applied/flushed to output as escape sequences.
on and off changes attribute
change_attributes (renamed to apply_attributes) will flush the buffered attribute changes

@tompng tompng temporarily deployed to fork-preview-protection January 11, 2026 21:50 — with GitHub Actions Inactive
@tompng tompng temporarily deployed to fork-preview-protection January 12, 2026 10:59 — with GitHub Actions Inactive
@tompng tompng force-pushed the markup_inline_parser branch from 1b7488d to 9eac17b Compare January 12, 2026 16:47
@tompng tompng requested a deployment to fork-preview-protection January 12, 2026 16:47 — with GitHub Actions Waiting
@tompng tompng deployed to fork-preview-protection January 12, 2026 17:42 — with GitHub Actions Active
@st0012 st0012 added the bug label Jan 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants