# MacroDown Design Document
## 1. Overview
MacroDown is a C++ Markdown processor that extends the CommonMark syntax with a TeX-like macro system. The core philosophy is that **all** markup elements (headings, emphasis, links) are treated as syntactic sugar for macro calls. The processor parses the document into a tree of macro invocations and evaluates them using a standard library of macro definitions to produce HTML.
## 2. Architecture
The system operates in a linear pipeline:
```mermaid
graph TD
Input[Source Text] --> BlockParser["Block Parsing (Phase 1)"]
BlockParser --> BlockTree[Block Tree]
BlockTree --> InlineParser["Inline Parsing (Phase 2)"]
InlineParser --> MacroAST[Macro Syntax Tree]
MacroAST --> Evaluator[Macro Evaluator]
Evaluator --> HTML[Output HTML]
```
### 2.1 The Macro System
The core logic revolves around Macros.
* **Definition Syntax**: `%def[name]{arg1, arg2, ...}{body}`
    * Example: `%def[my_macro]{t1, t2}{It’s a %em{%t1} that is %t2.}`
* **Call Syntax**: `%name{arg1}{arg2}...`
    * Example: `%my_macro{test}{good}`
* **Expansion**: The Evaluator recursively expands macros until only text remains.
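To make the expansion step more concrete, here is a minimal sketch of how bound arguments might be substituted into a macro body. The helper name `substituteParameters` and the rule that a parameter reference ends at the first character that is not a letter, digit, or `_` are assumptions made for illustration, not part of the design above.

```cpp
#include <cctype>
#include <map>
#include <string>

// Hypothetical helper: replaces each `%param` in a macro body with the
// argument bound to it. Names not found in `bindings` (e.g. `%em`) are
// copied through unchanged so a later pass can expand them as macro calls.
std::string substituteParameters(const std::string& body,
                                 const std::map<std::string, std::string>& bindings)
{
    std::string result;
    std::size_t i = 0;
    while(i < body.size())
    {
        if(body[i] != '%')
        {
            result += body[i++];
            continue;
        }
        // Collect the identifier that follows '%'.
        std::size_t end = i + 1;
        while(end < body.size() &&
              (std::isalnum(static_cast<unsigned char>(body[end])) || body[end] == '_'))
        {
            ++end;
        }
        const std::string name = body.substr(i + 1, end - i - 1);
        const auto found = bindings.find(name);
        if(found != bindings.end())
        {
            result += found->second;
        }
        else
        {
            result += body.substr(i, end - i);
        }
        i = end;
    }
    return result;
}
```

With `t1` bound to `test` and `t2` bound to `good`, the body of `my_macro` becomes `It's a %em{test} that is good.`, which still contains a `%em` call for the Evaluator to expand in a further step.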
### 2.2 CommonMark Parsing Strategy
We follow the two-phase parsing strategy described in the appendix of the CommonMark spec ("A parsing strategy") and add a transformation step:
1. **Phase 1 (Block Structure)**: Analyze the document line by line to construct a tree of Blocks (Paragraphs, Lists, Blockquotes). This phase handles nesting and indentation.
2. **Phase 2 (Inline Structure)**: Walk the Block tree and parse the text content of leaf blocks into Inline elements (Emphasis, Links, Code).
3. **Transformation**: Convert the CommonMark Block/Inline tree into the unified **Macro AST**.
    * `# Heading` $\rightarrow$ `%h1{Heading}`
    * `*Emphasis*` $\rightarrow$ `%em{Emphasis}`
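The transformation step can be illustrated with a small sketch that renders the macro call a leaf block desugars to. `LeafKind`, `LeafBlock`, and `toMacroCall` are hypothetical, simplified stand-ins for the real block types, and the `h2` to `h6` macro names are an assumption extrapolated from `%h1`.

```cpp
#include <string>

// Hypothetical, simplified stand-in for a parsed leaf block.
enum class LeafKind
{
    Paragraph,
    Heading
};

struct LeafBlock
{
    LeafKind kind;
    int level;            // Heading level (1-6); ignored for paragraphs.
    std::string content;  // Content with inline elements already converted.
};

// Renders the macro call that a leaf block desugars to, in call syntax.
std::string toMacroCall(const LeafBlock& block)
{
    std::string name = "p";
    if(block.kind == LeafKind::Heading)
    {
        name = "h" + std::to_string(block.level);
    }
    return "%" + name + "{" + block.content + "}";
}
```

A real implementation would build `Macro` nodes directly instead of going through text, but the mapping from block type to macro name would be the same.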
### 2.3 Custom Markups
The system supports user-defined custom markups that map to macros. The content of the markup is determined by a regular expression.
* **Prefix Markup**: Starts with a specific character (e.g., `#tag`) and captures text matching a regex pattern. By default, the captured text ends at the first whitespace or punctuation character, except that `_`, `-`, `@`, and `.` are allowed inside it.
    * Example: `#tag` $\rightarrow$ `%tag_macro{tag}`
* **Delimited Markup**: Starts and ends with the same character (e.g., `:highlight:`) and captures text matching a regex pattern. No whitespace is allowed between the delimiters.
    * Example: `:highlight:` $\rightarrow$ `%highlight_macro{highlight}`
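As an illustration of the delimited case, the following sketch checks whether a delimited markup starts at a given offset. The function name `matchDelimited` is hypothetical, and using a single `char` as the delimiter and `std::regex` as the regex engine are simplifying assumptions.

```cpp
#include <cctype>
#include <optional>
#include <regex>
#include <string>

// Hypothetical helper: returns the text between two delimiters starting at
// `pos`, or std::nullopt if the markup does not apply there.
std::optional<std::string> matchDelimited(const std::string& text, std::size_t pos,
                                          char delimiter, const std::string& pattern)
{
    if(pos >= text.size() || text[pos] != delimiter)
    {
        return std::nullopt;
    }
    const std::size_t close = text.find(delimiter, pos + 1);
    if(close == std::string::npos || close == pos + 1)
    {
        return std::nullopt;  // No closing delimiter, or empty content.
    }
    const std::string content = text.substr(pos + 1, close - pos - 1);
    for(unsigned char c : content)
    {
        if(std::isspace(c))
        {
            return std::nullopt;  // Whitespace is not allowed inside.
        }
    }
    // The whole content must match the user-supplied pattern.
    if(!std::regex_match(content, std::regex(pattern)))
    {
        return std::nullopt;
    }
    return content;
}
```

For the input `see :highlight: here` at offset 4 with the pattern `[a-z]+`, the helper yields `highlight`, which the inline parser would then wrap as `%highlight_macro{highlight}`.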
## 3. Data Structures
### 3.1 AST Nodes
The final tree consists of a unified `Node` type using `std::variant` to hold different data types:
```cpp
#include <memory>
#include <string>
#include <variant>
#include <vector>

struct Node; // Forward declaration: Macro and Group hold pointers to Node.

struct Text
{
    std::string content;
};
struct Macro
{
    std::string name;
    std::vector<std::unique_ptr<Node>> arguments;
    bool is_special = false;
};
struct Group
{
    std::vector<std::unique_ptr<Node>> children;
};
struct Node
{
    using Data = std::variant<Text, Macro, Group>;
    Data data;
    // Call function on each node in the tree (pre-order traversal).
    // The callback function takes const Node& as an argument.
    template<typename Callback>
    void forEach(Callback f) const;
};
```
The `forEach` method provides a way to iterate over all nodes in the tree, including children of `Group` nodes and arguments of `Macro` nodes.
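One possible way to implement `forEach` is sketched below, using `std::get_if` to dispatch on the variant. The traversal order (node first, then macro arguments or group children, left to right) follows the pre-order description above; passing the callback by value on each recursive call is an assumption that suits lambdas capturing by reference.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <variant>
#include <vector>

struct Node;  // Forward declaration: Macro and Group hold pointers to Node.

struct Text
{
    std::string content;
};

struct Macro
{
    std::string name;
    std::vector<std::unique_ptr<Node>> arguments;
    bool is_special = false;
};

struct Group
{
    std::vector<std::unique_ptr<Node>> children;
};

struct Node
{
    using Data = std::variant<Text, Macro, Group>;
    Data data;

    template<typename Callback>
    void forEach(Callback f) const
    {
        f(*this);  // Pre-order: visit the node before its descendants.
        if(const auto* macro = std::get_if<Macro>(&data))
        {
            for(const auto& argument : macro->arguments)
            {
                argument->forEach(f);
            }
        }
        else if(const auto* group = std::get_if<Group>(&data))
        {
            for(const auto& child : group->children)
            {
                child->forEach(f);
            }
        }
    }
};
```

Because the callback is copied at each level, a functor that accumulates state in its own members would lose that state; capturing an external counter or container by reference avoids the problem.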
### 3.2 Block Structure (Intermediate)
During Phase 1, we use a structure mirroring CommonMark blocks to maintain state (open/closed blocks, list types).
```cpp
enum class BlockType
{
    Document,
    Quote,
    List,
    ListItem,
    FencedCode,
    IndentedCode,
    HtmlBlock,
    Paragraph,
    Heading,
    ThematicBreak,
    // ... potentially others
};
struct Block
{
    BlockType type;
    std::vector<std::unique_ptr<Block>> children; // For container blocks
    std::string literal_content; // For leaf blocks (raw text to be parsed later)
    int level = 0; // For headings
    // ... metadata for parsing state
};
```
### 3.3 Unicode Strategy
* **Library**: Use [uni-algo](https://github.com/uni-algo/uni-algo) for all Unicode-related operations.
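* **Storage**: All text is stored in `std::string` and assumed to be **UTF-8** encoded.
* **Operations**:
    * **Iteration**: Use `uni::iter::utf8` for safe code point traversal.
    * **Properties**: Use `uni::is_space` and `uni::is_punct` (or equivalent category checks) to comply with CommonMark's definitions of whitespace and punctuation. These identifiers are provisional: the exact names and namespace depend on the uni-algo release and should be checked against the pinned version.
    * **Optimization**: Byte-by-byte scanning is still used for performance when looking for ASCII-only delimiters (`%`, `{`, `}`, `[`, `]`). This is safe because every byte of a multi-byte UTF-8 sequence is `0x80` or higher, so it can never be mistaken for an ASCII delimiter.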
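The byte-level optimization can be sketched with the standard library alone. The function name `findNextDelimiter` is hypothetical; the point is only that searching for ASCII bytes in UTF-8 text needs no decoding, since lead and continuation bytes of multi-byte sequences are never in the ASCII range.

```cpp
#include <string_view>

// Returns the byte offset of the next macro-syntax delimiter at or after
// `start`, or std::string_view::npos if there is none.
std::size_t findNextDelimiter(std::string_view text, std::size_t start = 0)
{
    return text.find_first_of("%{}[]", start);
}
```

The returned offset is a byte offset, not a code point index, so it can be used directly to slice the underlying `std::string`.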
### 3.4 Custom Markup Definitions
Users can define custom markups that are transformed into macros during the inline parsing phase.
```cpp
struct PrefixMarkup
{
    std::string prefix; // The trigger character(s)
    std::string macro_name; // Target macro to transform into
    std::string pattern; // Regex pattern for the marked-up text
};
struct DelimitedMarkup
{
    std::string delimiter; // The character used for start and end
    std::string macro_name; // Target macro to transform into
    std::string pattern; // Regex pattern for the content between delimiters
};
```
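The following sketch shows how a `PrefixMarkup` definition might be applied at a given offset of the inline text. `expandPrefixMarkup` is a hypothetical helper, `std::regex` is assumed as the regex engine, and the result is returned as macro call text only to keep the example short.

```cpp
#include <optional>
#include <regex>
#include <string>

struct PrefixMarkup
{
    std::string prefix;      // The trigger character(s)
    std::string macro_name;  // Target macro to transform into
    std::string pattern;     // Regex pattern for the marked-up text
};

// Hypothetical helper: if `text` at `pos` starts with the markup's prefix and
// the characters right after it match the pattern, returns the macro call.
std::optional<std::string> expandPrefixMarkup(const PrefixMarkup& markup,
                                              const std::string& text, std::size_t pos)
{
    if(pos > text.size() || text.compare(pos, markup.prefix.size(), markup.prefix) != 0)
    {
        return std::nullopt;
    }
    const auto begin = text.cbegin() +
        static_cast<std::string::difference_type>(pos + markup.prefix.size());
    const std::regex re(markup.pattern);
    std::smatch match;
    // match_continuous anchors the match directly after the prefix.
    if(!std::regex_search(begin, text.cend(), match, re,
                          std::regex_constants::match_continuous))
    {
        return std::nullopt;
    }
    if(match.length(0) == 0)
    {
        return std::nullopt;  // An empty capture is not treated as markup.
    }
    return "%" + markup.macro_name + "{" + match.str(0) + "}";
}
```

With the definition `{"#", "tag_macro", "[A-Za-z0-9_]+"}`, the text `a #tag, b` at offset 2 yields `%tag_macro{tag}`; the comma ends the capture because the pattern does not accept it.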
## 4. Component Design
### 4.0 Top-level Interface (`MacroDown`)
The `MacroDown` class provides a simplified two-step interface for rendering documents, as required by the specification.
* **Step 1: Parse** (`parse`): Takes a source string and returns a single root `Node` (the syntax tree).
* **Step 2: Render** (`render`): Takes the root `Node` and produces the final HTML string using the internal `Evaluator`.
* **Configuration**: Allows defining custom markups via `definePrefixMarkup` and `defineDelimitedMarkup`.
On construction, `MacroDown` automatically registers the standard library of macros.
### 4.1 Block Parser (`BlockParser`)
* **Input**: Line iterator.
* **Mechanism**:
    * Maintains a stack of "Open" blocks.
    * For each line, determines which open blocks match the line's indentation/markers.
    * Closes unmatched blocks and opens new ones.
    * Adds text to the currently open leaf block.
* **Output**: Root `Block`.
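The matching step can be sketched for the simplest container type, the block quote. `QuoteMatch` and `matchOpenQuotes` are hypothetical names, and the sketch deliberately ignores lazy continuation lines, tabs, and all other container types.

```cpp
#include <string_view>

// Result of matching one line against the currently open quote containers.
struct QuoteMatch
{
    int matched;         // Number of open quotes that this line continues.
    std::size_t offset;  // Byte offset where the remaining line content starts.
};

// Consumes one `>` marker per open quote container, outermost first.
QuoteMatch matchOpenQuotes(std::string_view line, int open_depth)
{
    QuoteMatch result{0, 0};
    std::size_t pos = 0;
    while(result.matched < open_depth)
    {
        // Up to three spaces of indentation may precede the marker.
        std::size_t marker = pos;
        while(marker < line.size() && marker - pos < 3 && line[marker] == ' ')
        {
            ++marker;
        }
        if(marker >= line.size() || line[marker] != '>')
        {
            break;
        }
        pos = marker + 1;
        // One optional space after '>' belongs to the marker.
        if(pos < line.size() && line[pos] == ' ')
        {
            ++pos;
        }
        ++result.matched;
        result.offset = pos;
    }
    return result;
}
```

If `matched` is smaller than the number of open quotes, the parser would close the unmatched inner blocks before handling the rest of the line from `offset` onward.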
### 4.2 Inline Parser (`InlineParser`)
* **Input**: `literal_content` string from a Block.
* **Mechanism**:
    * Scans for delimiters (`*`, `_`, `[`, `` ` ``, `!`).
    * **Crucially**: Scans for the macro start character `%`.
    * **Custom Markups**: Scans for user-defined prefix and delimited markups.
    * Uses the "delimiter stack" algorithm from the CommonMark spec to resolve emphasis nesting.
* **Output**: A list of `Node`s (`Text` and `Macro` nodes) representing the block's text.
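Recognizing a macro call once `%` has been found can be sketched as follows. `MacroCall` and `parseMacroCall` are hypothetical names; the sketch assumes that macro names consist of letters, digits, and `_`, that arguments are delimited by balanced braces, and that there is no escape mechanism for braces. It does not handle the `%def[name]` form.

```cpp
#include <cctype>
#include <optional>
#include <string>
#include <vector>

struct MacroCall
{
    std::string name;
    std::vector<std::string> arguments;  // Raw, unparsed argument text.
    std::size_t end = 0;                 // Offset just past the call.
};

std::optional<MacroCall> parseMacroCall(const std::string& text, std::size_t pos)
{
    if(pos >= text.size() || text[pos] != '%')
    {
        return std::nullopt;
    }
    std::size_t i = pos + 1;
    while(i < text.size() &&
          (std::isalnum(static_cast<unsigned char>(text[i])) || text[i] == '_'))
    {
        ++i;
    }
    if(i == pos + 1)
    {
        return std::nullopt;  // '%' is not followed by a name.
    }
    MacroCall call;
    call.name = text.substr(pos + 1, i - pos - 1);
    // Each immediately following {...} group is one argument.
    while(i < text.size() && text[i] == '{')
    {
        int depth = 0;
        std::size_t j = i;
        for(; j < text.size(); ++j)
        {
            if(text[j] == '{')
            {
                ++depth;
            }
            else if(text[j] == '}' && --depth == 0)
            {
                break;
            }
        }
        if(j == text.size())
        {
            return std::nullopt;  // Unbalanced braces.
        }
        call.arguments.push_back(text.substr(i + 1, j - i - 1));
        i = j + 1;
    }
    call.end = i;
    return call;
}
```

The raw argument strings would then be parsed recursively by the inline parser, so a nested call such as `%em{%strong{x}}` produces a `Macro` node whose argument itself contains a `Macro` node.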
### 4.3 Evaluator (`Evaluator`)
* **Environment**: A `std::map<std::string, MacroDefinition>` that maps macro names to their definitions.
* **Mechanism**:
    * Traverses the `Node` tree.
    * If it's a `Text` node, append its content to the output.
    * If it's a `Macro` node:
        * Look up the definition.
        * Bind the arguments.
        * Parse the definition body (if it's a user macro) or execute the C++ callback (if it's an intrinsic).
        * Recursively evaluate the result.
    * If it's a `Group` node, recursively evaluate all children.
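The traversal can be sketched for intrinsic macros only. In this sketch `MacroDefinition` is assumed to be a `std::function` that receives its already-evaluated arguments, undefined macros are assumed to expand to nothing, and user-defined macros (whose bodies would be parsed and re-evaluated) are left out. None of these choices is fixed by the design above.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <variant>
#include <vector>

struct Node;  // Forward declaration: Macro and Group hold pointers to Node.

struct Text { std::string content; };
struct Macro
{
    std::string name;
    std::vector<std::unique_ptr<Node>> arguments;
    bool is_special = false;
};
struct Group { std::vector<std::unique_ptr<Node>> children; };
struct Node
{
    using Data = std::variant<Text, Macro, Group>;
    Data data;
};

// Assumption: an intrinsic receives its already-evaluated arguments.
using MacroDefinition = std::function<std::string(const std::vector<std::string>&)>;
using Environment = std::map<std::string, MacroDefinition>;

std::string evaluate(const Node& node, const Environment& env)
{
    if(const auto* text = std::get_if<Text>(&node.data))
    {
        return text->content;
    }
    if(const auto* group = std::get_if<Group>(&node.data))
    {
        std::string output;
        for(const auto& child : group->children)
        {
            output += evaluate(*child, env);
        }
        return output;
    }
    const auto& macro = std::get<Macro>(node.data);
    // Bind arguments by evaluating them first (eager evaluation).
    std::vector<std::string> arguments;
    for(const auto& argument : macro.arguments)
    {
        arguments.push_back(evaluate(*argument, env));
    }
    const auto definition = env.find(macro.name);
    if(definition == env.end())
    {
        return "";  // Assumption: undefined macros expand to nothing.
    }
    return definition->second(arguments);
}
```

Registering `em` as a callback that wraps its first argument in `<em>` tags and evaluating a `Group` holding the text `Say ` followed by `%em{hi}` produces `Say <em>hi</em>`.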
## 5. The Standard Library
The system will boot with a "Prelude" of defined macros to support Markdown features.
| Markdown Element | Macro Signature | HTML Expansion |
| :--- | :--- | :--- |
| Heading 1 | `%h1{content}` | `<h1>%content</h1>` |
| Paragraph | `%p{content}` | `<p>%content</p>` |
| Emphasis | `%em{content}` | `<em>%content</em>` |
| Strong | `%strong{content}` | `<strong>%content</strong>` |
| Link | `%link{url}{text}` | `<a href="%url">%text</a>` |
| Image | `%img{url}{alt}` | `<img src="%url" alt="%alt" />` |
| List Item | `%li{content}` | `<li>%content</li>` |
| Unordered List | `%ul{content}` | `<ul>%content</ul>` |
| Code | `%code{content}` | `<code>%content</code>` |
| Blockquote | `%quote{content}` | `<blockquote>%content</blockquote>` |
## 6. Implementation Plan
### Phase 1: Core Setup
* CMake build system.
* `Node` type with its `Text`, `Macro`, and `Group` alternatives.
* Basic `Evaluator` for text-only nodes.
### Phase 2: Macro Engine
* Implement `%def`.
* Implement parsing of macro calls (`%name{arg1}{arg2}...`).
* Unit tests for macro expansion logic.
### Phase 3: Block Parsing
* Implement the "Container Block" and "Leaf Block" logic.
* Handle simple paragraphs and ATX headings (`#`).
* Convert these Blocks into `%p` and `%h1` macros.
### Phase 4: Inline Parsing
* Implement `InlineParser` to handle text.
* Add support for `*em*` and `**strong**` mapping to macros.
* Integrate `%macro` parsing within normal text.
### Phase 5: Standard Library & HTML
* Implement the C++ callbacks or default definitions for the standard library macros.
* Finalize `main.cpp` CLI.
## 7. Build System
* **CMake**: 3.24+
## 8. Coding Conventions
* **File Names**: `snake_case` (e.g., `macro_engine.h`, `block_parser.cpp`).
* **Classes and Types**: `CapCase` (e.g., `Macro`, `BlockType`).
* **Variables**: `snake_case` (e.g., `literal_content`, `is_special`).
* **Global Constants**: `UPPER_CASE` (e.g., `MAX_RECURSION_DEPTH`).
* **Functions**:
    * `camelCase` for multi-word names (e.g., `evaluateMacro`).
    * `lowercase` for single-word names (e.g., `type()`, `evaluate()`).
* **Indentation**: Indent by 4 spaces. Place the left brace on a new line, unless the braces are empty.
* **Space before parenthesis**: No space.
Example:
```c++
if(...)
{
    ...;
}
void f() {}
```