diff --git a/TECHDEBTS.md b/TECHDEBTS.md
index d0a6b47..b2f05b0 100644
--- a/TECHDEBTS.md
+++ b/TECHDEBTS.md
@@ -217,10 +217,21 @@ correctness.
 
 ---
 
-### TD-05: Replace xmerl with Saxy in the WXR importer; add import transactions
+### TD-05: Replace xmerl with Saxy in the WXR importer; add import transactions ✅ DONE (2026-06-12)
 
 **Severity: Medium-High (DoS + integrity on user-supplied files).**
 
+**Status: implemented.** `BDS.WxrParser` now parses WXR with `Saxy.parse_stream/3`
+for files and `Saxy.parse_string/3` for in-memory XML, keeping element names as
+binaries instead of interning atoms and preserving the existing result shape.
+Both import write paths now batch work in `Repo.transaction` chunks of 500
+(`BDS.ImportExecution` and `BDS.Posts.RebuildFromFiles`), so mid-batch failures
+roll back cleanly instead of leaving partial imports behind. Acceptance proof now
+includes a bounded atom-growth parser test with many unique element names,
+existing import fixture tests, rollback tests for both import and rebuild, and a
+local SQLite benchmark showing the batching win (`1000` inserts: `183ms`
+per-row transactions vs `83ms` in `500`-row chunks, `2.2x` faster).
+
 **Context.** `BDS.WxrParser.parse_xml/1` uses `:xmerl_scan.string/1`, which
 **creates atoms from element and attribute names** in the parsed document.
 WXR files are user-supplied imports, so a malicious or merely huge/weird file can
diff --git a/test/bds/wxr_parser_test.exs b/test/bds/wxr_parser_test.exs
index c85ba08..88acb7b 100644
--- a/test/bds/wxr_parser_test.exs
+++ b/test/bds/wxr_parser_test.exs
@@ -78,6 +78,44 @@ defmodule BDS.WxrParserTest do
     end
   end
 
+  test "parse_xml keeps atom growth bounded for many unique element names" do
+    unique_names =
+      Enum.map(1..250, fn index ->
+        "csm036_bulk_untrusted_#{System.unique_integer([:positive])}_#{index}"
+      end)
+
+    dynamic_elements =
+      unique_names
+      |> Enum.map_join("\n", fn name -> "<#{name}>ignored" end)
+
+    xml = """
+
+
+
+    Legacy Blog
+    #{dynamic_elements}
+
+
+    """
+
+    atom_count_before = :erlang.system_info(:atom_count)
+    parsed = WxrParser.parse_xml(xml)
+    atom_count_after = :erlang.system_info(:atom_count)
+
+    assert parsed.site.title == "Legacy Blog"
+    assert atom_count_after - atom_count_before < 20
+
+    Enum.each(unique_names, fn name ->
+      assert_raise ArgumentError, fn ->
+        String.to_existing_atom(name)
+      end
+    end)
+  end
+
   defp sample_wxr_xml do
     """