fix: finalized TD-05 implementation

2026-06-12 11:54:46 +02:00
parent e3a1010ae9
commit 2e633922f9
2 changed files with 50 additions and 1 deletions


@@ -217,10 +217,21 @@ correctness.
---
### TD-05: Replace xmerl with Saxy in the WXR importer; add import transactions
### TD-05: Replace xmerl with Saxy in the WXR importer; add import transactions ✅ DONE (2026-06-12)
**Severity: Medium-High (DoS + integrity on user-supplied files).**
**Status: implemented.** `BDS.WxrParser` now parses WXR with `Saxy.parse_stream/3`
for files and `Saxy.parse_string/3` for in-memory XML, keeping element names as
binaries instead of interning atoms and preserving the existing result shape.
Both import write paths now batch work in `Repo.transaction` chunks of 500
(`BDS.ImportExecution` and `BDS.Posts.RebuildFromFiles`), so mid-batch failures
roll back cleanly instead of leaving partial imports behind. Acceptance proof now
includes a bounded atom-growth parser test with many unique element names,
existing import fixture tests, rollback tests for both import and rebuild, and a
local SQLite benchmark showing the batching win (`1000` inserts: `183ms`
per-row transactions vs `83ms` in `500`-row chunks, `2.2x` faster).
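
A minimal sketch of the chunked-transaction pattern described above, not the actual `BDS.ImportExecution` code. The module `ChunkedImportSketch`, the function `import_rows/3`, and both callback arguments are hypothetical names introduced for illustration; `transaction_fun` stands in for a function shaped like `Repo.transaction/1`, returning `{:ok, value}` or `{:error, reason}`.

```elixir
defmodule ChunkedImportSketch do
  # Hypothetical illustration only; chunk size taken from the status note above.
  @chunk_size 500

  # rows            - any enumerable of rows to insert
  # insert_fun      - hypothetical 1-arity function that writes a single row
  # transaction_fun - assumed to behave like Repo.transaction/1: it runs the
  #                   given 0-arity function and returns {:ok, value} or
  #                   {:error, reason}
  def import_rows(rows, insert_fun, transaction_fun) do
    rows
    |> Enum.chunk_every(@chunk_size)
    |> Enum.reduce_while({:ok, 0}, fn chunk, {:ok, imported} ->
      # One transaction per chunk instead of one per row.
      case transaction_fun.(fn -> Enum.each(chunk, insert_fun) end) do
        {:ok, _} -> {:cont, {:ok, imported + length(chunk)}}
        # Stop at the first failed chunk and report how many rows
        # were committed before it.
        {:error, reason} -> {:halt, {:error, reason, imported}}
      end
    end)
  end
end
```

With Ecto, the third argument could be `&Repo.transaction/1`. In this sketch, chunks committed before a failing chunk stay committed; only the failing chunk is rolled back by the transaction function.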
**Context.** `BDS.WxrParser.parse_xml/1` uses `:xmerl_scan.string/1`, which
**creates atoms from element and attribute names** in the parsed document. WXR
files are user-supplied imports, so a malicious or merely huge/weird file can


@@ -78,6 +78,44 @@ defmodule BDS.WxrParserTest do
end
end
  test "parse_xml keeps atom growth bounded for many unique element names" do
    unique_names =
      Enum.map(1..250, fn index ->
        "csm036_bulk_untrusted_#{System.unique_integer([:positive])}_#{index}"
      end)

    dynamic_elements =
      unique_names
      |> Enum.map_join("\n", fn name -> "<#{name}>ignored</#{name}>" end)

    xml = """
    <?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0"
    xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.2/">
    <channel>
    <title>Legacy Blog</title>
    #{dynamic_elements}
    </channel>
    </rss>
    """

    atom_count_before = :erlang.system_info(:atom_count)
    parsed = WxrParser.parse_xml(xml)
    atom_count_after = :erlang.system_info(:atom_count)

    assert parsed.site.title == "Legacy Blog"
    assert atom_count_after - atom_count_before < 20

    Enum.each(unique_names, fn name ->
      assert_raise ArgumentError, fn ->
        String.to_existing_atom(name)
      end
    end)
  end

defp sample_wxr_xml do
"""
<?xml version="1.0" encoding="UTF-8"?>