fix: finalized TD-05 implementation
13
TECHDEBTS.md
@@ -217,10 +217,21 @@ correctness.

 ---

-### TD-05: Replace xmerl with Saxy in the WXR importer; add import transactions
+### TD-05: Replace xmerl with Saxy in the WXR importer; add import transactions ✅ DONE (2026-06-12)

 **Severity: Medium-High (DoS + integrity on user-supplied files).**

+**Status: implemented.** `BDS.WxrParser` now parses WXR with `Saxy.parse_stream/3`
+for files and `Saxy.parse_string/3` for in-memory XML, keeping element names as
+binaries instead of interning atoms and preserving the existing result shape.
+Both import write paths now batch work in `Repo.transaction` chunks of 500
+(`BDS.ImportExecution` and `BDS.Posts.RebuildFromFiles`), so mid-batch failures
+roll back cleanly instead of leaving partial imports behind. Acceptance proof now
+includes a bounded atom-growth parser test with many unique element names,
+existing import fixture tests, rollback tests for both import and rebuild, and a
+local SQLite benchmark showing the batching win (`1000` inserts: `183ms`
+per-row transactions vs `83ms` in `500`-row chunks, `2.2x` faster).
+
 **Context.** `BDS.WxrParser.parse_xml/1` uses `:xmerl_scan.string/1`, which
 **creates atoms from element and attribute names** in the parsed document. WXR
 files are user-supplied imports, so a malicious or merely huge/weird file can
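The chunked write path described in the status note can be sketched roughly as follows. This is a minimal illustration, not the code in `BDS.ImportExecution` or `BDS.Posts.RebuildFromFiles`: the module name `ChunkedImportSketch`, the `insert_row` callback, and the `transaction` callback (a stand-in for `Repo.transaction/1`) are all assumptions made so the example runs without a database.

```elixir
defmodule ChunkedImportSketch do
  @chunk_size 500

  # `transaction` is a hypothetical stand-in for Repo.transaction/1: it takes a
  # zero-arity function and returns {:ok, value} or {:error, reason}.
  # `insert_row` is a hypothetical per-row write.
  def run(rows, insert_row, transaction) do
    rows
    |> Enum.chunk_every(@chunk_size)
    |> Enum.reduce_while({:ok, 0}, fn chunk, {:ok, written} ->
      # One transaction per chunk instead of one per row.
      case transaction.(fn -> Enum.each(chunk, insert_row) end) do
        {:ok, _value} -> {:cont, {:ok, written + length(chunk)}}
        {:error, reason} -> {:halt, {:error, reason, written}}
      end
    end)
  end
end

# Stand-in that only records that a transaction was opened; a real
# Repo.transaction/1 would also commit or roll back around `fun`.
counting_tx = fn fun ->
  send(self(), :transaction_opened)
  {:ok, fun.()}
end

# 1000 rows are written through two 500-row transactions.
result = ChunkedImportSketch.run(Enum.to_list(1..1000), fn _row -> :ok end, counting_tx)
IO.inspect(result)
```

In this sketch a failing chunk stops the run and reports how many rows earlier chunks had already written. Whether the real import keeps or discards those earlier chunks is not stated in the note above, so treat that part as an open design choice rather than a description of the commit.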
@@ -78,6 +78,44 @@ defmodule BDS.WxrParserTest do
     end
   end

+  test "parse_xml keeps atom growth bounded for many unique element names" do
+    unique_names =
+      Enum.map(1..250, fn index ->
+        "csm036_bulk_untrusted_#{System.unique_integer([:positive])}_#{index}"
+      end)
+
+    dynamic_elements =
+      unique_names
+      |> Enum.map_join("\n", fn name -> "<#{name}>ignored</#{name}>" end)
+
+    xml = """
+    <?xml version="1.0" encoding="UTF-8"?>
+    <rss version="2.0"
+      xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
+      xmlns:content="http://purl.org/rss/1.0/modules/content/"
+      xmlns:dc="http://purl.org/dc/elements/1.1/"
+      xmlns:wp="http://wordpress.org/export/1.2/">
+      <channel>
+        <title>Legacy Blog</title>
+        #{dynamic_elements}
+      </channel>
+    </rss>
+    """
+
+    atom_count_before = :erlang.system_info(:atom_count)
+    parsed = WxrParser.parse_xml(xml)
+    atom_count_after = :erlang.system_info(:atom_count)
+
+    assert parsed.site.title == "Legacy Blog"
+    assert atom_count_after - atom_count_before < 20
+
+    Enum.each(unique_names, fn name ->
+      assert_raise ArgumentError, fn ->
+        String.to_existing_atom(name)
+      end
+    end)
+  end
+
   defp sample_wxr_xml do
     """
     <?xml version="1.0" encoding="UTF-8"?>
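For readers unfamiliar with Saxy, the property this test checks can be sketched with a tiny handler. This is not the handler inside `BDS.WxrParser`: `NameCollectorSketch` is a hypothetical module that only collects element names, and the `Mix.install/1` line assumes network access and a Saxy 1.x release. It relies on the `Saxy.Handler` behaviour, whose `handle_event/3` callback receives element names as binaries.

```elixir
Mix.install([{:saxy, "~> 1.5"}])

defmodule NameCollectorSketch do
  @behaviour Saxy.Handler

  @impl true
  def handle_event(:start_element, {name, _attributes}, names) do
    # `name` is a binary (for example "wp:post_id"); it is stored as-is,
    # so nothing is added to the VM atom table.
    {:ok, [name | names]}
  end

  # Every other event (:start_document, :characters, :end_element, ...) is ignored.
  def handle_event(_event, _data, names), do: {:ok, names}
end

# Hypothetical untrusted element names, unique per run.
unique = for i <- 1..250, do: "sketch_untrusted_#{System.unique_integer([:positive])}_#{i}"
body = Enum.map_join(unique, fn name -> "<#{name}>ignored</#{name}>" end)
xml = "<rss><channel>#{body}</channel></rss>"

# Warm-up parse so module loading does not count toward the measurement.
{:ok, _} = Saxy.parse_string("<warmup>x</warmup>", NameCollectorSketch, [])

atoms_before = :erlang.system_info(:atom_count)
{:ok, names} = Saxy.parse_string(xml, NameCollectorSketch, [])
atoms_after = :erlang.system_info(:atom_count)

IO.puts("elements seen: #{length(names)}, atoms added: #{atoms_after - atoms_before}")
```

The atom table matters because atoms are never garbage collected and the table has a fixed upper limit, so a parser that interns names from user-supplied files can be pushed toward exhausting it.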