Question: What is the best way to control chunk size used by standard XML iterative parsers in Python?
If single elements* aren't the optimal chunk size for iterative parsers, then what is? Specifically, where are the chosen chunk sizes documented for the popular library lxml and for the Python built-in xml.etree.ElementTree?
I seem to have a workaround for changing the default chunk size (e.g. to single lines as a proof of concept) while still using the same iterative parsers and not developing a new one, but I want to know if there is a better, widely-known solution than my somewhat hacky workaround.
* Note: In highly structured example XML documents optimized for human readability, each line usually corresponds to a single opening or closing tag of a single element, so it's conceivable that some chunk sizes are measured in numbers of lines. It seems more plausible, though, that parsers measure chunk sizes in characters or bytes.
What I have tried: I would prefer not to use SAX because it requires a lot of confusingly structured boilerplate code.
From the way iterparse (both from lxml.etree and from xml.etree.ElementTree) is usually discussed, it sounds as if it parses XML files "iteratively", as in element by element / tag by tag (see the note above).
But it appears that in practice, given a file-like object as input, both parsers parse the output of that object's .read method (as opposed to the output of .readline, as I had expected). If the file-like object is a file pointer to an 8GB file and this runs on a cluster node with 2GB of memory, that will of course cause an OOM error.
.read has an optional size argument giving the maximum number of bytes (or characters, for text streams) to read into memory, but if the standard iterative parsers do pass a value for it when invoking .read, they don't seem to document what value they use. The MWE below shows that if such a value is used, it is at least large enough to consume the entire 16-line example document in one call.
This conclusion is based on this answer to a related question as well as my own testing. Here is a MWE:
import io
from lxml import etree
import xml.etree.ElementTree as etree2
xml_string = """<root>
<Employee Name="Mr.ZZ" Age="30">
<Experience TotalYears="10" StartDate="2000-01-01" EndDate="2010-12-12">
<Employment id = "1" EndTime="ABC" StartDate="2000-01-01" EndDate="2002-12-12">
<Project Name="ABC_1" Team="4">
</Project>
</Employment>
<Employment id = "2" EndTime="XYZ" StartDate="2003-01-01" EndDate="2010-12-12">
<PromotionStatus>Manager</PromotionStatus>
<Project Name="XYZ_1" Team="7">
<Award>Star Team Member</Award>
</Project>
</Employment>
</Experience>
</Employee>
</root>"""
#### lxml output
for event, element in etree.iterparse(io.BytesIO(xml_string.encode("UTF-8")),
                                      recover=True, remove_blank_text=True,
                                      events=("start", "end")):
    print(str((event, element, element.tag,
               element.text.strip() if element.text is not None else element.text,
               element.tail.strip() if element.tail is not None else element.tail)) + "\n")
    print(f"{etree.tostring(element)}\n")
### xml.etree.ElementTree output is the same
for event, element in etree2.iterparse(io.BytesIO(xml_string.encode("UTF-8")),
                                       events=("start", "end")):
    print(str((event, element, element.tag,
               element.text.strip() if element.text is not None else element.text,
               element.tail.strip() if element.tail is not None else element.tail)) + "\n")
    print(f"{etree2.tostring(element)}\n")
Already at the very first iteration, the string representation of the root tag is the entire XML document, which suggests that the whole output of .read has already been parsed, rather than just the first line (which is what I had originally expected the first iteration to correspond to, based on others' discussions of iterparse).
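One way to settle what the parsers actually request, rather than guessing, is to wrap the input stream and log every .read call. This is my own diagnostic sketch, not part of either library's API; it assumes the xml_string defined above:

import io
import xml.etree.ElementTree as ET

class LoggingBytesIO(io.BytesIO):
    # Record the size argument of every .read call the parser makes.
    def read(self, size=-1):
        data = super().read(size)
        print(f"read(size={size}) -> {len(data)} bytes returned")
        return data

# Each printed line shows one chunk the parser requested.
for event, element in ET.iterparse(LoggingBytesIO(xml_string.encode("UTF-8")),
                                   events=("start", "end")):
    pass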
I was able to come up with the following workaround, which displays the expected line-by-line parsing behavior. However, I wonder if there are better solutions. For example, would the millions of readline calls needed for an ~8GB file become an I/O bottleneck?
### for the MWE
class StreamString(object):
    def __init__(self, string):
        self._io = io.StringIO(string)

    def read(self, size=None):
        # Ignore the requested size and hand back exactly one line per call.
        return self._io.readline().encode("UTF-8")

    def close(self):
        self._io.close()

### closer to what would be used in practice
class StreamFile(object):
    def __init__(self, path):
        self._file = open(path, "r")

    def read(self, size=None):
        # Ignore the requested size and hand back exactly one line per call.
        return self._file.readline().encode("UTF-8")

    def close(self):
        self._file.close()
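If per-line granularity turns out to be the wrong unit, the same trick generalizes to fixed-size chunks, which more directly answers the "control the chunk size" part of the question. A hypothetical variant (my own sketch, not taken from any library):

class StreamFileChunked(object):
    # Like StreamFile, but returns at most chunk_size bytes per .read call,
    # regardless of how much the parser asks for.
    def __init__(self, path, chunk_size=1024):
        self._file = open(path, "rb")  # binary mode: no decode/encode round trip
        self._chunk_size = chunk_size

    def read(self, size=None):
        if size is None or size < 0:
            size = self._chunk_size
        return self._file.read(min(size, self._chunk_size))

    def close(self):
        self._file.close()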
### demonstrating the expected line-by-line parsing behavior
iterator = etree.iterparse(StreamString(xml_string), recover=True,
                           remove_blank_text=True, events=("start", "end"))
event, root = next(iterator)
print(str((event, root, root.tag,
           root.text.strip() if root.text is not None else root.text,
           root.tail.strip() if root.tail is not None else root.tail)) + "\n")
print(f"{etree.tostring(root)}\n")
for event, element in iterator:
    print(str((event, element, element.tag,
               element.text.strip() if element.text is not None else element.text,
               element.tail.strip() if element.tail is not None else element.tail)) + "\n")
    # Printing the root on purpose: it shows the tree growing line by line.
    print(f"{etree.tostring(root)}\n")
This demonstrates the expected behavior: the parsed tree under the root element grows with each iteration as new lines are consumed. This behavior is also easier to reconcile with the numerous suggestions on this site about how to clear the memory footprint of nodes after processing them (together with the already-processed preceding siblings of the node and of its ancestors). It is unclear to me why this is not the default behavior.
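For reference, the clearing pattern alluded to above usually looks something like the following with lxml (a sketch of the widely circulated "fast_iter" recipe; the function name and arguments here are my own choices):

from lxml import etree

def iter_and_clear(source, tag):
    # Yield each matching element, then free it and any already-processed
    # preceding siblings so that memory stays bounded. Callers must not
    # keep references to yielded elements after advancing the iterator.
    context = etree.iterparse(source, events=("end",), tag=tag)
    for event, element in context:
        yield element
        element.clear()
        while element.getprevious() is not None:
            del element.getparent()[0]
    del context

# e.g.: for employment in iter_and_clear("big.xml", "Employment"): ...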
Note: although the XML string used for the MWE is small and easily fits into memory, the end goal is to run this on XML files that are potentially gigabytes in size, on cluster nodes with 1-2 GB of memory. (I don't have control over the compute environment; yes, I agree it would make more sense to scale vertically to a single node with ~64GB of memory.)
Answer:
Event-based parsers, unlike DOM parsers, do not have to build an in-memory representation of the parsed data and therefore are not limited to documents that can fit in memory. Furthermore, "line-by-line parsing" of an XML file makes no sense, as XML is not a line-oriented format. Parsing XML is a long-solved problem. It's better to understand the fully capable existing parsing solutions than to reinvent them poorly.
Realize that processing events via callbacks such as startElement() requires no more state than you or your requirements impose. If you attempt to retrieve the contents of the root element as a string, you of course risk running out of memory. Don't do that; it's fighting the event framework rather than working naturally within it.
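To make that point concrete, here is a minimal SAX handler in the spirit the answer describes (my own illustration, reusing the xml_string from the question):

import xml.sax

class EmployeeHandler(xml.sax.ContentHandler):
    # Keep only the state the task needs; memory use stays bounded
    # no matter how large the document is.
    def startElement(self, name, attrs):
        if name == "Employment":
            print("Employment id:", attrs.get("id"))

xml.sax.parseString(xml_string.encode("UTF-8"), EmployeeHandler())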
Comment: xml_string is small enough to easily fit in a reasonable chunk size. – user2357112, Jan 13 at 0:42