Merge lp:~gz/bzr/remove_monkey_patched_elementtree_escaping_614522 into lp:bzr

Proposed by Martin Packman on 2010-08-29

Status:

Merged

Approved by:

John A Meinel on 2010-08-30

Approved revision:

no longer in the source branch.

Merged at revision:

5407

Proposed branch:

lp:~gz/bzr/remove_monkey_patched_elementtree_escaping_614522

Merge into:

lp:bzr

Diff against target:

85 lines (+2/-67)

1 file modified

bzrlib/xml_serializer.py (+2/-67)

To merge this branch:

bzr merge lp:~gz/bzr/remove_monkey_patched_elementtree_escaping_614522

High

Fix Released

Reviewer	Review Type	Date Requested	Status
John A Meinel		2010-08-29	Approve on 2010-08-30
Review via email: mp+34028@code.launchpad.net

Commit message

Remove monkey patching for elementtree, we do it differently anyway

Description of the change

The monkey patching of some ElementTree escaping functions for performance purposes mangles the output with newer versions of the library.

As this kind of thing is a bad idea and it's unclear to me what improvement this provides, this branch just removes all of that.

I expect people using non-ascii characters with this code on Python 2.7 currently risk corrupting their branch or something.

Jelmer Vernooij (jelmer) wrote on 2010-08-29:

Do we still use XML anywhere other than pre-2a formats?

Jelmer Vernooij (jelmer) wrote on 2010-08-29:

(If the optimization is only relevant for pre-2a formats then it would be an easier choice to get rid of an optimization).

John A Meinel (jameinel) wrote on 2010-08-29:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 8/29/2010 1:51 PM, Jelmer Vernooij wrote:
> (If the optimization is only relevant for pre-2a formats then it would be an easier choice to get rid of an optimization).

we only use xml for bundles in 2a. So not everyday, but it does get used
from time to time.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkx6ueAACgkQJdeBCYSNAAN6qgCgqm1wOHDEK9lMcbGyVYcvbK+O
hoIAmwfDioXiaVeO4di3Ia2xHHyiJfrB
=uqPs
-----END PGP SIGNATURE-----

Martin Packman (gz) wrote on 2010-08-30:

Okay, I'm confused now. In every microbenchmark I could cook up, the bzrlib version of _escape_cdata is actually slower than the original. So I tried profiling bundle, and sure enough there's an xml escaping function high up in the list, but it's not the etree one:

         502            0    205.1471     16.3189   <C:\Python24\Lib\site-packages\bzrlib\xml8.py>:217(write_inventory)
    +2610268            0     13.1339      8.5832   +<C:\Python24\Lib\site-packages\bzrlib\xml8.py>:94(_encode_and_escape)

So, what command can I run instead to measure how much ripping this code out hurts performance?

John A Meinel (jameinel) wrote on 2010-08-30:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 8/30/2010 7:44 AM, Martin [gz] wrote:
> Okay, I'm confused now. Every microbenchmark I could cook up the bzrlib version of the _escape_cdata is actually slower than the original. So, I tried profiling bundle, and sure enough there's an xml escaping function high up in the list, but it's not the etree one:
> 
>          502            0    205.1471     16.3189   <C:\Python24\Lib\site-packages\bzrlib\xml8.py>:217(write_inventory)
>     +2610268            0     13.1339      8.5832   +<C:\Python24\Lib\site-packages\bzrlib\xml8.py>:94(_encode_and_escape)
> 
> So, what command can I run instead to measure how much ripping this code out hurts performance?

I would guess a major factor would be "which version of Elementtree" :)
since it isn't bundled with earlier pythons.

As near as I can tell, the main change is to switch:

            text = replace(text, "&", "&amp;")
            text = replace(text, "'", "&apos;") # FIXME: overkill
            text = replace(text, "\"", "&quot;")
            text = replace(text, "<", "&lt;")
            text = replace(text, ">", "&gt;")

to using:
escape_re = re.compile("[&'\"<>]")
escape_map = {
    "&":'&amp;',
    "'":"&apos;", # FIXME: overkill
    "\"":"&quot;",
    "<":"&lt;",
    ">":"&gt;",
    }
def _escape_replace(match, map=escape_map):
    return map[match.group()]
...
  text = escape_re.sub(_escape_replace, text)

As such, I think a valid benchmark would be:

a) Grab a large inventory content from a pre-2a format (1.9-rich-root,
   for example). This can be a single revision.
b) Time the difference between a single re.sub() and 5 calls to
   'string.replace'.

Anyway, as mentioned, this isn't a large perf issue for current formats,
so we probably can just revert it.

And your profiling shows... we worked around the ElementTree code
entirely in later revisions, so again, it is likely to not degrade
performance by applying your patch.

merge: approve

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkx747MACgkQJdeBCYSNAAPIwwCgykHabjXO7EuELNwHqurUY+Pc
vjUAoJ/5kqD10G1hjvTC0SRz418Kpi1I
=Sm4E
-----END PGP SIGNATURE-----

review: Approve
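
The benchmark John suggests can be sketched roughly as follows. This is only an illustration under stated assumptions: it targets modern Python 3 rather than the Python 2 that bzrlib ran on, it uses str.replace in place of the old string.replace function, and SAMPLE is a made-up stand-in for real pre-2a inventory content. The names escape_with_sub, escape_with_replace and compare are hypothetical and do not come from bzrlib.

```python
import re
import timeit

# Same character class and entity mapping as the escape_map quoted above.
ESCAPE_RE = re.compile("[&'\"<>]")
ESCAPE_MAP = {
    "&": "&amp;",
    "'": "&apos;",
    '"': "&quot;",
    "<": "&lt;",
    ">": "&gt;",
}


def escape_with_sub(text):
    """Escape in one pass with a single re.sub() and a lookup callback."""
    return ESCAPE_RE.sub(lambda match: ESCAPE_MAP[match.group()], text)


def escape_with_replace(text):
    """Escape with five chained replace calls; '&' has to be handled first."""
    text = text.replace("&", "&amp;")
    text = text.replace("'", "&apos;")
    text = text.replace('"', "&quot;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


def compare(text, number=200):
    """Return (re.sub seconds, replace seconds), each over `number` runs."""
    sub_time = timeit.timeit(lambda: escape_with_sub(text), number=number)
    replace_time = timeit.timeit(lambda: escape_with_replace(text), number=number)
    return sub_time, replace_time


# Assumption: a synthetic stand-in for real inventory XML, not actual bzr data.
SAMPLE = '<entry name="a & b" id=\'x-1\'>text</entry>\n' * 2000

if __name__ == "__main__":
    sub_time, replace_time = compare(SAMPLE)
    print("re.sub: %.4fs  str.replace: %.4fs" % (sub_time, replace_time))
```

Which variant wins is likely to depend on the interpreter version and on how many characters in the input actually need escaping, so the numbers only mean something when measured on representative inventory data.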

John A Meinel (jameinel) wrote on 2010-08-30:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Note also that when we performance tuned it, we may have been running
under lsprof, which would also penalize the extra function calls more
than real runtime would (IME).

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkx7/6EACgkQJdeBCYSNAAPrBQCgjwcTXwZnqbOdgs2iCB10qY8A
JnYAnjIFNiwViT9sdOAxodeZcVpV9TbC
=gdz3
-----END PGP SIGNATURE-----
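
One way to check the point about profiler overhead is to time the same escaping call with and without a profiler attached. The sketch below is an assumption-laden illustration, not what was run at the time: it uses Python 3's cProfile (which the Python documentation describes as based on lsprof) rather than bzr's own lsprof support, and the helper names timed and measure are hypothetical.

```python
import cProfile
import re
import time

# Same mapping as the removed escape_cdata_map.
ESCAPE_RE = re.compile("[&<>]")
ESCAPE_MAP = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def _replace(match):
    # Called once per matched character, so a profiler sees many calls.
    return ESCAPE_MAP[match.group()]


def escape_cdata(text):
    """Escape character data with a single re.sub() and a callback."""
    return ESCAPE_RE.sub(_replace, text)


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def measure(text):
    """Time escape_cdata without and with cProfile attached."""
    plain_result, plain_seconds = timed(escape_cdata, text)
    profiler = cProfile.Profile()
    profiled_result, profiled_seconds = timed(profiler.runcall, escape_cdata, text)
    return plain_result == profiled_result, plain_seconds, profiled_seconds


if __name__ == "__main__":
    same, plain_seconds, profiled_seconds = measure("<a>&</a>" * 100000)
    print("same output:", same)
    print("plain: %.4fs  profiled: %.4fs" % (plain_seconds, profiled_seconds))
```

If the profiled run is disproportionately slower for the callback-based version than for a version with fewer Python-level calls, that would be consistent with the original tuning having been skewed by the profiler.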

John A Meinel (jameinel) wrote on 2010-09-02:

sent to pqm by email

Preview Diff

=== modified file 'bzrlib/xml_serializer.py'
--- bzrlib/xml_serializer.py	2010-05-25 17:27:52 +0000
+++ bzrlib/xml_serializer.py	2010-08-29 18:47:43 +0000
@@ -22,6 +22,8 @@
 # importing this module is fairly slow because it has to load several
 # ElementTree bits
+import re
+
 from bzrlib.serializer import Serializer
 from bzrlib.trace import mutter
@@ -111,73 +113,6 @@
         return ElementTree().parse(f)
-# performance tuning for elementree's serialiser. This should be
-# sent upstream - RBC 20060523.
-# the functions here are patched into elementtree at runtime.
-import re
-escape_re = re.compile("[&'\"<>]")
-escape_map = {
-    "&":'&amp;',
-    "'":"&apos;", # FIXME: overkill
-    "\"":"&quot;",
-    "<":"&lt;",
-    ">":"&gt;",
-    }
-def _escape_replace(match, map=escape_map):
-    return map[match.group()]
-
-def _escape_attrib(text, encoding=None, replace=None):
-    # escape attribute value
-    try:
-        if encoding:
-            try:
-                text = elementtree.ElementTree._encode(text, encoding)
-            except UnicodeError:
-                return elementtree.ElementTree._encode_entity(text)
-        if replace is None:
-            return escape_re.sub(_escape_replace, text)
-        else:
-            text = replace(text, "&", "&amp;")
-            text = replace(text, "'", "&apos;") # FIXME: overkill
-            text = replace(text, "\"", "&quot;")
-            text = replace(text, "<", "&lt;")
-            text = replace(text, ">", "&gt;")
-            return text
-    except (TypeError, AttributeError):
-        elementtree.ElementTree._raise_serialization_error(text)
-
-elementtree.ElementTree._escape_attrib = _escape_attrib
-
-escape_cdata_re = re.compile("[&<>]")
-escape_cdata_map = {
-    "&":'&amp;',
-    "<":"&lt;",
-    ">":"&gt;",
-    }
-def _escape_cdata_replace(match, map=escape_cdata_map):
-    return map[match.group()]
-
-def _escape_cdata(text, encoding=None, replace=None):
-    # escape character data
-    try:
-        if encoding:
-            try:
-                text = elementtree.ElementTree._encode(text, encoding)
-            except UnicodeError:
-                return elementtree.ElementTree._encode_entity(text)
-        if replace is None:
-            return escape_cdata_re.sub(_escape_cdata_replace, text)
-        else:
-            text = replace(text, "&", "&amp;")
-            text = replace(text, "<", "&lt;")
-            text = replace(text, ">", "&gt;")
-            return text
-    except (TypeError, AttributeError):
-        elementtree.ElementTree._raise_serialization_error(text)
-
-elementtree.ElementTree._escape_cdata = _escape_cdata
-
-
 def escape_invalid_chars(message):
     """Escape the XML-invalid characters in a commit message.