Ticket #12 (closed enhancement: fixed)

Opened 4 years ago

Last modified 3 years ago

PyYAML is slow

Reported by: edemaine@mit.edu Assigned to: xi
Priority: normal Component: pyyaml
Severity: normal Keywords:
Cc:

Description

Here are two simple wall-clock timings comparing PyYAML to PySyck on a Pentium 4 2.8GHz with 1MB cache and 1GB RAM:

$ wc file1.yaml
 2036  8767 59154 file1
$ test.py file1.yaml
0:00:00.001419 to read the YAML via Syck
0:00:04.029627 to read the YAML via PyYAML
$ wc file2.yaml
  8949  35105 317342 file2
$ test.py file2.yaml
0:00:00.001564 to read the YAML via Syck
0:00:19.288912 to read the YAML via PyYAML

I do not expect PyYAML to be terribly competitive with Syck: the language barrier is big, and PyYAML is written with a higher level of abstraction. But I was surprised to see a factor of 12,000 difference. I wonder if a bit of profiling and tuning might reduce this gap to just a couple of orders of magnitude (100x) instead of four? Personally, 19 seconds to read a 0.3 meg file is too slow for my application, so I'll have to switch back to Syck for now, unfortunately. Just food for thought...

Attachments

test.py (340 bytes) - added by edemaine@mit.edu on 05/08/06 17:45:32.
A simple Syck vs. PyYAML driver
CSAIL.yaml (246.5 kB) - added by edemaine@mit.edu on 05/08/06 17:46:44.
A large YAML file (slightly culled to fit on Trac)
test.2.py (0.8 kB) - added by edemaine@mit.edu on 08/30/06 16:48:01.
New performance test script
test.3.py (0.8 kB) - added by edemaine@mit.edu on 08/30/06 17:29:56.
Corrected test script

Change History

05/08/06 15:10:59 changed by xi

  • status changed from new to assigned.

A gap is expected for C vs Python, but I'm also surprised by the size of the difference. I usually get about a 200x difference on simple tests. You may attach your files and the script so I can check them.

You may try psyco; it might give you about a 1.5-5.0x speedup:

>>> from yaml.reader import Reader
>>> from yaml.scanner import Scanner
>>> from yaml.parser import Parser
>>> from yaml.composer import Composer
>>> from yaml.constructor import Constructor
>>> from psyco import bind
>>> bind(Reader)
>>> bind(Scanner)
>>> bind(Parser)
>>> bind(Composer)
>>> bind(Constructor)

The real solution is, of course, to rewrite the code in C. It's planned, but don't expect it too soon.

05/08/06 17:44:43 changed by edemaine@mit.edu

OK, here is a sample file on the larger side (8961 lines, 301,229 bytes), and a simple driver script generating output similar to the last example above.

05/08/06 17:45:32 changed by edemaine@mit.edu

  • attachment test.py added.

A simple Syck vs. PyYAML driver

05/08/06 17:46:44 changed by edemaine@mit.edu

  • attachment CSAIL.yaml added.

A large YAML file (slightly culled to fit on Trac)

05/25/06 05:06:46 changed by xi

Sorry for the trac spam :(. I'll try to deal with it somehow.

On the bright side, I've started the LibYAML project, which will eventually allow this bug to be closed. :)

08/13/06 09:51:45 changed by xi

  • status changed from assigned to closed.
  • resolution set to fixed.

The libyaml bindings are now usable (though not as fast as possible).

08/30/06 16:47:12 changed by edemaine@mit.edu

I finally got to try the LibYAML bindings of PyYAML. In case you're curious, here is a repeat of the simple test from before. The improvement so far is about a factor of 10 (without Psyco), but there are still 3 more orders of magnitude to go to get down to Syck speed.

$ python test.py CSAIL.ycard
0:00:00.001437 to read the YAML via Syck
0:00:13.661756 to read the YAML via PyYAML
0:00:01.181506 to read the YAML via PyYAML/LibYAML

08/30/06 16:48:01 changed by edemaine@mit.edu

  • attachment test.2.py added.

New performance test script

08/30/06 17:20:05 changed by xi

There is a problem in your test code in the line:

  cards = syck.load_documents (open (sys.argv[1]))

The function load_documents is a generator, so it does not really load the documents. You should replace it with

  for card in syck.load_documents (open (sys.argv[1])):
      pass
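To illustrate why the original timing was misleading, here is a minimal sketch using only the standard library. The `load_documents` function below is a hypothetical stand-in for a lazy loader (it is not the PySyck function), and the counter exists only to show when parsing really happens:

```python
import time

parsed_count = 0  # how many "documents" have actually been parsed so far


def load_documents(lines):
    """Hypothetical lazy loader: parses one 'document' per line, on demand."""
    global parsed_count
    for line in lines:
        parsed_count += 1
        yield line.strip().split(": ", 1)


lines = ["name: a\n", "name: b\n", "name: c\n"]

# Calling a generator function only creates the generator object.
start = time.perf_counter()
cards = load_documents(lines)
lazy_elapsed = time.perf_counter() - start
count_after_call = parsed_count  # nothing has been parsed yet

# Iterating over the generator is what forces the parsing work.
start = time.perf_counter()
for card in cards:
    pass
full_elapsed = time.perf_counter() - start
count_after_loop = parsed_count  # every line has now been parsed

print("parsed after call:", count_after_call, "in", lazy_elapsed, "s")
print("parsed after loop:", count_after_loop, "in", full_elapsed, "s")
```

Timing only the call measures the cost of creating the generator, not the cost of loading, so a benchmark has to consume the result before stopping the clock.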

Please post the updated benchmarks :) PyYAML/LibYAML is 2-3 times slower than PySyck, probably because of Pyrex and PyYAML code overhead. I'm going to reduce overhead by replacing all Pyrex and some Python code with pure C.

You may also run

  yaml.CLoader (open (sys.argv[1])).raw_parse()

to check pure LibYAML performance.

08/30/06 17:29:38 changed by edemaine@mit.edu

Whoops, you are right! Sorry about that. Now they are within a factor of 2 as you state (I am actually using PySyck):

$ python test.py CSAIL.ycard
0:00:00.643884 to read the YAML via Syck
0:00:13.676710 to read the YAML via PyYAML
0:00:01.201301 to read the YAML via PyYAML/LibYAML

Nice work! Looking forward to even more optimizations.
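As a quick check of the ratios in the corrected run, here is a small sketch that parses the three timings above and compares them. It assumes the timings are in the `H:MM:SS.ffffff` form that Python prints for a `timedelta`; `parse_elapsed` is a helper name made up for this example:

```python
from datetime import timedelta


def parse_elapsed(text):
    """Convert an 'H:MM:SS.ffffff' string into a number of seconds."""
    hours, minutes, seconds = text.split(":")
    delta = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))
    return delta.total_seconds()


# Timings copied from the corrected run above.
timings = {
    "Syck": "0:00:00.643884",
    "PyYAML": "0:00:13.676710",
    "PyYAML/LibYAML": "0:00:01.201301",
}
seconds = {name: parse_elapsed(text) for name, text in timings.items()}

# How much slower PyYAML/LibYAML is than Syck, and PyYAML than PyYAML/LibYAML.
libyaml_vs_syck = seconds["PyYAML/LibYAML"] / seconds["Syck"]
pyyaml_vs_libyaml = seconds["PyYAML"] / seconds["PyYAML/LibYAML"]

print("PyYAML/LibYAML vs Syck: %.2fx" % libyaml_vs_syck)
print("PyYAML vs PyYAML/LibYAML: %.2fx" % pyyaml_vs_libyaml)
```

On these numbers PyYAML/LibYAML comes out a little under 2x slower than Syck, and pure-Python PyYAML a little over 11x slower than PyYAML/LibYAML.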

08/30/06 17:29:56 changed by edemaine@mit.edu

  • attachment test.3.py added.

Corrected test script

