From planrichi at gmail.com Wed Mar 2 08:27:56 2016 From: planrichi at gmail.com (Richard Plangger) Date: Wed, 2 Mar 2016 14:27:56 +0100 Subject: [pypy-dev] GSoC 2016 Message-ID: <56D6EA5C.4050408@gmail.com> Hi, I was wondering who applied as a sub org to python last year? The registration for new sub orgs is open until March 7th. (https://wiki.python.org/moin/SummerOfCode/2016#Sub-orgs) As we discussed on the sprint I will try to attract some students tomorrow at the university in Vienna. Of course I'm also willing to mentor if there is a good proposal. Cheers, Richard -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: OpenPGP digital signature URL: From fijall at gmail.com Wed Mar 2 09:09:00 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Wed, 2 Mar 2016 15:09:00 +0100 Subject: [pypy-dev] GSoC 2016 In-Reply-To: <56D6EA5C.4050408@gmail.com> References: <56D6EA5C.4050408@gmail.com> Message-ID: Hi Richard As discussed on the sprint I applied (but have yet to receive a confirmation) On Wed, Mar 2, 2016 at 2:27 PM, Richard Plangger wrote: > Hi, > > I was wondering who applied as a sub org to python last year? > The registration for new sub orgs is open until March 7th. > (https://wiki.python.org/moin/SummerOfCode/2016#Sub-orgs) > > As we discussed on the sprint I will try to attract some students > tomorrow at the university in Vienna. Of course I'm also willing to > mentor if there is a good proposal. > > Cheers, > Richard > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From edd at theunixzoo.co.uk Thu Mar 3 12:16:10 2016 From: edd at theunixzoo.co.uk (Edd Barrett) Date: Thu, 3 Mar 2016 17:16:10 +0000 Subject: [pypy-dev] CFP: ICOOOLPS'16: Workshop on Implementation, Compilation, Optimization of OO Languages, Programs and Systems Message-ID: <20160303171610.GG14101@wilfred.home> May be of interest to some of the members on this list: Call for Papers: ICOOOLPS?16 ============================ 11th Workshop on Implementation, Compilation, Optimization of OO Languages, Programs and Systems Co-located with ECOOP July 18, 2016, Rome, Italy URL: http://2016.ecoop.org/track/ICOOOLPS-2016 Twitter: @ICOOOLPS The ICOOOLPS workshop series brings together researchers and practitioners working in the field of language implementation and optimization. The goal of the workshop is to discuss emerging problems and research directions as well as new solutions to classic performance challenges. The topics of interest for the workshop include techniques for the implementation and optimization of a wide range of languages including but not limited to object-oriented ones. Furthermore, meta-compilation techniques or language-agnostic approaches are welcome, too. A non-exclusive list of topics follows: - implementation and optimization of fundamental languages features (from automatic memory management to zero-overhead metaprogramming) - runtime systems technology (libraries, virtual machines) - static, adaptive, and speculative optimizations and compiler techniques - meta-compilation techniques and language-agnostic approaches for the efficient implementation of languages - compilers (intermediate representations, offline and online optimizations,...) 
- empirical studies on language usage, benchmark design, and benchmarking methodology - resource-sensitive systems (real-time, low power, mobile, cloud) - studies on design choices and tradeoffs (dynamic vs. static compilation, heuristics vs. programmer input,...) - tooling support, debuggability and observability of languages as well as their implementations ### Workshop Format and Submissions This workshop welcomes the presentation and discussion of new ideas and emerging problems that give a chance for interaction and exchange. More mature work is welcome as part of a mini-conference format, too. We aim to interleave interactive brainstorming and demonstration sessions between the formal presentations to foster an active exchange of ideas. The workshop papers will be published either in the ACM DL or in the Dagstuhl LIPIcs ECOOP Workshop proceedings. Until further notice, please use the ACM SIGPLAN template with a 10pt font size: http://www.sigplan.org/Resources/Author/ - position and work-in-progress paper: 1-4 pages - technical paper: max. 10 pages - demos and posters: 1-page abstract For the submission, please use the HotCRP system: http://ssw.jku.at/icooolps/ ### Important Dates - abstract submission: April 11, 2016 - paper submission: April 15, 2016 - notification: May 13, 2016 - all deadlines: Anywhere on Earth (AoE), i.e., GMT/UTC?12:00 hour - workshop: July 18th, 2016 ### Program Committee Edd Barrett, King?s College London, UK Clement Bera, Inria Lille, France Maxime Chevalier-Boisvert, Universit? de Montr?al, Canada Tim Felgentreff, Hasso Plattner Institute, Germany Roland Ducournau, LIRMM, Universit? de Montpellier, France Elisa Gonzalez Boix, Vrije Universiteit Brussel, Belgium David Gregg, Trinity College Dublin, Ireland Matthias Grimmer, Johannes Kepler University Linz, Austria Michael Haupt, Oracle, Germany Richard Jones, University of Kent, UK Tomas Kalibera, Northeastern University, USA Hidehiko Masuhara, Tokyo Institute of Technology, Japan Tiark Rompf, Purdue University, USA Jennifer B. Sartor, Ghent University, Belgium Sam Tobin-Hochstadt, Indiana University, USA ### Workshop Organizers Stefan Marr, Johannes Kepler University Linz, Austria Eric Jul, University of Oslo, Norway For questions or concerns, please mail to stefan.marr at jku.at or contact us via https://twitter.com/icooolps. -- Best Regards Edd Barrett http://www.theunixzoo.co.uk From piotr.jerzy.jurkiewicz at gmail.com Fri Mar 4 19:48:41 2016 From: piotr.jerzy.jurkiewicz at gmail.com (Piotr Jurkiewicz) Date: Sat, 5 Mar 2016 01:48:41 +0100 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage Message-ID: <56DA2CE9.5070409@gmail.com> Hi PyPy devs, my name is Piotr Jurkiewicz and I am a first-year PhD student at the AGH University of Science and Technology, Krak?w, Poland. I am writing this email to make sure that PyPy is going to participate in GSoC 2016, since I am interested in one of the proposed projects: Optimized Unicode Representation Below is a list of my ideas and plan for the project. (I use Python 2 nomenclature, that is unicode strings are `unicode` objects and bytes strings are `str` objects.) 1. Store all unicode objects contents internally as UTF-8. This would reduce size of stored contents and allow external libraries, which expect UTF-8, to process contents directly in the memory (for example using various regexp libraries to search unicode string). 2. Unify interning caches for str and unicode. 
This would allow unicode objects and corresponding utf8-encoded-str objects to share the same interned buffer. For example, the unicode object u'koń' would share its interned buffer with the str 'ko\xc5\x84'.

This would make unicode.encode('utf-8') basically a no-op. As UTF-8 becomes the dominant encoding for any data exchange, including the web (86%) [1], more and more data coming out of Python scripts needs to be UTF-8 encoded. Therefore, it is important to make this operation as cheap as possible.

It would speed up str.decode('utf-8') significantly too, although it wouldn't make it a no-op: the string would still need to be checked for UTF-8 validity when transforming it into a unicode object. But we can get rid of the additional allocation, of copying the string contents, and of storing it twice, in CONST_STR_CACHE and CONST_UNICODE_CACHE.

3. Indexing of codepoint positions, which would allow O(1) random access and slicing.

The idea is simple: alongside the contents of each interned unicode object, store an array of unsigned integers. These integers are the positions (in bytes), counting from the beginning of the buffer, at which each successive 64-codepoint-long 'page' starts.

Random access would be as follows:

    page_num, byte_in_page = divmod(codepoint_pos, 64)
    page_start_byte = index[page_num]
    exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
    return buffer[exact_byte]

Using 64-codepoint-long pages, as in the example above, would allow O(1) random access, with constant terms of:

- one cache access in the case of ASCII-only texts (indexes for such unicode objects will not be created and maintained)
- three cache accesses in the case of texts consisting of ASCII mixed with two-byte characters (Latin, Greek, Cyrillic, Hebrew, Arabic alphabets)
- four or five cache accesses in the case of texts consisting mostly of three- and four-byte characters

(all of the above assuming 64-byte CPU cache lines)

The memory overhead associated with storing the index array would be in the range 0 - 6.25% (or 0 - 12.5% if unicode objects longer than 2^32 codepoints are allowed), assuming that the index array consists of integers of the smallest type which can store buffer_bytes_len - 1.

4. Fast codepoint counting/seeking with a branchless algorithm [2].

When a unicode object is interned, we are sure that it is a correct UTF-8 string. Therefore, there is no need for correctness checking when seeking, so a branchless algorithm can be used.

[1]: http://w3techs.com/technologies/details/en-utf8/all/all
[2]: http://blogs.perl.org/users/nick_wellnhofer/2015/04/branchless-utf-8-length.html

All of these changes can be introduced one at a time, which would make it easier to track performance changes and to debug eventual errors.

After completing the project I plan to write a paper describing this indexing-based method for speeding up random access to unicode strings, as it has the potential to be used in other language interpreters which have immutable and/or interned unicode strings. Note that a similar index can be created for graphemes as well, so the method can also be used in languages which provide a grapheme-based interface (like Perl 6).

Please share your thoughts about these ideas.
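To make point 3 above concrete, here is a minimal pure-Python sketch of the proposed page index. This is an illustration only, assuming a 64-codepoint page size; build_index and codepoint_at are made-up names for this sketch, not an existing PyPy API, and an RPython version inside the interpreter would look different.

    # Illustrative model of the proposed UTF-8 page index -- not PyPy code.
    PAGE = 64

    def build_index(buf):
        # buf is a UTF-8 encoded byte string; record the byte offset at which
        # every PAGE-th codepoint starts (i.e. the start of each page).
        index = []
        count = 0
        for pos, byte in enumerate(bytearray(buf)):
            if byte & 0xC0 != 0x80:          # a lead byte, not a continuation
                if count % PAGE == 0:
                    index.append(pos)
                count += 1
        return index

    def codepoint_at(buf, index, n):
        # O(1) page lookup followed by a forward walk of at most PAGE - 1
        # codepoints inside the page.
        data = bytearray(buf)
        page, left = divmod(n, PAGE)
        pos = index[page]
        while left:
            pos += 1
            if data[pos] & 0xC0 != 0x80:     # stepped onto the next lead byte
                left -= 1
        end = pos + 1
        while end < len(data) and data[end] & 0xC0 == 0x80:
            end += 1
        return buf[pos:end].decode('utf-8')

    text = u'ko\u0144 abc \u017c\u00f3\u0142w ' * 40
    utf8 = text.encode('utf-8')
    idx = build_index(utf8)
    assert codepoint_at(utf8, idx, 100) == text[100]

The same lead-byte test (byte & 0xC0 != 0x80) is what the branchless counting of point 4 builds on; the per-page walk is the part that could be replaced by packed sub-offsets or SIMD-style summing, as discussed later in this thread.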
Cheers,
Piotr

From arigo at tunes.org Sat Mar 5 03:09:59 2016
From: arigo at tunes.org (Armin Rigo)
Date: Sat, 5 Mar 2016 09:09:59 +0100
Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
In-Reply-To: <56DA2CE9.5070409@gmail.com>
References: <56DA2CE9.5070409@gmail.com>
Message-ID:

Hi Piotr,

Thanks for giving some serious thoughts to the utf8-stored unicode string proposal!

On 5 March 2016 at 01:48, Piotr Jurkiewicz wrote:
> Random access would be as follows:
>
>     page_num, byte_in_page = divmod(codepoint_pos, 64)
>     page_start_byte = index[page_num]
>     exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
>     return buffer[exact_byte]

This is the part I'm least sure about: seek_forward() needs to be a loop over 0 to 63 codepoints. True, each loop can be branchless, and very short---let's say 4 instructions. But it still makes a total of up to 252 instructions (plus the checks to know if we must go on). These instructions are all or almost all dependent on the previous one: you must have finished computing the length of one sequence to even begin computing the length of the next one. Maybe it's faster to use a more "XMM-izable" algorithm which counts 0 for each byte in 0x80-0xBF and 1 otherwise, and makes the sum.

There are also variants, e.g. adding a second array of words similar to 'index', but where each word is 8 packed bytes giving 8 starting points inside the page (each in range 0-252). This would reduce the walk to 0-7 codepoints.

I'm +1 on your proposal. The whole thing is definitely worth a try.

A bientôt,

Armin.

From matti.picus at gmail.com Sat Mar 5 16:17:47 2016
From: matti.picus at gmail.com (Matti Picus)
Date: Sat, 5 Mar 2016 23:17:47 +0200
Subject: [pypy-dev] Release 5.0.0
Message-ID: <56DB4CFB.6020007@gmail.com>

Pre-release bundles are up on the buildbot, http://buildbot.pypy.org/nightly/release-5.x please test them out. There are still a few last touches pending, but it would be nice to have some preliminary indication whether the bundles work in real-life workloads and whether bugs that we claim to have fixed since 4.0.1 actually do not reappear. Also the release notice is up at https://bitbucket.org/pypy/pypy/src/default/pypy/doc/release-5.0.0.rst Any help with it would be appreciated
Matti

From yury at shurup.com Sat Mar 5 16:33:31 2016
From: yury at shurup.com (Yury V. Zaytsev)
Date: Sat, 5 Mar 2016 22:33:31 +0100 (CET)
Subject: [pypy-dev] Release 5.0.0
In-Reply-To: <56DB4CFB.6020007@gmail.com>
References: <56DB4CFB.6020007@gmail.com>
Message-ID:

On Sat, 5 Mar 2016, Matti Picus wrote:
> Pre-release bundles are up on the buildbot,
> http://buildbot.pypy.org/nightly/release-5.x please test them out.

Hi Matti,

So did you figure out the mysterious memory consumption issues that we have experienced while trying to upgrade the Windows builder to a more recent version of PyPy? Do you think it would make sense to retry the upgrade after PyPy 5.0.0 is out?

--
Sincerely yours,
Yury V. Zaytsev

From tinchester at gmail.com Sat Mar 5 17:20:49 2016
From: tinchester at gmail.com (Tin Tvrtković)
Date: Sat, 5 Mar 2016 23:20:49 +0100
Subject: [pypy-dev] Making Pyrasite work with PyPy
Message-ID: <56DB5BC1.90601@gmail.com>

Hello, in case you haven't heard of it, Pyrasite (https://github.com/lmacken/pyrasite) is a tool for injecting code into running Python processes. Personally I have found it invaluable for forensics on services running in production and have successfully solved memory leaks, connection leaks and deadlocks with it.
One of the payloads provided will open a remote REPL right in a running process, without the process having *any* preparation logic in it. I think this is extremely powerful and makes Python catch and up even surpass Java (which has automatic stack trace dumping on SIGQUIT and useful tools like JConsole and VisualVM that can connect to running processes, again by default with no setup in the process) for these kinds of things. Anyway, Pyrasite uses gdb under the hood; gdb will attach to a running process and inject the following: gdb_cmds = [ 'PyGILState_Ensure()', 'PyRun_SimpleString("' 'import sys; sys.path.insert(0, \\"%s\\"); ' 'sys.path.insert(0, \\"%s\\"); ' 'exec(open(\\"%s\\").read())")' % (os.path.dirname(filename), os.path.abspath(os.path.join(os.path.dirname(__file__), '..')), filename), 'PyGILState_Release($1)', ] If I change the Py* functions to PyPy* (PyRun_SimpleString to PyPyRun_SimpleString), this seems to work just fine on PyPy too. This is great, and now I'd like to contribute back to Pyrasite and get PyPy support in there. It'd be great if Pyrasite could automatically detect if the underlying process is CPython or PyPy, so since my experience working on the C level is very basic, I'm asking you, the PyPy devs, if there's a good way of detecting a process is PyPy given its PID and gdb's ability of attaching to a process and doing gdb things. Worst case scenario, gdb supports "info functions", which is how I found the PyPy functions in the first place, but is there a better way? I apologize if this is off-topic for PyPy-dev. From fijall at gmail.com Sun Mar 6 02:03:32 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Sun, 6 Mar 2016 09:03:32 +0200 Subject: [pypy-dev] Making Pyrasite work with PyPy In-Reply-To: <56DB5BC1.90601@gmail.com> References: <56DB5BC1.90601@gmail.com> Message-ID: Hi Tin This is very much on topic for pypy-dev. One obvious solution would be to check for the existance of symbols in gdb (if there is a symbol called PyPyRun_SimpleString, then obviously you're running on PyPy). I'm not sure how to express it under gdb, but there must be a way On Sun, Mar 6, 2016 at 12:20 AM, Tin Tvrtkovi? wrote: > Hello, > > in case you haven't heard of it, Pyrasite > (https://github.com/lmacken/pyrasite) is a tool for injecting code into > running Python processes. Personally I have found it invaluable for > forensics on services running in production and have successfully solved > memory leaks, connection leaks and deadlocks with it. One of the > payloads provided will open a remote REPL right in a running process, > without the process having *any* preparation logic in it. I think this > is extremely powerful and makes Python catch and up even surpass Java > (which has automatic stack trace dumping on SIGQUIT and useful tools > like JConsole and VisualVM that can connect to running processes, again > by default with no setup in the process) for these kinds of things. > > Anyway, Pyrasite uses gdb under the hood; gdb will attach to a running > process and inject the following: > > gdb_cmds = [ > 'PyGILState_Ensure()', > 'PyRun_SimpleString("' > 'import sys; sys.path.insert(0, \\"%s\\"); ' > 'sys.path.insert(0, \\"%s\\"); ' > 'exec(open(\\"%s\\").read())")' % > (os.path.dirname(filename), > os.path.abspath(os.path.join(os.path.dirname(__file__), > '..')), > filename), > 'PyGILState_Release($1)', > ] > > If I change the Py* functions to PyPy* (PyRun_SimpleString to > PyPyRun_SimpleString), this seems to work just fine on PyPy too. 
> > This is great, and now I'd like to contribute back to Pyrasite and get > PyPy support in there. It'd be great if Pyrasite could automatically > detect if the underlying process is CPython or PyPy, so since my > experience working on the C level is very basic, I'm asking you, the > PyPy devs, if there's a good way of detecting a process is PyPy given > its PID and gdb's ability of attaching to a process and doing gdb > things. Worst case scenario, gdb supports "info functions", which is how > I found the PyPy functions in the first place, but is there a better way? > > I apologize if this is off-topic for PyPy-dev. > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From yury at shurup.com Sun Mar 6 05:04:10 2016 From: yury at shurup.com (Yury V. Zaytsev) Date: Sun, 6 Mar 2016 11:04:10 +0100 (CET) Subject: [pypy-dev] Release 5.0.0 In-Reply-To: References: <56DB4CFB.6020007@gmail.com> Message-ID: On Sat, 5 Mar 2016, Yury V. Zaytsev wrote: > On Sat, 5 Mar 2016, Matti Picus wrote: > > So did you figure out the mysterious memory consumption issues that we have > experienced while trying to upgrade the Windows builder to a more recent > version of PyPy? Do you think it would make sense to retry the upgrade after > PyPy 5.0.0 is out? So, it looks like with PyPy 5.0.0 the problem is exactly the same as with the previous version. The translation goes through (and possibily faster / uses less memory, I didn't check), but the compilation bails out with a `MemoryError` at `buffer.append(fh.read())`: http://buildbot.pypy.org/builders/pypy-c-jit-win-x86-32/builds/2266/steps/translate/logs/stdio That's definitively not my fault, I've done my `editbin /largeaddressaware` dance and confirmed its effects with `dumpbin /headers`. In the mean time, I rolled back to PyPy 2.5.1 on the build slave. Oh wait, I meant to say build follower. Sorry about this. -- Sincerely yours, Yury V. Zaytsev From matti.picus at gmail.com Sun Mar 6 16:01:30 2016 From: matti.picus at gmail.com (Matti Picus) Date: Sun, 6 Mar 2016 23:01:30 +0200 Subject: [pypy-dev] Release 5.0.0 In-Reply-To: References: <56DB4CFB.6020007@gmail.com> Message-ID: <56DC9AAA.1080003@gmail.com> An HTML attachment was scrubbed... URL: From fijall at gmail.com Sun Mar 6 16:18:23 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Sun, 6 Mar 2016 23:18:23 +0200 Subject: [pypy-dev] Release 5.0.0 In-Reply-To: <56DC9AAA.1080003@gmail.com> References: <56DB4CFB.6020007@gmail.com> <56DC9AAA.1080003@gmail.com> Message-ID: It uses subprocess, but you need to quit pypy (so run this with --source and then make separately) for memory to be reclaimed On Sun, Mar 6, 2016 at 11:01 PM, Matti Picus wrote: > > > On 06/03/16 12:04, Yury V. Zaytsev wrote: > > On Sat, 5 Mar 2016, Yury V. Zaytsev wrote: > > So, it looks like with PyPy 5.0.0 the problem is exactly the same as with > the previous version. The translation goes through (and possibily faster / > uses less memory, I didn't check), but the compilation bails out with a > `MemoryError` at `buffer.append(fh.read())`: > > http://buildbot.pypy.org/builders/pypy-c-jit-win-x86-32/builds/2266/steps/translate/logs/stdio > > In the mean time, I rolled back to PyPy 2.5.1 on the build slave. Oh wait, I > meant to say build follower. Sorry about this. > > I watched the compile part of translation in a system monitor on a local VM. 
> Using the pypy 5.0 release, during compilation there is a single pypy.exe
> process requiring about 2.8GB of memory. At some point, toward the end of
> compiling the 1000+ source files (perhaps during link?) memory consumption
> jumps way up, trying to access at least another GB of memory, at which point
> the virtual machine complains and the pypy.exe crashes. Any ideas? I thought
> the compile step uses multiprocessing to run in a separate process, but it
> seems I am wrong.
> Matti
>
> _______________________________________________
> pypy-dev mailing list
> pypy-dev at python.org
> https://mail.python.org/mailman/listinfo/pypy-dev
>

From hubo at jiedaibao.com Mon Mar 7 02:58:14 2016
From: hubo at jiedaibao.com (hubo)
Date: Mon, 07 Mar 2016 15:58:14 +0800
Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
In-Reply-To:
References: <56DA2CE9.5070409@gmail.com>
Message-ID: <56DD3493.8020800@jiedaibao.com>

I think it is not reasonable to use UTF-8 to represent the unicode string type.

1. Less storage - this is not always true. It is only true for strings with a lot of ASCII characters. In Asia, most strings in local languages (Japanese, Chinese, Korean) consist of non-ASCII characters, so they may consume more storage than in UTF-16. To make things worse, while it always consumes 2*N bytes for an N-character string in UTF-16, it is difficult to estimate the size of an N-character string in UTF-8 (it may be anywhere from N bytes to 3 * N bytes). (UTF-16 also has two-word characters, but len() reports 2 for these characters; I think it is not harmful to treat them as two characters.)

2. There would be very complicated logic for size calculation and slicing. For UTF-16, every character is represented with a 16-bit integer, so it is convenient for size calculation and slicing. But a character in UTF-8 occupies a variable number of bytes, so either we call mb_* string functions instead (which are slow by nature) or we use special logic like storing the indices of characters in another array (which introduces the cost of extra addressing).

3. When displaying with repr(), non-ASCII characters are displayed in \uXXXX format. If the internal storage for unicode is UTF-8, the only way to be compatible with this format is to convert it back to UTF-16.

It may be wiser to let programmers decide which encoding they would like to use. If they want to process UTF-8 strings without paying a conversion cost, they should use "bytes". When correct size calculation and slicing of non-ASCII characters are a concern, it may be better to use "unicode".

2016-03-07
hubo
From: Armin Rigo
Sent: 2016-03-05 16:09
Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
To: "Piotr Jurkiewicz"
Cc: "PyPy Developer Mailing List"

Hi Piotr,

Thanks for giving some serious thoughts to the utf8-stored unicode string proposal!

On 5 March 2016 at 01:48, Piotr Jurkiewicz wrote:
> Random access would be as follows:
>
>     page_num, byte_in_page = divmod(codepoint_pos, 64)
>     page_start_byte = index[page_num]
>     exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
>     return buffer[exact_byte]

This is the part I'm least sure about: seek_forward() needs to be a loop over 0 to 63 codepoints. True, each loop can be branchless, and very short---let's say 4 instructions. But it still makes a total of up to 252 instructions (plus the checks to know if we must go on).
These instructions are all or almost all dependent on the previous one: you must have finished computing the length of one sequence to even being computing the length of the next one. Maybe it's faster to use a more "XMM-izable" algorithm which counts 0 for each byte in 0x80-0xBF and 1 otherwise, and makes the sum. There are also variants, e.g. adding a second array of words similar to 'index', but where each word is 8 packed bytes giving 8 starting points inside the page (each in range 0-252). This would reduce the walk to 0-7 codepoints. I'm +1 on your proposal. The whole thing is definitely worth a try. A bient?t, Armin. _______________________________________________ pypy-dev mailing list pypy-dev at python.org https://mail.python.org/mailman/listinfo/pypy-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Mon Mar 7 03:46:23 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 7 Mar 2016 10:46:23 +0200 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: <56DD3493.8020800@jiedaibao.com> References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> Message-ID: Hi hubo. I think you're slightly confusing two things. UTF-16 is a variable-length encoding that has two-word characters that *has to* return "1" for len() of those. UCS-2 seems closer to what you described (which is a fixed-width encoding), but can't encode all the unicode characters and as such is unsuitable for a modern unicode representation. I'll discard UCS-2 as unsuitable and were we to use UTF-16, then the slicing and size calculations still has to be as complicated as for UTF-8. Complicated logic in repr() - those are not usually performance critical parts of your program and it's ok to have some complications there. It's true that UTF-16 can be less efficient than UTF-8 for certain languages, however both are more memory efficient than what we currently use (UCS4). There are however some problems - even if you work exclusively in, say, korean, for example web servers still have to deal with some parts that are ascii (html markup, css etc.) while handling text in korean. In those cases UTF8 vs UTF16 is more muddled and the exact details depend a lot. We also need to consider the fact that we ship one canonical PyPy to everybody - people using different languages and different encodings. Overall, UTF8 seems like definitely a better alternative than UCS4 (also for asian languages), which is what we are using now and I would be inclined to leave UTF16 as an option to see if it performs better for certain benchmarks. Best regards, Maciej Fijalkowski On Mon, Mar 7, 2016 at 9:58 AM, hubo wrote: > I think it is not reasonable to use UTF-8 to represent the unicode string > type. > > > 1. Less storage - this is not always true. It is only true for strings with > a lot of ASCII characters. In Asia, most strings in local languages > (Japanese, Chinese, Korean) are non-ASCII characters, they may consume more > storage than in UTF-16. To make things worse, while it always consumes 2*N > bytes for a N-characters string in UTF-16, it is difficult to estimate the > size of a N-characters string in UTF-8 (may be N bytes to 3 * N bytes) > (UTF-16 also has two-word characters, but len() reports 2 for these > characters, I think it is not harmful to treat them as two characters) > > 2. There would be very complicate logics for size calculating and slicing. 
> For UTF-16, every character is represented with a 16-bit integer, so it is > convient for size calculating and slicing. But character in UTF-8 consumes > variant bytes, so either we call mb_* string functions instead (which is > slow in nature) or we use special logic like storing indices of characters > in another array (which introduces cost for extra addressings). > > 3. When displaying with repr(), non-ASCII characters are displayed with > \uXXXX format. If the internal storage for unicode is UTF-8, the only way to > be compatible with this format is to convert it back to UTF-16. > > It may be wiser to let programmers deside which encoding they would like to > use. If they want to process UTF-8 strings without performance cost on > converting, they should use "bytes". When correct size calculating and > slicing of non-ASCII characters are concerned it may be better to use > "unicode". > > 2016-03-07 > ________________________________ > hubo > ________________________________ > > ????Armin Rigo > ?????2016-03-05 16:09 > ???Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage > ????"Piotr Jurkiewicz" > ???"PyPy Developer Mailing List" > > Hi Piotr, > > Thanks for giving some serious thoughts to the utf8-stored unicode > string proposal! > > On 5 March 2016 at 01:48, Piotr Jurkiewicz > wrote: >> Random access would be as follows: >> >> page_num, byte_in_page = divmod(codepoint_pos, 64) >> page_start_byte = index[page_num] >> exact_byte = seek_forward(buffer[page_start_byte], byte_in_page) >> return buffer[exact_byte] > > This is the part I'm least sure about: seek_forward() needs to be a > loop over 0 to 63 codepoints. True, each loop can be branchless, and > very short---let's say 4 instructions. But it still makes a total of > up to 252 instructions (plus the checks to know if we must go on). > These instructions are all or almost all dependent on the previous > one: you must have finished computing the length of one sequence to > even being computing the length of the next one. Maybe it's faster to > use a more "XMM-izable" algorithm which counts 0 for each byte in > 0x80-0xBF and 1 otherwise, and makes the sum. > > There are also variants, e.g. adding a second array of words similar > to 'index', but where each word is 8 packed bytes giving 8 starting > points inside the page (each in range 0-252). This would reduce the > walk to 0-7 codepoints. > > I'm +1 on your proposal. The whole thing is definitely worth a try. > > > A bient?t, > > Armin. > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From cfbolz at gmx.de Mon Mar 7 03:48:53 2016 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Mon, 7 Mar 2016 09:48:53 +0100 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: <56DD3493.8020800@jiedaibao.com> References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> Message-ID: <56DD4075.6090201@gmx.de> Hi, On 07/03/16 08:58, hubo wrote: > I think it is not reasonable to use UTF-8 to represent the unicode > string type. > 1. Less storage - this is not always true. It is only true for strings > with a lot of ASCII characters. 
In Asia, most strings in local languages > (Japanese, Chinese, Korean) are non-ASCII characters, they may consume > more storage than in UTF-16. To make things worse, while it always > consumes 2*N bytes for a N-characters string in UTF-16, it is difficult > to estimate the size of a N-characters string in UTF-8 (may be N bytes > to 3 * N bytes) > (UTF-16 also has two-word characters, but len() reports 2 for these > characters, I think it is not harmful to treat them as two characters) Note that in PyPy unicode strings use UTF-32 as the internal representation for all platforms, so the space saving would be larger. Note also that currently almost all I/O operations on many platforms do a conversion from UTF-8 to UTF-32 and back, which involves a copy and is costly. > 2. There would be very complicate logics for size calculating and > slicing. For UTF-16, every character is represented with a 16-bit > integer, so it is convient for size calculating and slicing. But > character in UTF-8 consumes variant bytes, so either we call mb_* string > functions instead (which is slow in nature) or we use special logic like > storing indices of characters in another array (which introduces cost > for extra addressings). This is true, some engineering would have to go into this part of the representation. > 3. When displaying with repr(), non-ASCII characters are displayed with > \uXXXX format. If the internal storage for unicode is UTF-8, the only > way to be compatible with this format is to convert it back to UTF-16. > It may be wiser to let programmers deside which encoding they would like > to use. If they want to process UTF-8 strings without performance cost > on converting, they should use "bytes". When correct size calculating > and slicing of non-ASCII characters are concerned it may be better to > use "unicode". I think repr is allowed to be a somewhat slow operation. Cheers, Carl Friedrich From yury at shurup.com Mon Mar 7 03:55:50 2016 From: yury at shurup.com (Yury V. Zaytsev) Date: Mon, 7 Mar 2016 09:55:50 +0100 (CET) Subject: [pypy-dev] Release 5.0.0 In-Reply-To: References: <56DB4CFB.6020007@gmail.com> <56DC9AAA.1080003@gmail.com> Message-ID: On Sun, 6 Mar 2016, Maciej Fijalkowski wrote: > It uses subprocess, but you need to quit pypy (so run this with --source > and then make separately) for memory to be reclaimed Do you think that pre-forking a process for compilation right at the beginning of the translation when PyPy hasn't consumed much memory yet would be a viable solution? I think if this is practical, it would be a much user friendlier solution as compared to two-step process (translation + compilation). If memory serves me well, this is one of the strategies that subprocess in Python 3 is using to improve on memory consumption. -- Sincerely yours, Yury V. Zaytsev From fijall at gmail.com Mon Mar 7 04:16:42 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 7 Mar 2016 11:16:42 +0200 Subject: [pypy-dev] Release 5.0.0 In-Reply-To: References: <56DB4CFB.6020007@gmail.com> <56DC9AAA.1080003@gmail.com> Message-ID: I have no idea how memory management works on windows (I doubt this will solve it), but this is how we do that on linux On Mon, Mar 7, 2016 at 10:55 AM, Yury V. 
Zaytsev wrote:
> On Sun, 6 Mar 2016, Maciej Fijalkowski wrote:
>
>> It uses subprocess, but you need to quit pypy (so run this with --source
>> and then make separately) for memory to be reclaimed
>
> Do you think that pre-forking a process for compilation right at the
> beginning of the translation when PyPy hasn't consumed much memory yet would
> be a viable solution?
>
> I think if this is practical, it would be a much user friendlier solution as
> compared to two-step process (translation + compilation). If memory serves
> me well, this is one of the strategies that subprocess in Python 3 is using
> to improve on memory consumption.
>
> --
> Sincerely yours,
> Yury V. Zaytsev

From hubo at jiedaibao.com Mon Mar 7 04:21:17 2016
From: hubo at jiedaibao.com (hubo)
Date: Mon, 07 Mar 2016 17:21:17 +0800
Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
In-Reply-To:
References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com>
Message-ID: <56DD480B.2070709@jiedaibao.com>

Yes, there are two-word characters in UTF-16, as I mentioned. But len() in CPython returns 2 for these characters (even if they are correctly processed in repr()):

>>> len(u'\ud805\udc09')
2
>>> u'\ud805\udc09'
u'\U00011409'

(Python 3.x seems to have removed the display processing)

Maybe it is better to be compatible with CPython in these situations. Since two-word characters are really rare in Unicode strings, programmers may not know of their existence and may allocate exactly 2 * len(s) bytes for storing a unicode string. It will crash the program or create security problems if len() returns 1 for these characters, even if that is the correct result according to the Unicode standard.

UTF-8 might be very useful in XML or Web processing, which is quite important in Python programming nowadays. But I think it is more important to let programmers "understand" the mechanism. In C/C++, it is quite common to use char[] for ASCII (or ANSI) characters and wchar_t for unicode (actually UTF-16, or UCS-2) characters, so it may be surprising if unicode is actually "UTF-8" in PyPy. Web programmers who use CPython may already be familiar with the differences between bytes (or str in Python2) and unicode (or str in Python3); it is less likely for them to design their programs around implementation details specific to PyPy.

2016-03-07
hubo
From: Maciej Fijalkowski
Sent: 2016-03-07 16:46
Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
To: "hubo"
Cc: "Armin Rigo","Piotr Jurkiewicz","PyPy Developer Mailing List"

Hi hubo.

I think you're slightly confusing two things.

UTF-16 is a variable-length encoding that has two-word characters that *has to* return "1" for len() of those. UCS-2 seems closer to what you described (which is a fixed-width encoding), but can't encode all the unicode characters and as such is unsuitable for a modern unicode representation.

I'll discard UCS-2 as unsuitable and were we to use UTF-16, then the slicing and size calculations still has to be as complicated as for UTF-8.

Complicated logic in repr() - those are not usually performance critical parts of your program and it's ok to have some complications there.

It's true that UTF-16 can be less efficient than UTF-8 for certain languages, however both are more memory efficient than what we currently use (UCS4). There are however some problems - even if you work exclusively in, say, korean, for example web servers still have to deal with some parts that are ascii (html markup, css etc.) while handling text in korean.
In those cases UTF8 vs UTF16 is more muddled and the exact details depend a lot. We also need to consider the fact that we ship one canonical PyPy to everybody - people using different languages and different encodings. Overall, UTF8 seems like definitely a better alternative than UCS4 (also for asian languages), which is what we are using now and I would be inclined to leave UTF16 as an option to see if it performs better for certain benchmarks. Best regards, Maciej Fijalkowski On Mon, Mar 7, 2016 at 9:58 AM, hubo wrote: > I think it is not reasonable to use UTF-8 to represent the unicode string > type. > > > 1. Less storage - this is not always true. It is only true for strings with > a lot of ASCII characters. In Asia, most strings in local languages > (Japanese, Chinese, Korean) are non-ASCII characters, they may consume more > storage than in UTF-16. To make things worse, while it always consumes 2*N > bytes for a N-characters string in UTF-16, it is difficult to estimate the > size of a N-characters string in UTF-8 (may be N bytes to 3 * N bytes) > (UTF-16 also has two-word characters, but len() reports 2 for these > characters, I think it is not harmful to treat them as two characters) > > 2. There would be very complicate logics for size calculating and slicing. > For UTF-16, every character is represented with a 16-bit integer, so it is > convient for size calculating and slicing. But character in UTF-8 consumes > variant bytes, so either we call mb_* string functions instead (which is > slow in nature) or we use special logic like storing indices of characters > in another array (which introduces cost for extra addressings). > > 3. When displaying with repr(), non-ASCII characters are displayed with > \uXXXX format. If the internal storage for unicode is UTF-8, the only way to > be compatible with this format is to convert it back to UTF-16. > > It may be wiser to let programmers deside which encoding they would like to > use. If they want to process UTF-8 strings without performance cost on > converting, they should use "bytes". When correct size calculating and > slicing of non-ASCII characters are concerned it may be better to use > "unicode". > > 2016-03-07 > ________________________________ > hubo > ________________________________ > > ????Armin Rigo > ?????2016-03-05 16:09 > ???Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage > ????"Piotr Jurkiewicz" > ???"PyPy Developer Mailing List" > > Hi Piotr, > > Thanks for giving some serious thoughts to the utf8-stored unicode > string proposal! > > On 5 March 2016 at 01:48, Piotr Jurkiewicz > wrote: >> Random access would be as follows: >> >> page_num, byte_in_page = divmod(codepoint_pos, 64) >> page_start_byte = index[page_num] >> exact_byte = seek_forward(buffer[page_start_byte], byte_in_page) >> return buffer[exact_byte] > > This is the part I'm least sure about: seek_forward() needs to be a > loop over 0 to 63 codepoints. True, each loop can be branchless, and > very short---let's say 4 instructions. But it still makes a total of > up to 252 instructions (plus the checks to know if we must go on). > These instructions are all or almost all dependent on the previous > one: you must have finished computing the length of one sequence to > even being computing the length of the next one. Maybe it's faster to > use a more "XMM-izable" algorithm which counts 0 for each byte in > 0x80-0xBF and 1 otherwise, and makes the sum. > > There are also variants, e.g. 
adding a second array of words similar > to 'index', but where each word is 8 packed bytes giving 8 starting > points inside the page (each in range 0-252). This would reduce the > walk to 0-7 codepoints. > > I'm +1 on your proposal. The whole thing is definitely worth a try. > > > A bient?t, > > Armin. > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Mon Mar 7 04:31:10 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 7 Mar 2016 11:31:10 +0200 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: <56DD480B.2070709@jiedaibao.com> References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> <56DD480B.2070709@jiedaibao.com> Message-ID: I think you're misunderstanding what we're proposing. We're proposing utf8 representation completely hidden from the user, where everything behaves just like cpython unicode (the len() example you're showing is a narrow unicode build I presume?) On Mon, Mar 7, 2016 at 11:21 AM, hubo wrote: > Yes, there are two-words characters in UTF-16, as I mentioned. But len() in > CPython returns 2 for these characters (even if they are correctly processed > in repr()): > >>>> len(u'\ud805\udc09') > 2 >>>> u'\ud805\udc09' > u'\U00011409' > > (Python 3.x seems to have removed the display processing) > > Maybe it is better to be compatible with CPython in these situations. Since > two-words characters are really rare in Unicode strings, programmers may not > know their existence and allocate exactly 2 * len(s) bytes for storing an > unicode string. It will crash the program or create security problems if > len() return 1 for these characters even if it is the correct result > according to Unicode standard. > > UTF-8 might be very useful in XML or Web processing, which is quite > important in Python programming nowadays. But I think it is more important > to let programmers "understand" the machanism. In C/C++, it is quite common > to use char[] for ASCII (or ANSI) characters and wchar_t for unicode > (actually UTF-16, or UCS-2) characters, so it may be suprising if unicode is > actually "UTF-8" in PyPy. Web programmers who uses CPython may already be > familiar with the differences between bytes (or str in Python2) and unicode > (or str in Python3), it is less likely for them to design their programs > based on special implementations of PyPy. > > 2016-03-07 > ________________________________ > hubo > ________________________________ > > ????Maciej Fijalkowski > ?????2016-03-07 16:46 > ???Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage > ????"hubo" > ???"Armin Rigo","Piotr > Jurkiewicz","PyPy Developer Mailing > List" > > Hi hubo. > > I think you're slightly confusing two things. > > UTF-16 is a variable-length encoding that has two-word characters that > *has to* return "1" for len() of those. UCS-2 seems closer to what you > described (which is a fixed-width encoding), but can't encode all the > unicode characters and as such is unsuitable for a modern unicode > representation. > > I'll discard UCS-2 as unsuitable and were we to use UTF-16, then the > slicing and size calculations still has to be as complicated as for > UTF-8. 
> > Complicated logic in repr() - those are not usually performance > critical parts of your program and it's ok to have some complications > there. > > It's true that UTF-16 can be less efficient than UTF-8 for certain > languages, however both are more memory efficient than what we > currently use (UCS4). There are however some problems - even if you > work exclusively in, say, korean, for example web servers still have > to deal with some parts that are ascii (html markup, css etc.) while > handling text in korean. In those cases UTF8 vs UTF16 is more muddled > and the exact details depend a lot. We also need to consider the fact > that we ship one canonical PyPy to everybody - people using different > languages and different encodings. > > Overall, UTF8 seems like definitely a better alternative than UCS4 > (also for asian languages), which is what we are using now and I would > be inclined to leave UTF16 as an option to see if it performs better > for certain benchmarks. > > Best regards, > Maciej Fijalkowski > > On Mon, Mar 7, 2016 at 9:58 AM, hubo wrote: >> I think it is not reasonable to use UTF-8 to represent the unicode string >> type. >> >> >> 1. Less storage - this is not always true. It is only true for strings >> with >> a lot of ASCII characters. In Asia, most strings in local languages >> (Japanese, Chinese, Korean) are non-ASCII characters, they may consume >> more >> storage than in UTF-16. To make things worse, while it always consumes 2*N >> bytes for a N-characters string in UTF-16, it is difficult to estimate the >> size of a N-characters string in UTF-8 (may be N bytes to 3 * N bytes) >> (UTF-16 also has two-word characters, but len() reports 2 for these >> characters, I think it is not harmful to treat them as two characters) >> >> 2. There would be very complicate logics for size calculating and slicing. >> For UTF-16, every character is represented with a 16-bit integer, so it is >> convient for size calculating and slicing. But character in UTF-8 consumes >> variant bytes, so either we call mb_* string functions instead (which is >> slow in nature) or we use special logic like storing indices of characters >> in another array (which introduces cost for extra addressings). >> >> 3. When displaying with repr(), non-ASCII characters are displayed with >> \uXXXX format. If the internal storage for unicode is UTF-8, the only way >> to >> be compatible with this format is to convert it back to UTF-16. >> >> It may be wiser to let programmers deside which encoding they would like >> to >> use. If they want to process UTF-8 strings without performance cost on >> converting, they should use "bytes". When correct size calculating and >> slicing of non-ASCII characters are concerned it may be better to use >> "unicode". >> >> 2016-03-07 >> ________________________________ >> hubo >> ________________________________ >> >> ????Armin Rigo >> ?????2016-03-05 16:09 >> ???Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage >> ????"Piotr Jurkiewicz" >> ???"PyPy Developer Mailing List" >> >> Hi Piotr, >> >> Thanks for giving some serious thoughts to the utf8-stored unicode >> string proposal! 
>> >> On 5 March 2016 at 01:48, Piotr Jurkiewicz >> wrote: >>> Random access would be as follows: >>> >>> page_num, byte_in_page = divmod(codepoint_pos, 64) >>> page_start_byte = index[page_num] >>> exact_byte = seek_forward(buffer[page_start_byte], byte_in_page) >>> return buffer[exact_byte] >> >> This is the part I'm least sure about: seek_forward() needs to be a >> loop over 0 to 63 codepoints. True, each loop can be branchless, and >> very short---let's say 4 instructions. But it still makes a total of >> up to 252 instructions (plus the checks to know if we must go on). >> These instructions are all or almost all dependent on the previous >> one: you must have finished computing the length of one sequence to >> even being computing the length of the next one. Maybe it's faster to >> use a more "XMM-izable" algorithm which counts 0 for each byte in >> 0x80-0xBF and 1 otherwise, and makes the sum. >> >> There are also variants, e.g. adding a second array of words similar >> to 'index', but where each word is 8 packed bytes giving 8 starting >> points inside the page (each in range 0-252). This would reduce the >> walk to 0-7 codepoints. >> >> I'm +1 on your proposal. The whole thing is definitely worth a try. >> >> >> A bient?t, >> >> Armin. >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev >> >> >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev >> From fijall at gmail.com Mon Mar 7 04:33:19 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 7 Mar 2016 11:33:19 +0200 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: <56DA2CE9.5070409@gmail.com> References: <56DA2CE9.5070409@gmail.com> Message-ID: Hi Piotr. Any chance to have a chat with you about the proposal on a more real-time communication medium like IRC or GChat? (it's #pypy on IRC and use my mail for gchat) On Sat, Mar 5, 2016 at 2:48 AM, Piotr Jurkiewicz wrote: > Hi PyPy devs, > > my name is Piotr Jurkiewicz and I am a first-year PhD student at > the AGH University of Science and Technology, Krak?w, Poland. > > I am writing this email to make sure that PyPy is going to > participate in GSoC 2016, since I am interested in one of the > proposed projects: Optimized Unicode Representation > > Below is a list of my ideas and plan for the project. > > (I use Python 2 nomenclature, that is unicode strings are > `unicode` objects and bytes strings are `str` objects.) > > 1. Store all unicode objects contents internally as UTF-8. > > This would reduce size of stored contents and allow external > libraries, which expect UTF-8, to process contents directly in the > memory (for example using various regexp libraries to search unicode > string). > > 2. Unify interning caches for str and unicode. > > This would allow unicode objects and corresponding > utf8-encoded-str objects to share the same interned buffer. > > For example unicode object u'ko?' would share interned buffer > with str 'ko\xc5\x84'. > > This would make unicode.encode('utf-8') basically no op. As UTF-8 > becomes dominant encoding for any data exchange, including web (86%) > [1], more and more data coming out from Python scripts needs to be > UTF-8 encoded. Therefore, it is important to make this operation as > cheap as possible. 
> > It would speed up str.decode('utf-8') significantly too, although it > wouldn't make it no op. String still would need to be checked if it > is a correct UTF-8 string when transforming to unicode object. But > we can get rid of additional allocation, copying string contents and > storing it twice, in CONST_STR_CACHE and CONST_UNICODE_CACHE. > > 3. Indexing of codepoints positions, what would allow O(1) random > access and slicing. > > The idea is simple: alongside contents of each interned unicode > object, store an array of unsigned integers. These integers will > be positions (in bytes), counting from the beginning of the buffer, > at which each next 64-codepoint-long 'pages' start. > > Random access would be as follows: > > page_num, byte_in_page = divmod(codepoint_pos, 64) > page_start_byte = index[page_num] > exact_byte = seek_forward(buffer[page_start_byte], byte_in_page) > return buffer[exact_byte] > > Using 64-byte long pages, like in the example above, would allow > O(1) random access, with constant terms of: > > - one cache access in cases of only-ASCII texts (indexes for such > unicode objects will not be created and maintained) > - three cache accesses in cases of texts consisting of ASCII mixed > with two-byte characters (Latin, Greek, Cyrillic, Hebrew, Arabic > alphabets) > - four or five cache accesses in cases of texts consisting mostly of > three- and four- byte characters > > (all above assuming 64-byte long CPU cache lines) > > Memory overhead associated with storing index array would be in > range 0 - 6.25%. (or 0 - 12.5% if unicode objects longer than 2^32 > codepoints will be allowed) > > (assuming that the index array consists of integers of smallest > possible type which can store buffer_bytes_len - 1) > > 4. Fast codepoints counting/seeking with branchless algorithm [2]. > > When unicode object is interned, we are sure that it is a correct > UTF-8 string. Therefore, there is no need for correctness checking > when seeking, so a branchless algorithm can be used. > > [1]: http://w3techs.com/technologies/details/en-utf8/all/all > [2]: > http://blogs.perl.org/users/nick_wellnhofer/2015/04/branchless-utf-8-length.html > > All of these changes can be introduced one at a time, what would > improve tracking of performance changes and debugging of eventual > errors. > > After completing the project I plan to write a paper describing > speedup method of random access unicode access based on indexing, as > this method has a potential for being used in other language > interpreters which have immutable and/or interned unicode strings. > Note that similar index can be created for graphemes as well, so > this method can be used in languages which provide grapheme-based > interface (like Perl 6). > > Please share your thoughts about these ideas. 
> > Cheers, > Piotr > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From hubo at jiedaibao.com Mon Mar 7 04:45:51 2016 From: hubo at jiedaibao.com (hubo) Date: Mon, 07 Mar 2016 17:45:51 +0800 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> <56DD480B.2070709@jiedaibao.com> Message-ID: <56DD4DCB.3070407@jiedaibao.com> Yes, it seems CPython 2.7 in Windows uses UTF-16, so: >>> '\ud805\udc09' '\\ud805\\udc09' >>> u'\ud805\udc09' u'\U00011409' >>> u'\ud805\udc09' == u'\U00011409' True >>> len(u'\U00011409') 2 In Linux CPython 2.7: >>> u'\U00011409' u'\U00011409' >>> len(u'\U00011409') 1 >>> u'\ud805\udc09' u'\ud805\udc09' >>> len(u'\ud805\udc09') 2 >>> u'\ud805\udc09' == u'\U00011409' False >>> u'\ud805\udc09'.encode('utf-8') '\xf0\x91\x90\x89' >>> u'\U00011409'.encode('utf-8') '\xf0\x91\x90\x89' >>> u'\ud805\udc09'.encode('utf-8') == u'\U00011409'.encode('utf-8') True 2016-03-07 hubo ????Maciej Fijalkowski ?????2016-03-07 17:31 ???Re: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage ????"hubo" ???"Armin Rigo","Piotr Jurkiewicz","PyPy Developer Mailing List" I think you're misunderstanding what we're proposing. We're proposing utf8 representation completely hidden from the user, where everything behaves just like cpython unicode (the len() example you're showing is a narrow unicode build I presume?) On Mon, Mar 7, 2016 at 11:21 AM, hubo wrote: > Yes, there are two-words characters in UTF-16, as I mentioned. But len() in > CPython returns 2 for these characters (even if they are correctly processed > in repr()): > >>>> len(u'\ud805\udc09') > 2 >>>> u'\ud805\udc09' > u'\U00011409' > > (Python 3.x seems to have removed the display processing) > > Maybe it is better to be compatible with CPython in these situations. Since > two-words characters are really rare in Unicode strings, programmers may not > know their existence and allocate exactly 2 * len(s) bytes for storing an > unicode string. It will crash the program or create security problems if > len() return 1 for these characters even if it is the correct result > according to Unicode standard. > > UTF-8 might be very useful in XML or Web processing, which is quite > important in Python programming nowadays. But I think it is more important > to let programmers "understand" the machanism. In C/C++, it is quite common > to use char[] for ASCII (or ANSI) characters and wchar_t for unicode > (actually UTF-16, or UCS-2) characters, so it may be suprising if unicode is > actually "UTF-8" in PyPy. Web programmers who uses CPython may already be > familiar with the differences between bytes (or str in Python2) and unicode > (or str in Python3), it is less likely for them to design their programs > based on special implementations of PyPy. > > 2016-03-07 > ________________________________ > hubo > ________________________________ > > ????Maciej Fijalkowski > ?????2016-03-07 16:46 > ???Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage > ????"hubo" > ???"Armin Rigo","Piotr > Jurkiewicz","PyPy Developer Mailing > List" > > Hi hubo. > > I think you're slightly confusing two things. > > UTF-16 is a variable-length encoding that has two-word characters that > *has to* return "1" for len() of those. 
UCS-2 seems closer to what you > described (which is a fixed-width encoding), but can't encode all the > unicode characters and as such is unsuitable for a modern unicode > representation. > > I'll discard UCS-2 as unsuitable and were we to use UTF-16, then the > slicing and size calculations still has to be as complicated as for > UTF-8. > > Complicated logic in repr() - those are not usually performance > critical parts of your program and it's ok to have some complications > there. > > It's true that UTF-16 can be less efficient than UTF-8 for certain > languages, however both are more memory efficient than what we > currently use (UCS4). There are however some problems - even if you > work exclusively in, say, korean, for example web servers still have > to deal with some parts that are ascii (html markup, css etc.) while > handling text in korean. In those cases UTF8 vs UTF16 is more muddled > and the exact details depend a lot. We also need to consider the fact > that we ship one canonical PyPy to everybody - people using different > languages and different encodings. > > Overall, UTF8 seems like definitely a better alternative than UCS4 > (also for asian languages), which is what we are using now and I would > be inclined to leave UTF16 as an option to see if it performs better > for certain benchmarks. > > Best regards, > Maciej Fijalkowski > > On Mon, Mar 7, 2016 at 9:58 AM, hubo wrote: >> I think it is not reasonable to use UTF-8 to represent the unicode string >> type. >> >> >> 1. Less storage - this is not always true. It is only true for strings >> with >> a lot of ASCII characters. In Asia, most strings in local languages >> (Japanese, Chinese, Korean) are non-ASCII characters, they may consume >> more >> storage than in UTF-16. To make things worse, while it always consumes 2*N >> bytes for a N-characters string in UTF-16, it is difficult to estimate the >> size of a N-characters string in UTF-8 (may be N bytes to 3 * N bytes) >> (UTF-16 also has two-word characters, but len() reports 2 for these >> characters, I think it is not harmful to treat them as two characters) >> >> 2. There would be very complicate logics for size calculating and slicing. >> For UTF-16, every character is represented with a 16-bit integer, so it is >> convient for size calculating and slicing. But character in UTF-8 consumes >> variant bytes, so either we call mb_* string functions instead (which is >> slow in nature) or we use special logic like storing indices of characters >> in another array (which introduces cost for extra addressings). >> >> 3. When displaying with repr(), non-ASCII characters are displayed with >> \uXXXX format. If the internal storage for unicode is UTF-8, the only way >> to >> be compatible with this format is to convert it back to UTF-16. >> >> It may be wiser to let programmers deside which encoding they would like >> to >> use. If they want to process UTF-8 strings without performance cost on >> converting, they should use "bytes". When correct size calculating and >> slicing of non-ASCII characters are concerned it may be better to use >> "unicode". >> >> 2016-03-07 >> ________________________________ >> hubo >> ________________________________ >> >> ????Armin Rigo >> ?????2016-03-05 16:09 >> ???Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage >> ????"Piotr Jurkiewicz" >> ???"PyPy Developer Mailing List" >> >> Hi Piotr, >> >> Thanks for giving some serious thoughts to the utf8-stored unicode >> string proposal! 
>> >> On 5 March 2016 at 01:48, Piotr Jurkiewicz >> wrote: >>> Random access would be as follows: >>> >>> page_num, byte_in_page = divmod(codepoint_pos, 64) >>> page_start_byte = index[page_num] >>> exact_byte = seek_forward(buffer[page_start_byte], byte_in_page) >>> return buffer[exact_byte] >> >> This is the part I'm least sure about: seek_forward() needs to be a >> loop over 0 to 63 codepoints. True, each loop can be branchless, and >> very short---let's say 4 instructions. But it still makes a total of >> up to 252 instructions (plus the checks to know if we must go on). >> These instructions are all or almost all dependent on the previous >> one: you must have finished computing the length of one sequence to >> even being computing the length of the next one. Maybe it's faster to >> use a more "XMM-izable" algorithm which counts 0 for each byte in >> 0x80-0xBF and 1 otherwise, and makes the sum. >> >> There are also variants, e.g. adding a second array of words similar >> to 'index', but where each word is 8 packed bytes giving 8 starting >> points inside the page (each in range 0-252). This would reduce the >> walk to 0-7 codepoints. >> >> I'm +1 on your proposal. The whole thing is definitely worth a try. >> >> >> A bient?t, >> >> Armin. >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev >> >> >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From tritium-list at sdamon.com Mon Mar 7 05:27:38 2016 From: tritium-list at sdamon.com (Alexander Walters) Date: Mon, 7 Mar 2016 05:27:38 -0500 Subject: [pypy-dev] Release 5.0.0 In-Reply-To: References: <56DB4CFB.6020007@gmail.com> <56DC9AAA.1080003@gmail.com> Message-ID: <56DD579A.8080603@sdamon.com> Forking is not an option on windows (it lacks fork.) On 3/7/2016 04:16, Maciej Fijalkowski wrote: > I have no idea how memory management works on windows (I doubt this > will solve it), but this is how we do that on linux > > On Mon, Mar 7, 2016 at 10:55 AM, Yury V. Zaytsev wrote: >> On Sun, 6 Mar 2016, Maciej Fijalkowski wrote: >> >>> It uses subprocess, but you need to quit pypy (so run this with --source >>> and then make separately) for memory to be reclaimed >> >> Do you think that pre-forking a process for compilation right at the >> beginning of the translation when PyPy hasn't consumed much memory yet would >> be a viable solution? >> >> I think if this is practical, it would be a much user friendlier solution as >> compared to two-step process (translation + compilation). If memory serves >> me well, this is one of the strategies that subprocess in Python 3 is using >> to improve on memory consumption. >> >> >> -- >> Sincerely yours, >> Yury V. 
Zaytsev > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From steve at pearwood.info Mon Mar 7 06:45:45 2016 From: steve at pearwood.info (Steven D'Aprano) Date: Mon, 7 Mar 2016 22:45:45 +1100 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> <56DD480B.2070709@jiedaibao.com> Message-ID: <20160307114545.GZ12028@ando.pearwood.info> On Mon, Mar 07, 2016 at 11:31:10AM +0200, Maciej Fijalkowski wrote: > I think you're misunderstanding what we're proposing. > > We're proposing utf8 representation completely hidden from the user, > where everything behaves just like cpython unicode (the len() example > you're showing is a narrow unicode build I presume?) Yes, CPython narrow builds don't handle Unicode code points in the supplementary planes well: they wrongly return len(2) for code points with a 4-byte UTF-16 representation: steve at runes:~$ python2.6 -c "print len(u'\U0010FFFF')" # wide build 1 steve at runes:~$ python2.7 -c "print len(u'\U0010FFFF')" # narrow build 2 That is no longer the case since Python 3.3, when the "flexible string representation" was introduced. https://www.python.org/dev/peps/pep-0393/ I think that it would be a very valuable experiment for PyPy to investigate moving to a UTF-8 internal representation. -- Steve From hubo at jiedaibao.com Mon Mar 7 07:49:24 2016 From: hubo at jiedaibao.com (hubo) Date: Mon, 07 Mar 2016 20:49:24 +0800 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: <20160307114545.GZ12028@ando.pearwood.info> References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> <56DD480B.2070709@jiedaibao.com> <20160307114545.GZ12028@ando.pearwood.info> Message-ID: <56DD78D1.30309@jiedaibao.com> Thanks for the link! It is interesting that in Python3.5, still >>> len(u'\ud805\udc09') 2 >>> u'\ud805\udc09' == u'\U00011409' False I think in Python 3.x, u'\ud805\udc09' is not another format of u'\U00011409', it is just an illegal unicode string. It also raises UnicodeEncodeError if you try to encode it into UTF-8. The problem is that it is legal to define and use these strings. If PyPy uses UTF-8 or UTF-16 as the internal storage format, I don't think it is possible to keep these details same as CPython, but it should be acceptable. Thanks again for the discussion. Unicode is really complicated. 2016-03-07 hubo ????Steven D'Aprano ?????2016-03-07 19:45 ???Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage ????"pypy-dev" ??? On Mon, Mar 07, 2016 at 11:31:10AM +0200, Maciej Fijalkowski wrote: > I think you're misunderstanding what we're proposing. > > We're proposing utf8 representation completely hidden from the user, > where everything behaves just like cpython unicode (the len() example > you're showing is a narrow unicode build I presume?) Yes, CPython narrow builds don't handle Unicode code points in the supplementary planes well: they wrongly return len(2) for code points with a 4-byte UTF-16 representation: steve at runes:~$ python2.6 -c "print len(u'\U0010FFFF')" # wide build 1 steve at runes:~$ python2.7 -c "print len(u'\U0010FFFF')" # narrow build 2 That is no longer the case since Python 3.3, when the "flexible string representation" was introduced. 
https://www.python.org/dev/peps/pep-0393/ I think that it would be a very valuable experiment for PyPy to investigate moving to a UTF-8 internal representation. -- Steve _______________________________________________ pypy-dev mailing list pypy-dev at python.org https://mail.python.org/mailman/listinfo/pypy-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From matti.picus at gmail.com Tue Mar 8 06:49:47 2016 From: matti.picus at gmail.com (matti picus) Date: Tue, 8 Mar 2016 13:49:47 +0200 Subject: [pypy-dev] release seems ready Message-ID: It seems we have a release, version ad5a4e55fa8e. Is there a reason to wait? buildbots http://buildbot.pypy.org/summary?branch=release-5.x release notice http://doc.pypy.org/en/latest/release-5.0.0.html Hopefully we can release 5.1 once s360-x lands on default Matti -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Tue Mar 8 08:41:36 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 8 Mar 2016 15:41:36 +0200 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: yay! can we call it rc1? if noone objects we'll make rc1 the release say in 24 or 48h On Tue, Mar 8, 2016 at 1:49 PM, matti picus wrote: > It seems we have a release, version ad5a4e55fa8e. Is there a reason to wait? > buildbots http://buildbot.pypy.org/summary?branch=release-5.x > release notice http://doc.pypy.org/en/latest/release-5.0.0.html > > Hopefully we can release 5.1 once s360-x lands on default > Matti > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From matti.picus at gmail.com Tue Mar 8 09:15:34 2016 From: matti.picus at gmail.com (matti picus) Date: Tue, 8 Mar 2016 16:15:34 +0200 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: We could package it and upload as rc1, but version_info will not have rc1 unless we rerun the builds. Confusing. I prefer to apologize if we get it wrong and release a 5.0.1 bugfix Matti On Tuesday, 8 March 2016, Maciej Fijalkowski wrote: > yay! > > can we call it rc1? if noone objects we'll make rc1 the release say in 24 > or 48h > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From arigo at tunes.org Tue Mar 8 09:26:10 2016 From: arigo at tunes.org (Armin Rigo) Date: Tue, 8 Mar 2016 15:26:10 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: Hi Matti, On 8 March 2016 at 15:15, matti picus wrote: > We could package it and upload as rc1, but version_info will not have rc1 > unless we rerun the builds. Confusing. > I prefer to apologize if we get it wrong and release a 5.0.1 bugfix +1. Go ahead as far as I'm concerned. About the release notice: "As a result, lxml with its cython compiled component passes all tests on PyPy" is not clear until the next official lxml is released. The current lxml 3.5.0 still contains a partially buggy workaround that tries to make it work on previous versions of cpyext. The trunk version at https://github.com/lxml/lxml has got this code removed, and that's the version that works. I'll make the ppc releases once the other releases are out. 
A bient?t, Armin From phyo.arkarlwin at gmail.com Tue Mar 8 09:22:56 2016 From: phyo.arkarlwin at gmail.com (Phyo Arkar) Date: Tue, 08 Mar 2016 14:22:56 +0000 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: I am going to test it out , quite interesting release. On Tue, Mar 8, 2016 at 8:12 PM Maciej Fijalkowski wrote: > yay! > > can we call it rc1? if noone objects we'll make rc1 the release say in 24 > or 48h > > On Tue, Mar 8, 2016 at 1:49 PM, matti picus wrote: > > It seems we have a release, version ad5a4e55fa8e. Is there a reason to > wait? > > buildbots http://buildbot.pypy.org/summary?branch=release-5.x > > release notice http://doc.pypy.org/en/latest/release-5.0.0.html > > > > Hopefully we can release 5.1 once s360-x lands on default > > Matti > > > > _______________________________________________ > > pypy-dev mailing list > > pypy-dev at python.org > > https://mail.python.org/mailman/listinfo/pypy-dev > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Tue Mar 8 09:36:01 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 8 Mar 2016 16:36:01 +0200 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: I'm ok with making it official 5.0. We can always do 5.0.1 if there are problems On Tue, Mar 8, 2016 at 4:15 PM, matti picus wrote: > We could package it and upload as rc1, but version_info will not have rc1 > unless we rerun the builds. Confusing. > I prefer to apologize if we get it wrong and release a 5.0.1 bugfix > Matti > > On Tuesday, 8 March 2016, Maciej Fijalkowski wrote: >> >> yay! >> >> can we call it rc1? if noone objects we'll make rc1 the release say in 24 >> or 48h >> >> > From fijall at gmail.com Tue Mar 8 09:42:05 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 8 Mar 2016 16:42:05 +0200 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: btw, should we mention packages.pypy.org? On Tue, Mar 8, 2016 at 4:36 PM, Maciej Fijalkowski wrote: > I'm ok with making it official 5.0. We can always do 5.0.1 if there are problems > > On Tue, Mar 8, 2016 at 4:15 PM, matti picus wrote: >> We could package it and upload as rc1, but version_info will not have rc1 >> unless we rerun the builds. Confusing. >> I prefer to apologize if we get it wrong and release a 5.0.1 bugfix >> Matti >> >> On Tuesday, 8 March 2016, Maciej Fijalkowski wrote: >>> >>> yay! >>> >>> can we call it rc1? if noone objects we'll make rc1 the release say in 24 >>> or 48h >>> >>> >> From arigo at tunes.org Tue Mar 8 09:46:26 2016 From: arigo at tunes.org (Armin Rigo) Date: Tue, 8 Mar 2016 15:46:26 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: Hi, On 8 March 2016 at 15:42, Maciej Fijalkowski wrote: > btw, should we mention packages.pypy.org? I would do so but only under two conditions: * it reports a post-cpyext-fixes result: which packages run or don't run now, ideally on the current "release 5.0" branch, but at least after the merge of the cpyext-gc-support-2 branch * we quickly review and fix the few manual comments, notably lxml's (we no longer recommend lxml-cffi). 
A bient?t, Armin From fijall at gmail.com Tue Mar 8 09:52:56 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 8 Mar 2016 16:52:56 +0200 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: Cool, I'm happy to do the suggested fixes. We rerun it every release usually, changes by hand are done earlier. Should I start a run on the current release branch? On Tue, Mar 8, 2016 at 4:46 PM, Armin Rigo wrote: > Hi, > > On 8 March 2016 at 15:42, Maciej Fijalkowski wrote: >> btw, should we mention packages.pypy.org? > > I would do so but only under two conditions: > > * it reports a post-cpyext-fixes result: which packages run or don't > run now, ideally on the current "release 5.0" branch, but at least > after the merge of the cpyext-gc-support-2 branch > > * we quickly review and fix the few manual comments, notably lxml's > (we no longer recommend lxml-cffi). > > > A bient?t, > > Armin From fijall at gmail.com Tue Mar 8 09:53:19 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 8 Mar 2016 16:53:19 +0200 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: in other words, it shows the last release of pypy, not "trunk" On Tue, Mar 8, 2016 at 4:52 PM, Maciej Fijalkowski wrote: > Cool, I'm happy to do the suggested fixes. > > We rerun it every release usually, changes by hand are done earlier. > Should I start a run on the current release branch? > > On Tue, Mar 8, 2016 at 4:46 PM, Armin Rigo wrote: >> Hi, >> >> On 8 March 2016 at 15:42, Maciej Fijalkowski wrote: >>> btw, should we mention packages.pypy.org? >> >> I would do so but only under two conditions: >> >> * it reports a post-cpyext-fixes result: which packages run or don't >> run now, ideally on the current "release 5.0" branch, but at least >> after the merge of the cpyext-gc-support-2 branch >> >> * we quickly review and fix the few manual comments, notably lxml's >> (we no longer recommend lxml-cffi). >> >> >> A bient?t, >> >> Armin From fijall at gmail.com Tue Mar 8 10:11:56 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 8 Mar 2016 17:11:56 +0200 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: ugh, btw, it seems someone broke embedding (as advertised, probably the cffi embedding still works) On Tue, Mar 8, 2016 at 4:53 PM, Maciej Fijalkowski wrote: > in other words, it shows the last release of pypy, not "trunk" > > On Tue, Mar 8, 2016 at 4:52 PM, Maciej Fijalkowski wrote: >> Cool, I'm happy to do the suggested fixes. >> >> We rerun it every release usually, changes by hand are done earlier. >> Should I start a run on the current release branch? >> >> On Tue, Mar 8, 2016 at 4:46 PM, Armin Rigo wrote: >>> Hi, >>> >>> On 8 March 2016 at 15:42, Maciej Fijalkowski wrote: >>>> btw, should we mention packages.pypy.org? >>> >>> I would do so but only under two conditions: >>> >>> * it reports a post-cpyext-fixes result: which packages run or don't >>> run now, ideally on the current "release 5.0" branch, but at least >>> after the merge of the cpyext-gc-support-2 branch >>> >>> * we quickly review and fix the few manual comments, notably lxml's >>> (we no longer recommend lxml-cffi). 
>>> >>> >>> A bient?t, >>> >>> Armin From arigo at tunes.org Tue Mar 8 10:16:21 2016 From: arigo at tunes.org (Armin Rigo) Date: Tue, 8 Mar 2016 16:16:21 +0100 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: <56DD78D1.30309@jiedaibao.com> References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> <56DD480B.2070709@jiedaibao.com> <20160307114545.GZ12028@ando.pearwood.info> <56DD78D1.30309@jiedaibao.com> Message-ID: Hi hubo, On 7 March 2016 at 13:49, hubo wrote: > I think in Python 3.x, u'\ud805\udc09' is not another format of > u'\U00011409', it is just an illegal unicode string. It also raises > UnicodeEncodeError if you try to encode it into UTF-8. The problem is that > it is legal to define and use these strings. If PyPy uses UTF-8 or UTF-16 as > the internal storage format, I don't think it is possible to keep these > details same as CPython, but it should be acceptable. We're good at keeping obscure details the same as CPython. It's only a matter of adding the correct checks on top of the encode() and decode() methods, independently of the underlying representation. In this case, because we can consider the length-1 unicode string u'\ud805', then we have to internally represent it somehow, and the natural way would be to represent it as the 3 bytes '\xed\xa0\x85'. So for u'\ud805\udc09' we use 6 bytes. Strictly speaking, we're thus not using utf-8 internally, but "utf-8-without-extra-consistency-checks". In Python 2, u'\ud805\udc09'.decode('utf-8') returns '\xf0\x91\x90\x89', i.e. a single code point of 4 bytes. This means that calling ``decode('utf-8')`` has to check for surrogates, and do something more complicated on Python 2.x (or complain on Python 3.x). In other words, neither ``decode('utf-8')`` nor ``encode('utf-8')`` can be no-ops. Decoding and encoding need to check the data, and might actually need to make a copy in corner cases, but not in the vast majority of cases. This is all focused on the web and generally Linux approach of "utf-8 everywhere". For Windows, the story is more complicated. CPython 2.x uses UTF-16, like the Windows API. However, the recent CPython 3.x moved anyway towards a variable-encoding model of UCS-4 (==UTF-32). If you are on a recent CPython 3.x and build a unicode object with a large codepoint, and then call the Windows API with it, it will need anyway to convert it to UTF-16 dynamically, as far as I can tell---i.e. convert from UCS-4 to UTF-16. In the proposal that is discussed here, it would instead have to convert from utf-8-without-extra-consistency-checks to UTF-16 in that situation. There are definitely trade-offs to explore, but I doubt that we can fully explore these trade-offs without actually trying it out. A bient?t, Armin. From robin.kruppe at gmail.com Tue Mar 8 11:10:57 2016 From: robin.kruppe at gmail.com (Robin Kruppe) Date: Tue, 8 Mar 2016 17:10:57 +0100 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> <56DD480B.2070709@jiedaibao.com> <20160307114545.GZ12028@ando.pearwood.info> <56DD78D1.30309@jiedaibao.com> Message-ID: Hi all, I just wanted to mention that several other language implementors have faced the same problem of dealing with "UTF-16" containing lone surrogate code points and representing it in "UTF-8", and they have come up with essentially the same solution. 
Users include the Racket, Scheme 48, and Rust languages (all three only for I/O on Windows) and the Servo browser engine (for the sake of JavaScript). Recently Simon Sapin of Mozilla has spec'd this trick in exhausting detail, christening it WTF-8: https://simonsapin.github.io/wtf-8/ While everything described there may be pretty obvious (for those immersed in the guts of Unicode), I wanted to raise awareness that this has a name and other users. Cheers, Robin On 8 March 2016 at 16:16, Armin Rigo wrote: > Hi hubo, > > On 7 March 2016 at 13:49, hubo wrote: > > I think in Python 3.x, u'\ud805\udc09' is not another format of > > u'\U00011409', it is just an illegal unicode string. It also raises > > UnicodeEncodeError if you try to encode it into UTF-8. The problem is > that > > it is legal to define and use these strings. If PyPy uses UTF-8 or > UTF-16 as > > the internal storage format, I don't think it is possible to keep these > > details same as CPython, but it should be acceptable. > > We're good at keeping obscure details the same as CPython. It's only > a matter of adding the correct checks on top of the encode() and > decode() methods, independently of the underlying representation. > > In this case, because we can consider the length-1 unicode string > u'\ud805', then we have to internally represent it somehow, and the > natural way would be to represent it as the 3 bytes '\xed\xa0\x85'. > So for u'\ud805\udc09' we use 6 bytes. Strictly speaking, we're thus > not using utf-8 internally, but > "utf-8-without-extra-consistency-checks". In Python 2, > u'\ud805\udc09'.decode('utf-8') returns '\xf0\x91\x90\x89', i.e. a > single code point of 4 bytes. This means that calling > ``decode('utf-8')`` has to check for surrogates, and do something more > complicated on Python 2.x (or complain on Python 3.x). In other > words, neither ``decode('utf-8')`` nor ``encode('utf-8')`` can be > no-ops. Decoding and encoding need to check the data, and might > actually need to make a copy in corner cases, but not in the vast > majority of cases. > > This is all focused on the web and generally Linux approach of "utf-8 > everywhere". For Windows, the story is more complicated. CPython 2.x > uses UTF-16, like the Windows API. However, the recent CPython 3.x > moved anyway towards a variable-encoding model of UCS-4 (==UTF-32). > If you are on a recent CPython 3.x and build a unicode object with a > large codepoint, and then call the Windows API with it, it will need > anyway to convert it to UTF-16 dynamically, as far as I can > tell---i.e. convert from UCS-4 to UTF-16. In the proposal that is > discussed here, it would instead have to convert from > utf-8-without-extra-consistency-checks to UTF-16 in that situation. > > There are definitely trade-offs to explore, but I doubt that we can > fully explore these trade-offs without actually trying it out. > > > A bient?t, > > Armin. > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From arigo at tunes.org Tue Mar 8 11:30:12 2016 From: arigo at tunes.org (Armin Rigo) Date: Tue, 8 Mar 2016 17:30:12 +0100 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> <56DD480B.2070709@jiedaibao.com> <20160307114545.GZ12028@ando.pearwood.info> <56DD78D1.30309@jiedaibao.com> Message-ID: Hi Robin, On 8 March 2016 at 17:10, Robin Kruppe wrote: > I just wanted to mention that several other language implementors have faced > ... > While everything described there may be pretty obvious (for those immersed > in the guts of Unicode), I wanted to raise awareness that this has a name > and other users. Thanks! We'd be using the "generalized UTF-8" from https://simonsapin.github.io/wtf-8/, in principle. We'd not be using WTF-8 because it considers that u'\ud805\udc09' == u'\U00011409', whereas CPython does not, generally. A bient?t, Armin. From djkonro35 at gmail.com Wed Mar 9 08:12:03 2016 From: djkonro35 at gmail.com (Djimeli Konrad) Date: Wed, 9 Mar 2016 14:12:03 +0100 Subject: [pypy-dev] Interest in contributing to PYPY Message-ID: Hello, My name is Djimeli Konrad a second year computer science student from the University of Buea, Cameroon. I am proficient in c, c++, javascript and python. I would like to contribute to PYPY for the Google Summer of Code 2016. I am interested in working on the project "Improving the jitviewer". I have previous experience developing Django/Python applications ( https://github.com/MCQuizzer/mcquizzer/graphs/contributors ), VRML-STL parser hosted on github ( https://github.com/djkonro/vrml-stl ) and other project ( https://github.com/djkonro ). I would like to work on this project within and beyond GSoC and as I have always sought for such a project ever since I learned python and web application development.I would like to get some pointer to some starting point that could give me a better understanding of the project. Thanks Konrad -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Thu Mar 10 01:41:03 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 10 Mar 2016 08:41:03 +0200 Subject: [pypy-dev] Interest in contributing to PYPY In-Reply-To: References: Message-ID: Hi! Good to hear from you :-) Any chance you can pop in to IRC, so we can discuss the project? Alternatively you can catch me on gmail on this address Best regards, Maciej Fijalkowski On Wed, Mar 9, 2016 at 3:12 PM, Djimeli Konrad wrote: > Hello, > > My name is Djimeli Konrad a second year computer science student from the > University of Buea, Cameroon. I am proficient in c, c++, javascript and > python. I would like to contribute to PYPY for the Google Summer of Code > 2016. I am interested in working on the project "Improving the jitviewer". I > have previous experience developing Django/Python applications ( > https://github.com/MCQuizzer/mcquizzer/graphs/contributors ), VRML-STL > parser hosted on github ( https://github.com/djkonro/vrml-stl ) and other > project ( https://github.com/djkonro ). I would like to work on this > project within and beyond GSoC and as I have always sought for such a > project ever since I learned python and web application development.I would > like to get some pointer to some starting point that could give me a better > understanding of the project. 
> > Thanks > Konrad > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From ishankhare07 at gmail.com Thu Mar 10 11:31:36 2016 From: ishankhare07 at gmail.com (Ishan Khare) Date: Thu, 10 Mar 2016 16:31:36 +0000 Subject: [pypy-dev] Contribute in GSOC Message-ID: Hi, I am a newcomer to contributing to pypy, but I'm fairly good in python & c. I would like to contribute to PyPy. Are all ideas listed in Potential project list eligible for GSOC. Where should I probably get started? Regards, Ishan -------------- next part -------------- An HTML attachment was scrubbed... URL: From cfbolz at gmx.de Fri Mar 11 09:09:31 2016 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Fri, 11 Mar 2016 15:09:31 +0100 Subject: [pypy-dev] Call for Papers Programming Experience 2016 Message-ID: <56E2D19B.3030505@gmx.de> Call for Papers *** Programming Experience 2016 (PX/16) Workshop *** July 18 (Mon), 2016 Co-located with ECOOP 2016 in Rome 2016.ecoop.org/track/PX-2016 programming-experience.org/px16 === Abstract === Imagine a software development task. Some sort of requirements and specification including performance goals and perhaps a platform and programming language. A group of developers head into a vast workroom. The Programming Experience Workshop is about what happens in that room when one or a couple of programmers sit down in front of computers and produce code, especially when it's exploratory programming. Do they create text that is transformed into running behavior (the old way), or do they operate on behavior directly ("liveness"); are they exploring the live domain to understand the true nature of the requirements; are they like authors creating new worlds; does visualization matter; is the experience immediate, immersive, vivid and continuous; do fluency, literacy, and learning matter; do they build tools, meta-tools; are they creating languages to express new concepts quickly and easily; and curiously, is joy relevant to the experience? Correctness, performance, standard tools, foundations, and text-as-program are important traditional research areas, but the experience of programming and how to improve and evolve it are the focus of this workshop. === Submissions === Submissions are solicited for Programming Experience 2016 (PX/16). The thrust of the workshop is to explore the human experience of programming?what it feels like to program, or more accurately, what it should feel like. The technical topics include exploratory programming, live programming, authoring, representation of active content, visualization, navigation, modularity mechanisms, immediacy, literacy, fluency, learning, tool building, and language engineering. Submissions by academics, professional programmers, and non-professional programmer are welcome. Submissions can be in any form and format, including but not limited to papers, presentations, demos, videos, panels, debates, essays, writers' workshops, and art. Presentation slots will be between 30 minutes and one hour, depending on quality, form, and relevance to the workshop. Submissions directed toward publication should be so marked, and the program committee will engage in peer review for all such papers. Video publication will be arranged. All artifacts are to be submitted via EasyChair (https://easychair.org/conferences/?conf=px16). 
Papers and essays must be written in English, provided as PDF documents, and follow the ACM SIGPLAN Conference Format (10 point font, Times New Roman font family, numeric citation style, http://www.sigplan.org/Resources/Author/). There is no page limit on submitted papers and essays. It is, however, the responsibility of the authors to keep the reviewers interested and motivated to read the paper. Reviewers are under no obligation to read all or even a substantial portion of a paper or essay if they do not find the initial part of it interesting. === Format === Paper presentations, presentations without papers, live demonstrations, performances, videos, panel discussions, debates, writers' workshops, art galleries, dramatic readings. === Review === Papers and essays labeled as publications will undergo standard peer review; other submissions will be reviewed for relevance and quality; shepherding will be available. === Important dates === Submissions: April 15, 2016 (anywhere in the world) Notifications: May 13, 2016 PX/16: July 18, 2016 === Publication === Papers and essays accepted through peer review will be published as part of ACM's Digital Library; video publication on Vimeo or other streaming site; other publication on the PX workshop website. === Organizers === Robert Hirschfeld, Hasso Plattner Institute, University of Potsdam, Germany Richard P. Gabriel, Dreamsongs and IBM Almaden Research Center, United States Hidehiko Masuhara, Mathematical and Computing Science, Tokyo Institute of Technology, Japan === Program committee === Carl Friedrich Bolz, King's College London, United Kingdom Gilad Bracha, Google, United States Andrew Bragdon, Twitter, United States Jonathan Edwards, CDG Labs, United States Jun Kato, National Institute of Advanced Industrial Science and Technology, Japan Cristina Videira Lopes, University of California at Irvine, United States Yoshiki Ohshima, Viewpoints Research Institute, United States Michael Perscheid, SAP Innovation Center, Germany Guido Salvaneschi, TU Darmstadt, Germany Marcel Taeumel, Hasso Plattner Institute, University of Potsdam, Germany Alessandro Warth, SAP Labs, United States From nkumar736 at gmail.com Fri Mar 11 16:58:30 2016 From: nkumar736 at gmail.com (Naveen Kumar) Date: Sat, 12 Mar 2016 03:28:30 +0530 Subject: [pypy-dev] GSoC 2016 Message-ID: Hello, I'm Naveen Kumar, an Information Science Engineering student from Bangalore, India. I got to know about PyPy from a book that I started studying the book "Expert Python Programming" by Tarek Ziad? and I was totally Intrigued. I take this opportunity to be a part of the community and contribute actively. As for me, I've been using Python from the past 8 months and I built a Blog using Flask (following the footsteps of Miguel Grinberg) with features like a Music Player. Other than that, I do not have much of an experience. Again, I'd love to be a part of the community and I'd like to be guided on how to go about it. Thanks, Naveen (nkumar736 at gmail.com) -------------- next part -------------- An HTML attachment was scrubbed... URL: From pabi.lenka at gmail.com Sat Mar 12 05:52:43 2016 From: pabi.lenka at gmail.com (Pabitra Lenka) Date: Sat, 12 Mar 2016 16:22:43 +0530 Subject: [pypy-dev] TO GET STARTED Message-ID: Greetings Developers, I am a newbie.I would like to contribute to your organization.Can anyone get me started.? 
-- Cheers, Pabitra Lenka Department of Information Technology Class of 2018 IIIT Bhubaneswar From djkonro35 at gmail.com Mon Mar 14 04:46:31 2016 From: djkonro35 at gmail.com (Djimeli Konrad) Date: Mon, 14 Mar 2016 09:46:31 +0100 Subject: [pypy-dev] Fwd: Interest in contributing to PYPY In-Reply-To: References: Message-ID: Hello, As discoursed on IRC, I am trying to develop a parser for Jitviewer, that is not dependent on rpython for my first patch. I am new to Pypy and I would like to get some help/pointer that would help me accomplish this task. Mainly resources on how log files are generated. I would also like to get more details on what improvements are to be done with respect Jitviewer, for GSOC 2016, as application are about to start. So far in trying to generate a log file, I have tried the following commands; PYPYLOG=jit-backend:/home/konro/jitviewer/logfile pypy ../source.py (to generate the log file) and I got the following output http://pastebin.com/xv7nS1i2 But when I try to view the file with Jitviewer, I get errors http://pastebin.com/LFBB12sj Please I need some help to identify what I am doing wrong. Thanks Konrad From nzinov at gmail.com Mon Mar 14 14:15:45 2016 From: nzinov at gmail.com (=?UTF-8?B?0J3QuNC60L7Qu9Cw0Lkg0JfQuNC90L7Qsg==?=) Date: Mon, 14 Mar 2016 18:15:45 +0000 Subject: [pypy-dev] Copy-on-write list slicing as GSoC project Message-ID: Hello dear PyPy developers, My name is Nikolay Zinov. I am a sophomore student at Moscow Institute of Physics and Technology. I am very interested in contributing to PyPy as a GSoC project. I found implementing copy-on-write list slicing particularly interesting for me. Below go my ideas. Note, that at some places I see different possible choices so I need feedback. 1. What we want to get is *myslice = mylist[a:b]* only cause data copying if *myslice* or *mylist* are mutated. 2. This can be implemented by creating a special list strategy. When getslice operation is performed, the original list is switched to that strategy and a new list with shared storage is created. Storage layout is a tuple of reference counter and the underlying RPython list. This storage would be shared between several W_ListObject instances. A field containing slice object representing would be added to the W_ListObject. List operations are implemented as follows: non modifying ops perform indices conversion and proxy the call to the underlying strategy; modifying ops cause new list creation with normal strategy. If a slice of a slice is taken we can calculate the resulting slice of the original list. 3. Some drawbacks of this solution. a) Additional field (slice object) added to W_ListObject. Another option would be to make this value a part of the storage. However, this value is unique for the slice while other data are shared. Therefore, it would require an additional level of indirection with the W_ListObject pointing to some header which in its turn points to shared data. b) If the original list is modified it is copied and not the (probably smaller) slice. The solution would be quite complicated with the original list storing references to all its slices. The good thing is that this scenario (create a slice -> modify the original list) is quite rare (or it would be if not for the next problem). c) Copy-on-write is inefficient in a GC'd environment. Abandoned slice can take a while to be freed and till then it will block modifying operations on the original list. 
I see no good solution for this problem other than keeping the
reference counter in the slice instance, which is probably not a good
idea.

4. With regard to the last problem it is interesting to consider
omitting the reference counter on the shared data and copying always.
It would save another level of indirection, and would have little
impact on performance if the slices are not freed anyway.

5. Benchmarks should be done to find the cutoff length at which this
strategy gives a performance benefit over blind copying.

Please give me your feedback on this idea and the feasibility of its
becoming a GSoC project.

Cheers,
Nikolay Zinov
nzinov at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From m at magnusmorton.com  Mon Mar 14 22:32:45 2016
From: m at magnusmorton.com (Magnus Morton)
Date: Tue, 15 Mar 2016 02:32:45 +0000
Subject: [pypy-dev] setting attribute of JitHookInterface instance
Message-ID: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com>

Hi,

I'm attempting to use the JitHookInterface to implement something like
the PyPy JIT hooks in pycket. However, I'm struggling to do anything
other than print information to stdout. From what I understand in pypy,
the pypyjit.hooks.pypy_hooks object is instantiated, and then after the
ObjSpace is initialised, it is assigned to pypy_hooks.space in
setup_after_space_initialization. In my case, when I assign anything to
an attribute of my JitHookInterface instance, translation blows up with

[translation:ERROR] MissingRTypeAttribute: on_abort
[translation:ERROR] .. (rpython.jit.metainterp.pyjitpl:2224)MetaInterp.aborted_tracing
[translation:ERROR] .. block at 59 with 2 exits(v1678)
[translation:ERROR] .. v1680 = getattr(v1679, ('on_abort'))

If any pycket people are reading this, what I'm trying to do at the
moment is give a JitHookInterface instance access to the module table
somehow. Copying the pypy JIT hooks approach is not strictly necessary -
I'd be happy with being able to update anything from within a
JitHookInterface callback which could then be accessed by application
level code.

Obviously, my understanding of what's going on here is lacking somewhat.
If anyone could point me in the correct general direction, I'd be very
grateful.

Best regards,
Magnus

From arigo at tunes.org  Tue Mar 15 06:48:01 2016
From: arigo at tunes.org (Armin Rigo)
Date: Tue, 15 Mar 2016 11:48:01 +0100
Subject: [pypy-dev] setting attribute of JitHookInterface instance
In-Reply-To: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com>
References: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com>
Message-ID:

Hi Magnus,

On 15 March 2016 at 03:32, Magnus Morton wrote:
> [translation:ERROR] MissingRTypeAttribute: on_abort
> [translation:ERROR] .. (rpython.jit.metainterp.pyjitpl:2224)MetaInterp.aborted_tracing
> [translation:ERROR] .. block at 59 with 2 exits(v1678)
> [translation:ERROR] .. v1680 = getattr(v1679, ('on_abort'))

This says that 'on_abort' is not found. Are you sure you have, like
pypy/module/pypyjit/hooks.py, written a JitHookInterface subclass
which provides all the same 'on_*' methods?


A bientôt,

Armin.
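For illustration, the shape of the subclass Armin is describing might look roughly like the sketch below. The import path is real, but the class name is invented and the on_abort signature shown is an assumption from around this era; the exact set of callbacks and their signatures should be copied from rpython/rlib/jit.py and pypy/module/pypyjit/hooks.py, not from here.

    from rpython.rlib.jit import JitHookInterface

    class PycketJitHooks(JitHookInterface):   # invented name
        # every callback the base class defines needs a body here; the
        # signature below is approximate -- copy the real one from
        # rpython/rlib/jit.py
        def on_abort(self, reason, jitdriver, greenkey, greenkey_repr,
                     logops, operations):
            pass
        # ... and likewise for every other callback the base class lists

    pycket_hooks = PycketJitHooks()           # single prebuilt instance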
From m at magnusmorton.com Tue Mar 15 10:45:54 2016 From: m at magnusmorton.com (Magnus Morton) Date: Tue, 15 Mar 2016 14:45:54 +0000 Subject: [pypy-dev] setting attribute of JitHookInterface instance In-Reply-To: References: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com> Message-ID: <495E75D7-C643-4B62-BD1B-AEB47757964D@magnusmorton.com> Hi Amin, Yes, it has all the methods defined. If I take out the assignment, but still define a JitPolicy with the hooks, it translates fine. Cheers, Magnus > On 15 Mar 2016, at 10:48, Armin Rigo wrote: > > Hi Magnus, > > On 15 March 2016 at 03:32, Magnus Morton wrote: >> [translation:ERROR] MissingRTypeAttribute: on_abort >> [translation:ERROR] .. (rpython.jit.metainterp.pyjitpl:2224)MetaInterp.aborted_tracing >> [translation:ERROR] .. block at 59 with 2 exits(v1678) >> [translation:ERROR] .. v1680 = getattr(v1679, ('on_abort')) > > This says that 'on_abort' is not found. Are you sure you have, like > pypy/module/pypyjit/hooks.py, written a JitHookInterface subclass > which provides all the same 'on_*' methods? > > > A bient?t, > > Armin. From arigo at tunes.org Tue Mar 15 11:32:14 2016 From: arigo at tunes.org (Armin Rigo) Date: Tue, 15 Mar 2016 16:32:14 +0100 Subject: [pypy-dev] setting attribute of JitHookInterface instance In-Reply-To: <495E75D7-C643-4B62-BD1B-AEB47757964D@magnusmorton.com> References: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com> <495E75D7-C643-4B62-BD1B-AEB47757964D@magnusmorton.com> Message-ID: Hi Magnus, On 15 March 2016 at 15:45, Magnus Morton wrote: > Yes, it has all the methods defined. If I take out the assignment, but still define a JitPolicy with the hooks, it translates fine. Can't help, I would need to reproduce the problem first. Please give step-by-step instructions about how to reach that error. Armin From m at magnusmorton.com Tue Mar 15 20:37:14 2016 From: m at magnusmorton.com (Magnus Morton) Date: Wed, 16 Mar 2016 00:37:14 +0000 Subject: [pypy-dev] setting attribute of JitHookInterface instance In-Reply-To: References: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com> <495E75D7-C643-4B62-BD1B-AEB47757964D@magnusmorton.com> Message-ID: <865995FE-2618-4927-A944-D7C047480603@magnusmorton.com> Hi Armin, You can recreate it in PyPy by putting the following two lines pretty much anywhere in interpreter level code other than the setup_after_space_initialization methods from pypy.module.pypyjit.hooks import pypy_hooks pypy_hooks.foo = ?foo? What I can?t understand is what is special about the setup_after_space_initialization methods that makes it work there. Cheers, Magnus > On 15 Mar 2016, at 15:32, Armin Rigo wrote: > > Hi Magnus, > > On 15 March 2016 at 15:45, Magnus Morton wrote: >> Yes, it has all the methods defined. If I take out the assignment, but still define a JitPolicy with the hooks, it translates fine. > > Can't help, I would need to reproduce the problem first. Please give > step-by-step instructions about how to reach that error. 
>
> Armin

From arigo at tunes.org  Wed Mar 16 04:45:56 2016
From: arigo at tunes.org (Armin Rigo)
Date: Wed, 16 Mar 2016 09:45:56 +0100
Subject: [pypy-dev] setting attribute of JitHookInterface instance
In-Reply-To: <865995FE-2618-4927-A944-D7C047480603@magnusmorton.com>
References: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com>
 <495E75D7-C643-4B62-BD1B-AEB47757964D@magnusmorton.com>
 <865995FE-2618-4927-A944-D7C047480603@magnusmorton.com>
Message-ID:

Hi Magnus,

On 16 March 2016 at 01:37, Magnus Morton wrote:
> You can recreate it in PyPy by putting the following two lines pretty
> much anywhere in interpreter level code other than the
> setup_after_space_initialization methods
>
> from pypy.module.pypyjit.hooks import pypy_hooks
> pypy_hooks.foo = 'foo'
>
> What I can't understand is what is special about the
> setup_after_space_initialization methods that makes it work there.

Reproduced and figured it out. Added some docs in eda9fd6a0601:

+ # WARNING: You should make a single prebuilt instance of a subclass
+ # of this class. You can, before translation, initialize some
+ # attributes on this instance, and then read or change these
+ # attributes inside the methods of the subclass. But this prebuilt
+ # instance *must not* be seen during the normal annotation/rtyping
+ # of the program! A line like ``pypy_hooks.foo = ...`` must not
+ # appear inside your interpreter's RPython code.

In PyPy, setup_after_space_initialization() is not RPython (which means
it is executed before translation).


A bientôt,

Armin.

From m at magnusmorton.com  Wed Mar 16 07:34:55 2016
From: m at magnusmorton.com (Magnus Morton)
Date: Wed, 16 Mar 2016 11:34:55 +0000
Subject: [pypy-dev] setting attribute of JitHookInterface instance
In-Reply-To:
References: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com>
 <495E75D7-C643-4B62-BD1B-AEB47757964D@magnusmorton.com>
 <865995FE-2618-4927-A944-D7C047480603@magnusmorton.com>
Message-ID:

Hi Armin,

Thanks for looking into this. Is this pre-translation code a general
thing possible with any RPython based compiler, or is it very PyPy
specific?

Cheers,
Magnus

> On 16 Mar 2016, at 08:45, Armin Rigo wrote:
>
> Hi Magnus,
>
> On 16 March 2016 at 01:37, Magnus Morton wrote:
>> You can recreate it in PyPy by putting the following two lines pretty
>> much anywhere in interpreter level code other than the
>> setup_after_space_initialization methods
>>
>> from pypy.module.pypyjit.hooks import pypy_hooks
>> pypy_hooks.foo = 'foo'
>>
>> What I can't understand is what is special about the
>> setup_after_space_initialization methods that makes it work there.
>
> Reproduced and figured it out. Added some docs in eda9fd6a0601:
>
> + # WARNING: You should make a single prebuilt instance of a subclass
> + # of this class. You can, before translation, initialize some
> + # attributes on this instance, and then read or change these
> + # attributes inside the methods of the subclass. But this prebuilt
> + # instance *must not* be seen during the normal annotation/rtyping
> + # of the program! A line like ``pypy_hooks.foo = ...`` must not
> + # appear inside your interpreter's RPython code.
>
> In PyPy, setup_after_space_initialization() is not RPython (which means
> it is executed before translation).
>
>
> A bientôt,
>
> Armin.
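To make the warning above concrete, here is a small sketch of the intended pattern. All names are invented and MyHooks only stands in for a real JitHookInterface subclass; the real hook signature is deliberately elided.

    class MyHooks(object):            # stands in for a JitHookInterface subclass
        def on_abort(self, *args):    # a hook callback; real signature elided
            # inside the methods of the subclass it is fine to read or
            # change attributes that were initialised before translation
            self.abort_count += 1

    my_hooks = MyHooks()              # the single prebuilt instance
    my_hooks.abort_count = 0          # fine: runs at import time, before translation

    # What must *not* happen is the prebuilt instance being seen during the
    # normal annotation/rtyping of the program: a line like
    # `my_hooks.foo = 'foo'` inside the interpreter's RPython code is what
    # produces errors such as the MissingRTypeAttribute seen earlier.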
From fijall at gmail.com Wed Mar 16 07:59:52 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Wed, 16 Mar 2016 13:59:52 +0200 Subject: [pypy-dev] setting attribute of JitHookInterface instance In-Reply-To: References: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com> <495E75D7-C643-4B62-BD1B-AEB47757964D@magnusmorton.com> <865995FE-2618-4927-A944-D7C047480603@magnusmorton.com> Message-ID: It's general. You can do whatever you like before runtime (during import time for example) as long as the presented world to rpython is static enough - in other words Python is a meta-programming language for RPython On Wed, Mar 16, 2016 at 1:34 PM, Magnus Morton wrote: > Hi Armin, > > Thanks for looking into this. Is this pre-translation code a general thing possible with any RPython based compiler, or is it very PyPy specific? > > Cheers, > Magnus > >> On 16 Mar 2016, at 08:45, Armin Rigo wrote: >> >> Hi Magnus, >> >> On 16 March 2016 at 01:37, Magnus Morton wrote: >>> You can recreate it in PyPy by putting the following two lines pretty much anywhere in interpreter level code other than the setup_after_space_initialization methods >>> >>> from pypy.module.pypyjit.hooks import pypy_hooks >>> pypy_hooks.foo = ?foo? >>> >>> What I can?t understand is what is special about the setup_after_space_initialization methods that makes it work there. >> >> Reproduced and figured it out. Added some docs in eda9fd6a0601: >> >> + # WARNING: You should make a single prebuilt instance of a subclass >> + # of this class. You can, before translation, initialize some >> + # attributes on this instance, and then read or change these >> + # attributes inside the methods of the subclass. But this prebuilt >> + # instance *must not* be seen during the normal annotation/rtyping >> + # of the program! A line like ``pypy_hooks.foo = ...`` must not >> + # appear inside your interpreter's RPython code. >> >> In PyPy, setup_after_space_initialization() is not RPython (which means >> it is executed before translation). >> >> >> A bient?t, >> >> Armin. > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From mount.sarah at gmail.com Wed Mar 16 12:30:04 2016 From: mount.sarah at gmail.com (Sarah Mount) Date: Wed, 16 Mar 2016 16:30:04 +0000 Subject: [pypy-dev] Software benchmarking workshop, April 20, King's College London Message-ID: Dear all, PyPy developers in the UK may be interested in this event on the topic of software benchmarking. Registration will remain open until April 6th. If you have any questions please feel free to email me directly (off list). Best Practices in Software Benchmarking 2016 (#bench16) Wednesday April 20 2016 King's College London http://soft-dev.org/events/bench16/ For computer scientists and software engineers, benchmarking (evaluating the running time of a piece of software, or the performance of a piece of hardware) is a common method for evaluating new techniques. However, there is little agreement on how benchmarking should be carried out, how to control for confounding variables, how to analyse latency data, or how to aid the repeatability of experiments. This free workshop will be a venue for computer scientists and research software engineers to discuss their current best practices and future directions. 
For further information and free registration please visit: http://soft-dev.org/events/bench16/ Confirmed Speakers: Jan Vitek (Northeastern University) Joe Parker (The Jodrell Laboratory, Royal Botanic Gardens) Simon Taylor (University of Lancaster) Tomas Kalibera (Northeastern University) James Davenport (University of Bath) Edd Barrett (King's College London) Jeremy Bennett (Embecosm) Organizers: Sarah Mount & Laurence Tratt (King's College London) From lists at sonnenglanz.net Wed Mar 16 12:32:05 2016 From: lists at sonnenglanz.net (Pim van der Eijk (Lists)) Date: Wed, 16 Mar 2016 17:32:05 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: <56E98A85.6000503@sonnenglanz.net> Did the lxml project indicate they will provide a new release soon that incorporates these fixes? I tried to build the latest development code from source, but run into many issues (lxml build server down, source package missing the pre-generated C code etc. etc.), and customer company policy wouldn't allow using a development version in production anyway. The lxml 3.5.0 does not install with pypy-5.0.0 (it used to with pypy-4.0.1, though it was too buggy to be useful), and the lxml-cffi no longer installs. On 08-03-16 15:26, Armin Rigo wrote: > Hi Matti, > > On 8 March 2016 at 15:15, matti picus wrote: >> We could package it and upload as rc1, but version_info will not have rc1 >> unless we rerun the builds. Confusing. >> I prefer to apologize if we get it wrong and release a 5.0.1 bugfix > +1. Go ahead as far as I'm concerned. > > About the release notice: "As a result, lxml with its cython compiled > component passes all tests on PyPy" is not clear until the next > official lxml is released. The current lxml 3.5.0 still contains a > partially buggy workaround that tries to make it work on previous > versions of cpyext. The trunk version at https://github.com/lxml/lxml > has got this code removed, and that's the version that works. > > I'll make the ppc releases once the other releases are out. > > > A bient?t, > > Armin > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From sshakur.shamss at gmail.com Wed Mar 16 12:56:29 2016 From: sshakur.shamss at gmail.com (Shakur Shams) Date: Wed, 16 Mar 2016 22:56:29 +0600 Subject: [pypy-dev] GSoC 2016: Interested to work on the idea :Make bytearray type fast" Message-ID: Hi, I am Shakur Shams Mullick. I would like to participate in GSoC 2016 with PyPy. I have gone through the ideas list and would like to work on the idea to improve bytearray to perform fast ( http://doc.pypy.org/en/latest/project-ideas.html#make-bytearray-type-fast). I would like to work on this but I don't have any prior experience with PyPy. I have work experience as a professional python developer at a startup for about a year and I recently submitted a patch for cpython (not merged yet) and reported a bug. Previously I worked on util-linux also. Because I do not have prior knowledge of PyPy, I am not exactly sure how to implement this idea. That is why I would like to discuss this idea and would like someone to mentor me. Looking forward to your input. Thank you. Best regards, Shakur Shams Mullick -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From arigo at tunes.org Wed Mar 16 13:07:31 2016 From: arigo at tunes.org (Armin Rigo) Date: Wed, 16 Mar 2016 18:07:31 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: <56E98A85.6000503@sonnenglanz.net> References: <56E98A85.6000503@sonnenglanz.net> Message-ID: Hi Pim, On 16 March 2016 at 17:32, Pim van der Eijk (Lists) wrote: > Did the lxml project indicate they will provide a new release soon that > incorporates these fixes? You'll have to ask on the lxml mailing list. Armin From lists at sonnenglanz.net Thu Mar 17 11:13:16 2016 From: lists at sonnenglanz.net (Pim van der Eijk (Lists)) Date: Thu, 17 Mar 2016 16:13:16 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: <56E98A85.6000503@sonnenglanz.net> Message-ID: <56EAC98C.2040905@sonnenglanz.net> There is a new lxml release as of today, unfortunately there is an issue: https://bitbucket.org/pypy/pypy/issues/2260/pypy-500-dumps-core-with-lxml-360 On 16-03-16 18:07, Armin Rigo wrote: > Hi Pim, > > On 16 March 2016 at 17:32, Pim van der Eijk (Lists) > wrote: >> Did the lxml project indicate they will provide a new release soon that >> incorporates these fixes? > You'll have to ask on the lxml mailing list. > > Armin From arigo at tunes.org Thu Mar 17 12:27:56 2016 From: arigo at tunes.org (Armin Rigo) Date: Thu, 17 Mar 2016 17:27:56 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: <56EAC98C.2040905@sonnenglanz.net> References: <56E98A85.6000503@sonnenglanz.net> <56EAC98C.2040905@sonnenglanz.net> Message-ID: Hi, On 17 March 2016 at 16:13, Pim van der Eijk (Lists) wrote: > There is a new lxml release as of today, unfortunately there is an issue: > https://bitbucket.org/pypy/pypy/issues/2260/pypy-500-dumps-core-with-lxml-360 Yes, it's what we get when both sides (lxml and pypy) are half-hearted about supporting the other. The lxml tests seem to pass, but that may be because they are small. Many bigger and longer-running processes seem to crash like that. I'm investigating. A bient?t, Armin. From florin.papa at intel.com Fri Mar 18 03:57:35 2016 From: florin.papa at intel.com (Papa, Florin) Date: Fri, 18 Mar 2016 07:57:35 +0000 Subject: [pypy-dev] Refcount garbage collector build error Message-ID: <3A375A669FBEFF45B6B60E689636EDCA09B8D107@IRSMSX101.ger.corp.intel.com> Hi all, This is Florin Papa from the Dynamic Scripting Languages Team at Intel Corporation. I am trying to build pypy to use the refcount garbage collector, for testing purposes. I am following the indications here [1], but the following command fails: pypy ../../rpython/bin/rpython -O2 --gc=ref targetpypystandalone with the error: [translation:ERROR] OpErrFmt: [: No module named _weakref] When I run pypy in interactive mode, "import _weakref" works fine. I encounter the same error if I try to use python to run the rpython script. Is the refcount garbage collector still supported? [1] http://doc.pypy.org/en/latest/config/translation.gc.html Regards, Florin -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From fijall at gmail.com Fri Mar 18 04:37:21 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 18 Mar 2016 10:37:21 +0200 Subject: [pypy-dev] Refcount garbage collector build error In-Reply-To: <3A375A669FBEFF45B6B60E689636EDCA09B8D107@IRSMSX101.ger.corp.intel.com> References: <3A375A669FBEFF45B6B60E689636EDCA09B8D107@IRSMSX101.ger.corp.intel.com> Message-ID: Hi Florin The refcount garbage collector is only marginally supported (as far as our tests go), it's definitely neither tested nor really supported when translated, it was always very slow for example. (and as you noticed, there is no support for weakrefs for example) On Fri, Mar 18, 2016 at 9:57 AM, Papa, Florin wrote: > Hi all, > > > > This is Florin Papa from the Dynamic Scripting Languages Team at Intel > Corporation. > > > > I am trying to build pypy to use the refcount garbage collector, for testing > purposes. I am following the indications here [1], but the following command > fails: > > > > pypy ../../rpython/bin/rpython -O2 --gc=ref targetpypystandalone > > > > with the error: > > > > [translation:ERROR] OpErrFmt: [ 0x89a68a8>: No module named _weakref] > > > > When I run pypy in interactive mode, ?import _weakref? works fine. I > encounter the same error if I try to use python to run the rpython script. > Is the refcount garbage collector still supported? > > > > [1] http://doc.pypy.org/en/latest/config/translation.gc.html > > > > Regards, > > Florin > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From arigo at tunes.org Fri Mar 18 07:13:51 2016 From: arigo at tunes.org (Armin Rigo) Date: Fri, 18 Mar 2016 12:13:51 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: <56E98A85.6000503@sonnenglanz.net> <56EAC98C.2040905@sonnenglanz.net> Message-ID: Hi again, On 17 March 2016 at 17:27, Armin Rigo wrote: > On 17 March 2016 at 16:13, Pim van der Eijk (Lists) > wrote: >> There is a new lxml release as of today, unfortunately there is an issue: >> https://bitbucket.org/pypy/pypy/issues/2260/pypy-500-dumps-core-with-lxml-360 > > Yes, it's what we get when both sides (lxml and pypy) are half-hearted > about supporting the other. The lxml tests seem to pass, but that may > be because they are small. Many bigger and longer-running processes > seem to crash like that. I'm investigating. Fixed in 0173cdbbbacc, which then seems to work with lxml even on these larger examples. I'd love some more testing before we do the 5.0.1 bugfix release. Please try with a version of PyPy on the "release-5.x" branch recent enough to contain a09a60a9c381; an Ubuntu precompiled version is here: http://buildbot.pypy.org/nightly/release-5.x/pypy-c-jit-83125-a09a60a9c381-linux64.tar.bz2 A bient?t, Armin. From lists at sonnenglanz.net Fri Mar 18 08:25:23 2016 From: lists at sonnenglanz.net (Pim van der Eijk (Lists)) Date: Fri, 18 Mar 2016 13:25:23 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: <56E98A85.6000503@sonnenglanz.net> <56EAC98C.2040905@sonnenglanz.net> Message-ID: <56EBF3B3.5030903@sonnenglanz.net> Hi, I did some tests and there are no crashes. However, compared to CPython 2.7.10 there are some serious issues: - For my test programs (the script in the issue on BitBucket is derived from one of them), PyPy is much slower. 
script A: 256 seconds in PyPy versus 78 seconds in CPython script B: 9.73 seconds in PyPy versus 2.6 in Cpython - Memory use continues to grow up to over 80% at which time where my laptop starts swapping, whereas with CPython usage is never more than 4%. - Perhaps caused by the above, there are occasional freezes of several seconds in which nothing seems to happen, although CPU usage is still 100%. Kind Regards, Pim On 18-03-16 12:13, Armin Rigo wrote: > Hi again, > > On 17 March 2016 at 17:27, Armin Rigo wrote: >> On 17 March 2016 at 16:13, Pim van der Eijk (Lists) >> wrote: >>> There is a new lxml release as of today, unfortunately there is an issue: >>> https://bitbucket.org/pypy/pypy/issues/2260/pypy-500-dumps-core-with-lxml-360 >> Yes, it's what we get when both sides (lxml and pypy) are half-hearted >> about supporting the other. The lxml tests seem to pass, but that may >> be because they are small. Many bigger and longer-running processes >> seem to crash like that. I'm investigating. > Fixed in 0173cdbbbacc, which then seems to work with lxml even on > these larger examples. I'd love some more testing before we do the > 5.0.1 bugfix release. Please try with a version of PyPy on the > "release-5.x" branch recent enough to contain a09a60a9c381; an Ubuntu > precompiled version is here: > > http://buildbot.pypy.org/nightly/release-5.x/pypy-c-jit-83125-a09a60a9c381-linux64.tar.bz2 > > > A bient?t, > > Armin. From arigo at tunes.org Fri Mar 18 09:57:15 2016 From: arigo at tunes.org (Armin Rigo) Date: Fri, 18 Mar 2016 14:57:15 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: <56EBF3B3.5030903@sonnenglanz.net> References: <56E98A85.6000503@sonnenglanz.net> <56EAC98C.2040905@sonnenglanz.net> <56EBF3B3.5030903@sonnenglanz.net> Message-ID: Hi Pim, On 18 March 2016 at 13:25, Pim van der Eijk (Lists) wrote: > - For my test programs (the script in the issue on BitBucket is derived > from one of them), PyPy is much slower. If you're comparing the speed of scripts that have a large amount of crossings of the cpyext layer (i.e. crossings between Python code and CPython C extension code), then yes, it's expected to be much slower. The speed improved a lot recently, which means it is now *much slower* instead of *very, very much slower*. It makes no sense, now or in the future, to use PyPy in the hope to speed up a script that does _only_ lxml stuff with almost no Python code running in-between. > - Memory use continues to grow up to over 80% at which time where my laptop > starts swapping, whereas with CPython usage is never more than 4%. This is more annoying. Can you give us a way to reproduce this? Armin From lists at sonnenglanz.net Fri Mar 18 10:08:10 2016 From: lists at sonnenglanz.net (Pim van der Eijk (Lists)) Date: Fri, 18 Mar 2016 15:08:10 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: <56E98A85.6000503@sonnenglanz.net> <56EAC98C.2040905@sonnenglanz.net> <56EBF3B3.5030903@sonnenglanz.net> Message-ID: <56EC0BCA.2080606@sonnenglanz.net> On 18-03-16 14:57, Armin Rigo wrote: >> - Memory use continues to grow up to over 80% at which time where my laptop >> starts swapping, whereas with CPython usage is never more than 4%. > This is more annoying. Can you give us a way to reproduce this? 
It already happens with the script I attached to the original issue, which you already have: https://bitbucket.org/pypy/pypy/issues/2260/pypy-500-dumps-core-with-lxml-360 From tobias.oberstein at tavendo.de Fri Mar 18 13:08:21 2016 From: tobias.oberstein at tavendo.de (Tobias Oberstein) Date: Fri, 18 Mar 2016 18:08:21 +0100 Subject: [pypy-dev] Crossbar.io / AutobahnPython 0.13.0 In-Reply-To: <56EC34E6.6070904@gmail.com> References: <56EC34E6.6070904@gmail.com> Message-ID: <56EC3605.7050108@tavendo.de> Hi, we've released Crossbar.io and AutobahnPython 0.13.0, running on Twisted 16.0.0 and PyPy 5.0. Get it here: Source: * https://github.com/crossbario/crossbar * https://github.com/crossbario/autobahn-python Python Packages: * https://pypi.python.org/pypi/crossbar * https://pypi.python.org/pypi/autobahn Binary Packages (recommended) * http://crossbar.io/docs/Local-Installation/ The binary packages contain a complete, self-contained, optimized Crossbar.io with everything - including PyPy 5.0, and of course based on Twisted 16.0.0! These packages are available for Ubuntu, FreeBSD and CentOS. (thanks to Hawkie, Miss Amber Brown - she made that happen;) ) Cheers, /Tobias -------------- next part -------------- A non-text attachment was scrubbed... Name: Pasted image at 2016_03_18 05_41 PM.png Type: image/png Size: 170041 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Bildschirmfoto vom 2016-03-18 17:46:03.png Type: image/png Size: 221883 bytes Desc: not available URL: From arigo at tunes.org Fri Mar 18 13:52:18 2016 From: arigo at tunes.org (Armin Rigo) Date: Fri, 18 Mar 2016 18:52:18 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: <56EC0BCA.2080606@sonnenglanz.net> References: <56E98A85.6000503@sonnenglanz.net> <56EAC98C.2040905@sonnenglanz.net> <56EBF3B3.5030903@sonnenglanz.net> <56EC0BCA.2080606@sonnenglanz.net> Message-ID: Hi Pim, On 18 March 2016 at 15:08, Pim van der Eijk (Lists) wrote: >>> - Memory use continues to grow up to over 80% at which time where my >>> laptop >>> starts swapping, whereas with CPython usage is never more than 4%. >> >> This is more annoying. Can you give us a way to reproduce this? > > It already happens with the script I attached to the original issue, which > you already have: > https://bitbucket.org/pypy/pypy/issues/2260/pypy-500-dumps-core-with-lxml-360 Ok, partially reproduced. With CPython it grows continously too, but only up to 1.2GB and then it finishes. With PyPy it grows faster up to 22GB. If I add some "gc.collect()" executed every few seconds, then PyPy only grows up to 1.7GB. I added "add_memory_pressure=True" to some chosen mallocs inside cpyext, and it seems to be enough to fix the problem. Now PyPy grows up to 1.7GB even without any gc.collect(). Yay! (changeset 9137853fd0ec, grafted to release-5.x too) A bient?t, Armin. 
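[Archive editor's note: for anyone stuck on a PyPy build that predates the add_memory_pressure fix Armin describes above, the "gc.collect() every few seconds" workaround he mentions can be applied from application code. A minimal sketch, assuming a hypothetical parse_one() helper that does the lxml/cpyext-heavy work for a single input; only the periodic collection is the point here:]

    import gc
    import time

    def process_documents(paths, parse_one):
        # parse_one is a placeholder for the lxml-heavy work on one file.
        last_collect = time.time()
        for path in paths:
            parse_one(path)
            # Without the memory-pressure hints, PyPy's GC underestimates how
            # much memory the C extension is holding, so force a major
            # collection every few seconds to keep the footprint bounded.
            if time.time() - last_collect > 5.0:
                gc.collect()
                last_collect = time.time()
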
From lists at sonnenglanz.net Sun Mar 20 04:20:59 2016 From: lists at sonnenglanz.net (Pim van der Eijk (Lists)) Date: Sun, 20 Mar 2016 09:20:59 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: <56E98A85.6000503@sonnenglanz.net> <56EAC98C.2040905@sonnenglanz.net> <56EBF3B3.5030903@sonnenglanz.net> <56EC0BCA.2080606@sonnenglanz.net> Message-ID: <56EE5D6B.4090701@sonnenglanz.net> Hi Armin, On 18-03-16 18:52, Armin Rigo wrote: > Hi Pim, > > On 18 March 2016 at 15:08, Pim van der Eijk (Lists) > wrote: >>>> - Memory use continues to grow up to over 80% at which time where my >>>> laptop >>>> starts swapping, whereas with CPython usage is never more than 4%. >>> This is more annoying. Can you give us a way to reproduce this? >> It already happens with the script I attached to the original issue, which >> you already have: >> https://bitbucket.org/pypy/pypy/issues/2260/pypy-500-dumps-core-with-lxml-360 > Ok, partially reproduced. With CPython it grows continously too, but > only up to 1.2GB and then it finishes. With PyPy it grows faster up > to 22GB. If I add some "gc.collect()" executed every few seconds, > then PyPy only grows up to 1.7GB. > > I added "add_memory_pressure=True" to some chosen mallocs inside > cpyext, and it seems to be enough to fix the problem. Now PyPy grows > up to 1.7GB even without any gc.collect(). Yay! (changeset > 9137853fd0ec, grafted to release-5.x too) > I retested and confirm that the library works and memory use is now like CPython, which is great. It is still slower than CPython, for reasons you explained before, but that is because my test script heavily uses of lxml. In larger applications where lxml processing is a smaller part of the overall functionality, the PyPy speed-up of regular Python code could well compensate for this. Many thanks, Pim > A bient?t, > > Armin. From tinchester at gmail.com Sun Mar 20 21:43:05 2016 From: tinchester at gmail.com (=?UTF-8?Q?Tin_Tvrtkovi=C4=87?=) Date: Mon, 21 Mar 2016 01:43:05 +0000 Subject: [pypy-dev] PyPy Ubuntu PPA + a cpyext question Message-ID: Hello, first question: is the PyPy Ubuntu PPA still a maintained thing? I'm not demanding free labor here, just curious whether I should wait a little for 5.0 to show up there or change my Dockerfiles to direct download. second question: does PyPy support PyByteArray_CheckExact? I seem to have some Cython-generated code using it and PyPy seems to be refusing to import the resulting module. Cheers! -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Mon Mar 21 03:53:23 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 21 Mar 2016 09:53:23 +0200 Subject: [pypy-dev] PyPy Ubuntu PPA + a cpyext question In-Reply-To: References: Message-ID: PPA is usually updated, but as you said we can't demand deadlines PyByteArray_Check and PyByteArray_CheckExact are not implemented On Mon, Mar 21, 2016 at 3:43 AM, Tin Tvrtkovi? wrote: > Hello, > > first question: is the PyPy Ubuntu PPA still a maintained thing? I'm not > demanding free labor here, just curious whether I should wait a little for > 5.0 to show up there or change my Dockerfiles to direct download. > > second question: does PyPy support PyByteArray_CheckExact? I seem to have > some Cython-generated code using it and PyPy seems to be refusing to import > the resulting module. > > Cheers! 
> > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From tinchester at gmail.com Mon Mar 21 05:42:01 2016 From: tinchester at gmail.com (=?UTF-8?Q?Tin_Tvrtkovi=C4=87?=) Date: Mon, 21 Mar 2016 10:42:01 +0100 Subject: [pypy-dev] PyPy Ubuntu PPA + a cpyext question In-Reply-To: References: Message-ID: Thanks for the quick reply (as always). We'll stick with the PPA. About PyByteArray_CheckExact, any chance of it getting implemented in this next round of C-API extensions? Looking in the CPython source, it seems to be a one-line macro: #define PyByteArray_CheckExact(self) (Py_TYPE(self) == &PyByteArray_Type) but I admit to knowing basically nothing about this level of code. :) I figure asking here whether it can be implemented will be better than asking Cython to stop using it ;) Cheers! On Mon, Mar 21, 2016 at 8:53 AM, Maciej Fijalkowski wrote: > PPA is usually updated, but as you said we can't demand deadlines > > PyByteArray_Check and PyByteArray_CheckExact are not implemented > > On Mon, Mar 21, 2016 at 3:43 AM, Tin Tvrtkovi? > wrote: > > Hello, > > > > first question: is the PyPy Ubuntu PPA still a maintained thing? I'm not > > demanding free labor here, just curious whether I should wait a little > for > > 5.0 to show up there or change my Dockerfiles to direct download. > > > > second question: does PyPy support PyByteArray_CheckExact? I seem to have > > some Cython-generated code using it and PyPy seems to be refusing to > import > > the resulting module. > > > > Cheers! > > > > _______________________________________________ > > pypy-dev mailing list > > pypy-dev at python.org > > https://mail.python.org/mailman/listinfo/pypy-dev > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matti.picus at gmail.com Mon Mar 21 05:59:26 2016 From: matti.picus at gmail.com (Matti Picus) Date: Mon, 21 Mar 2016 11:59:26 +0200 Subject: [pypy-dev] PyPy Ubuntu PPA + a cpyext question In-Reply-To: References: Message-ID: <56EFC5FE.3030408@gmail.com> On 21/03/16 11:42, Tin Tvrtkovi? wrote: > Thanks for the quick reply (as always). > > We'll stick with the PPA. > > About PyByteArray_CheckExact, any chance of it getting implemented in > this next round of C-API extensions? Looking in the CPython source, it > seems to be a one-line macro: > > #define PyByteArray_CheckExact(self) (Py_TYPE(self) == &PyByteArray_Type) > > but I admit to knowing basically nothing about this level of code. :) > I figure asking here whether it can be implemented will be better than > asking Cython to stop using it ;) > > Cheers! > > mailing list > > > pypy-dev at python.org > > https://mail.python.org/mailman/listinfo/pypy-dev > > > > > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev While true, that would only get you to the next step, which is that much of the functionality of PyByteArray_Type is not implemented. See for instance the functions in cpyext/stubs.py or commit 16f119c9be67 which added a failing test for PyArg_ParseTuple, s*, and ByteArrays. If we were to push the CheckExact forward, what functionality is critical for cython to completely compile your module? 
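[Archive editor's note: for readers not familiar with the C-API naming, a rough Python-level rendering of the two checks under discussion follows. It is only an illustration of their semantics, not PyPy or cpyext code — as noted above, the real work is implementing bytearray support throughout cpyext, not these one-liners.]

    def bytearray_check_exact(obj):
        # C-level PyByteArray_CheckExact: the exact built-in type only
        return type(obj) is bytearray

    def bytearray_check(obj):
        # C-level PyByteArray_Check: subclasses of bytearray are accepted too
        return isinstance(obj, bytearray)
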
Matti From kunalgrover05 at gmail.com Mon Mar 21 10:29:22 2016 From: kunalgrover05 at gmail.com (Kunal Grover) Date: Mon, 21 Mar 2016 19:59:22 +0530 Subject: [pypy-dev] STM improvements GSoC project Message-ID: Hi, I am interested in improvements in PyPy-STM as a GSoC project. I have discussed some ideas with Remi, and put them down here in https://docs.google.com/document/d/1ZXORu2qgX6EixCWTb--HRMIFYauoWtJIGKkTlFi8DuY/edit . It would be great if you could comment here giving your suggestions regarding the same. Also, I am unsure about how to make vmprof work with this STM, and what is the complexity involved in that. Anyone can give suggestions about the same? Thank you. Kunal -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Tue Mar 22 02:41:02 2016 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 21 Mar 2016 23:41:02 -0700 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year Message-ID: Hi all, I wanted to announce a workshop I'm organizing at SciPy this year, and invite you to attend! What: A two-day workshop bringing together folks working on JIT/AOT compilation in Python. When/where: July 11-12, in Austin, Texas. (This is co-located with SciPy 2016, at the same time as the tutorial sessions, just before the conference proper.) Website: https://python-compilers-workshop.github.io/ Note that I anticipate that we'll be able to get sponsorship funding to cover travel costs for folks who can't get their employers to foot the bill. Cheers, -n -- Nathaniel J. Smith -- https://vorpus.org From bg379 at cornell.edu Tue Mar 22 14:56:23 2016 From: bg379 at cornell.edu (Brian Guo) Date: Tue, 22 Mar 2016 14:56:23 -0400 Subject: [pypy-dev] GSoC: Updates on ByteArray? Message-ID: Hi, My name is Brian Guo and I am currently an undergraduate at Cornell University. I am very interested in working with PyPy as part of Google's Summer of Code. In particular, I am interested in working on the bytearray project. I noticed that the current status of the ByteArray project is unknown, but that there may be updates on the mailing list. I am wondering if there is any information I may be able to read on this project, or possibly an overview of the project itself and the proposed changes that would make byteArray faster (if any have been proposed yet). I am very grateful to anyone who is able to point me in the right direction in regards to this project. Thank you all for your time, -Brian Guo -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Tue Mar 22 15:36:25 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 22 Mar 2016 21:36:25 +0200 Subject: [pypy-dev] GSoC: Updates on ByteArray? In-Reply-To: References: Message-ID: Hi Brian bytearray should be optimized for cases where you e.g. write() it to file or use read_into() in a way that does not make any copies. Same if you say convert it from ffi.buffer etc. That's probably what's missing from making it fast On Tue, Mar 22, 2016 at 8:56 PM, Brian Guo wrote: > Hi, > > My name is Brian Guo and I am currently an undergraduate at Cornell > University. I am very interested in working with PyPy as part of Google's > Summer of Code. In particular, I am interested in working on the bytearray > project. I noticed that the current status of the ByteArray project is > unknown, but that there may be updates on the mailing list. 
I am wondering > if there is any information I may be able to read on this project, or > possibly an overview of the project itself and the proposed changes that > would make byteArray faster (if any have been proposed yet). I am very > grateful to anyone who is able to point me in the right direction in regards > to this project. > > Thank you all for your time, > > -Brian Guo > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From john.m.camara at gmail.com Wed Mar 23 14:16:37 2016 From: john.m.camara at gmail.com (John Camara) Date: Wed, 23 Mar 2016 14:16:37 -0400 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year Message-ID: Hi Nathaniel, I would like to suggest one more topic for the workshop. I see a big need for a library (jffi) similar to cffi but that provides a bridge to Java instead of C code. The ability to seamlessly work with native Java data/code would offer a huge improvement when python code needs to work with the Spark/Hadoop ecosystem. The current mechanisms which involve serializing data to/from Java can kill performance for some applications and can render Python unsuitable for these cases. John -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Wed Mar 23 14:47:46 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Wed, 23 Mar 2016 20:47:46 +0200 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year In-Reply-To: References: Message-ID: Hi John I understand why you're bringing this up, but it's a huge project on it's own, worth at least a couple months worth of work. Without a dedicated effort from someone I'm worried it would not go anywhere. It's kind of separated from the other goals of the summit On Wed, Mar 23, 2016 at 8:16 PM, John Camara wrote: > Hi Nathaniel, > > I would like to suggest one more topic for the workshop. I see a big need > for a library (jffi) similar to cffi but that provides a bridge to Java > instead of C code. The ability to seamlessly work with native Java data/code > would offer a huge improvement when python code needs to work with the > Spark/Hadoop ecosystem. The current mechanisms which involve serializing > data to/from Java can kill performance for some applications and can render > Python unsuitable for these cases. > > John > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From john.m.camara at gmail.com Wed Mar 23 16:22:30 2016 From: john.m.camara at gmail.com (John Camara) Date: Wed, 23 Mar 2016 16:22:30 -0400 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year In-Reply-To: References: Message-ID: Hi Fijal, I agree that jffi would be both a large project and without someone leading it, it would likely not get any where. But I tend to disagree that it would be a separate goal for the conference. I realize the goal of the summit is to talk about native-code compilation for Python and most would argue that means executing C code, assembly, or at the very least executing code at the speed of "C code". But the reality now is, numerical/scientific programming increasingly needs executing in a clustered environment. So I think we need to be careful to not only solve yesterday's problems but make sure we are covering the current day and future ones. 
Today, big data and analytics, which is driving most numerical/scientific programming, is becoming almost exclusively run in a clustered environment, with the Apache Spark ecosystem as the de facto standard. A few years back, Python's ace up its sleeve for the scientific community was the numpy/scipy ecosystem but we have recently lost that edge by falling behind in clustered computing. At this point in time our best move forward on the numerical/scientific fronts is to become best buddies with the Spark ecosystem and make sure we can bring bridge the numpy/scipy ecosystem to it. That is we merge the best of both worlds and suddenly Python becomes to go to language again for numerical/scientific computing. Of course we still need to address what should have been yesterday's problem and deal with the "native-code compilation" issues. John On Wed, Mar 23, 2016 at 2:47 PM, Maciej Fijalkowski wrote: > Hi John > > I understand why you're bringing this up, but it's a huge project on > it's own, worth at least a couple months worth of work. Without a > dedicated effort from someone I'm worried it would not go anywhere. > It's kind of separated from the other goals of the summit > > On Wed, Mar 23, 2016 at 8:16 PM, John Camara > wrote: > > Hi Nathaniel, > > > > I would like to suggest one more topic for the workshop. I see a big need > > for a library (jffi) similar to cffi but that provides a bridge to Java > > instead of C code. The ability to seamlessly work with native Java > data/code > > would offer a huge improvement when python code needs to work with the > > Spark/Hadoop ecosystem. The current mechanisms which involve serializing > > data to/from Java can kill performance for some applications and can > render > > Python unsuitable for these cases. > > > > John > > > > _______________________________________________ > > pypy-dev mailing list > > pypy-dev at python.org > > https://mail.python.org/mailman/listinfo/pypy-dev > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From arigo at tunes.org Wed Mar 23 16:48:56 2016 From: arigo at tunes.org (Armin Rigo) Date: Wed, 23 Mar 2016 21:48:56 +0100 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year In-Reply-To: References: Message-ID: Hi John, On 23 March 2016 at 19:16, John Camara wrote: > I would like to suggest one more topic for the workshop. I see a big need > for a library (jffi) similar to cffi but that provides a bridge to Java > instead of C code. The ability to seamlessly work with native Java data/code > would offer a huge improvement (...) Isn't it what JPype does? Can you describe how it isn't suitable for your needs? A bient?t, Armin. From arigo at tunes.org Wed Mar 23 17:39:06 2016 From: arigo at tunes.org (Armin Rigo) Date: Wed, 23 Mar 2016 22:39:06 +0100 Subject: [pypy-dev] Copy-on-write list slicing as GSoC project In-Reply-To: References: Message-ID: Hi Nikolay, On 14 March 2016 at 19:15, ??????? ????? wrote: > I found implementing copy-on-write list slicing particularly interesting for > me. Below go my ideas. Note, that at some places I see different possible > choices so I need feedback. Thanks for the early proposal; you should submit it to google's system very soon. I'm sorry it didn't receive more active feedback from the main mentors. One of the reasons is that this is likely more involved than you describe. In order to efficiently implement copy-on-write list slicing, we would need some special GC support. 
Otherwise, as you describe, there is the problem that as soon as there exist a slice anywhere, we cannot any more modify a big list without making a copy of the whole list. Moreover, there is also the issue that if 'mylist[1:5]' is kept alive, then the whole 'mylist' is also kept alive, even if it would not be necessary; this can consume some extra memory but more importantly it can delay calling destructors for arbitrarily long periods of time. So, serious work on this topic should start with designing a usable GC interface which fixes these problems; a bit like weakrefs, which are a general GC interface. The problem is that we don't really know what such an interface could look like. A bient?t, Armin. From lac at openend.se Wed Mar 23 18:05:32 2016 From: lac at openend.se (Laura Creighton) Date: Wed, 23 Mar 2016 23:05:32 +0100 Subject: [pypy-dev] [Jython-dev] [ANN] Python compilers workshop at SciPy this year (fwd) Message-ID: <201603232205.u2NM5W16016319@theraft.openend.se> This from the Jython mailing list. Are we sending somebody? It's the first I heard about it, at any rate. Laura ------- Forwarded Message Return-Path: Received: from lists.sourceforge.net (lists.sourceforge.net [216.34.181.88]) From: Nathaniel Smith To: jython-dev at lists.sourceforge.net Subject: [Jython-dev] [ANN] Python compilers workshop at SciPy this year Hi Jython folks, I wanted to give a heads-up to a workshop I'm organizing at SciPy this year that might be of interest to you: What: A two-day workshop bringing together folks working on JIT/AOT compilation in Python. When/where: July 11-12, in Austin, Texas. (This is co-located with SciPy 2016, at the same time as the tutorial sessions, just before the conference proper.) Website: https://python-compilers-workshop.github.io/ Note that I anticipate that we'll be able to get sponsorship funding to cover travel costs for folks who can't get their employers to foot the bill. Cheers, - -n - -- Nathaniel J. Smith -- https://vorpus.org - ------------------------------------------------------------------------------ Transform Data into Opportunity. Accelerate data analysis in your applications with Intel Data Analytics Acceleration Library. Click to learn more. http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140 _______________________________________________ Jython-dev mailing list Jython-dev at lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/jython-dev ------- End of Forwarded Message From fijall at gmail.com Wed Mar 23 18:16:35 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 24 Mar 2016 00:16:35 +0200 Subject: [pypy-dev] [Jython-dev] [ANN] Python compilers workshop at SciPy this year (fwd) In-Reply-To: <201603232205.u2NM5W16016319@theraft.openend.se> References: <201603232205.u2NM5W16016319@theraft.openend.se> Message-ID: We're probably sending myself and matti On Thu, Mar 24, 2016 at 12:05 AM, Laura Creighton wrote: > This from the Jython mailing list. Are we sending somebody? It's the > first I heard about it, at any rate. 
> > Laura > > ------- Forwarded Message > > Return-Path: > Received: from lists.sourceforge.net (lists.sourceforge.net [216.34.181.88]) > From: Nathaniel Smith > To: jython-dev at lists.sourceforge.net > Subject: [Jython-dev] [ANN] Python compilers workshop at SciPy this year > > Hi Jython folks, > > I wanted to give a heads-up to a workshop I'm organizing at SciPy this > year that might be of interest to you: > > What: A two-day workshop bringing together folks working on JIT/AOT > compilation in Python. > > When/where: July 11-12, in Austin, Texas. > > (This is co-located with SciPy 2016, at the same time as the tutorial > sessions, just before the conference proper.) > > Website: https://python-compilers-workshop.github.io/ > > Note that I anticipate that we'll be able to get sponsorship funding > to cover travel costs for folks who can't get their employers to foot > the bill. > > Cheers, > - -n > > - -- > Nathaniel J. Smith -- https://vorpus.org > > - ------------------------------------------------------------------------------ > Transform Data into Opportunity. > Accelerate data analysis in your applications with > Intel Data Analytics Acceleration Library. > Click to learn more. > http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140 > _______________________________________________ > Jython-dev mailing list > Jython-dev at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/jython-dev > > ------- End of Forwarded Message > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From nzinov at gmail.com Thu Mar 24 02:03:14 2016 From: nzinov at gmail.com (=?UTF-8?B?0J3QuNC60L7Qu9Cw0Lkg0JfQuNC90L7Qsg==?=) Date: Thu, 24 Mar 2016 06:03:14 +0000 Subject: [pypy-dev] Copy-on-write list slicing as GSoC project In-Reply-To: References: Message-ID: Hi Armin, Thanks for your feedback. As you mention there needed some more research on this problem, so I think I should not apply for GSoC and rather do some work out of its scope. A special-case GC interface is interesting direction and I am going to take a look at weekrefs. Cheers, Nikolay. ??, 24 ???. 2016 ?. ? 0:39, Armin Rigo : > Hi Nikolay, > > On 14 March 2016 at 19:15, ??????? ????? wrote: > > I found implementing copy-on-write list slicing particularly interesting > for > > me. Below go my ideas. Note, that at some places I see different possible > > choices so I need feedback. > > Thanks for the early proposal; you should submit it to google's system > very soon. I'm sorry it didn't receive more active feedback from the > main mentors. One of the reasons is that this is likely more involved > than you describe. > > In order to efficiently implement copy-on-write list slicing, we would > need some special GC support. Otherwise, as you describe, there is > the problem that as soon as there exist a slice anywhere, we cannot > any more modify a big list without making a copy of the whole list. > Moreover, there is also the issue that if 'mylist[1:5]' is kept alive, > then the whole 'mylist' is also kept alive, even if it would not be > necessary; this can consume some extra memory but more importantly it > can delay calling destructors for arbitrarily long periods of time. > > So, serious work on this topic should start with designing a usable GC > interface which fixes these problems; a bit like weakrefs, which are a > general GC interface. The problem is that we don't really know what > such an interface could look like. 
> > > A bient?t, > > Armin. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hakan.ardo at gmail.com Thu Mar 24 02:32:48 2016 From: hakan.ardo at gmail.com (Hakan Ardo) Date: Thu, 24 Mar 2016 07:32:48 +0100 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year In-Reply-To: References: Message-ID: On Mar 23, 2016 21:49, "Armin Rigo" wrote: > > Hi John, > > On 23 March 2016 at 19:16, John Camara wrote: > > I would like to suggest one more topic for the workshop. I see a big need > > for a library (jffi) similar to cffi but that provides a bridge to Java > > instead of C code. The ability to seamlessly work with native Java data/code > > would offer a huge improvement (...) > > Isn't it what JPype does? Can you describe how it isn't suitable for > your needs? There is also PyJNIus: https://pyjnius.readthedocs.org/en/latest/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.m.camara at gmail.com Thu Mar 24 08:22:33 2016 From: john.m.camara at gmail.com (John Camara) Date: Thu, 24 Mar 2016 08:22:33 -0400 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year In-Reply-To: References: Message-ID: Besides JPype and PyJNIus there is also https://www.py4j.org/. I haven't heard of JPype being used in any recent projects so I assuming it is outdated by now. PyJNIus gets used but I tend to only see it used on Android projects. The Py4J project gets used often in numerical/scientific projects mainly due to it use in PySpark. The problem with all these libraries is that they don't have a way to share large amounts of memory between the JVM and Python VMs and so large chunks of data have to be copied/serialized when going between the 2 VMs. Spark is the de facto standard in clustering computing at this point in time. At a high level Spark executes code that is distributed throughout a cluster so that the code being executed is as close as possible to where the data lives so as to minimize transferring of large amounts of data. The code that needs to be executed are packaged up into units called Resilient Distributed Dataset (RDD). RDDs are lazy evaluated and are essential graphs of the operations that need to be performed on the data. They are capable of reading data from many types of sources, outputting to multiple types of sources, containing the code that needs to be executed, and are also responsible to caching or keeping results in memory for future RDDs that maybe executed. If you write all your code in Java or Scala, its execution will be performed in JVMs distributed in the cluster. On the other hand, Spark does not limit its use to only Java based languages so Python can be used. In the case of Python the PySpark library is used. When Python is used, the PySpark library can be used to define the RDDs that will be executed under the JVM. In this scenario, only if required, the final results of the calculations will end up being passed to Python. I say only if necessary as its possible the end results may just be left in memory or to create an output such as an hdfs file in hadoop and does not need to be transferred to Python. Under this scenario the code is written in Python but effectively all the "real" work is performed under the JVM. Often someone writing Python is also going to want to perform some of the operations under Python. 
This can be done as the RDDs that are created can contain both operations that get performed under the JVM as well as Python (and of course other languages are supported). When Python is involved Spark will start up Python VMs on the required nodes so that the Python portions of the work can be performed. The Python VMs can either be CPython, PyPy or even a mix of both CPython and PyPy. The downside to using non Java languages is the overhead of passing data between the JVM and the Python VM as the memory is not shared between the processes but instead copied/serialized between them. Because this data is copied between the 2 VMs, anyone who writes Python code for this environment always has to be conscious of the data being copied between the processes so as to not let the amount of the extra overhead become a large burden. Quite often the goal will be to first perform the bulk of the operations under the JVM and then hopefully only a smaller subset of the data will have to be processed under Python. If this can be done then the overhead can be minimized and then there is essential no down sides to using Python in the pipeline of operations. If your unfortunate and need to perform some of the processing early in the pipline under Python and worse yet if there is a need to go back and forth many times between Python and Java the overhead of coping huge amounts of data can significantly slow things down which essentially puts Python at a disadvantage to Java. If it was possible to change the model of execution such that it was possible to embed the Python VM in the JVM or vice versa and that the memory could be shared between the 2 VMs the downside of using Python in this environment would be eliminated or at the very least minimized to the point where it is no longer an issue. Thus the need for a jffi library. There is a strong desire by many to use dynamic languages in these clustered environments and Python is likely in the best position to become the language of choice due to its ability to work with C based libraries and of course its syntax. The issues that hold Python back at this point is the serialization overhead, not so great state of packaging, and not having both the speed of the JIT and complete access to numpy/scipy ecosystem. Luckily for Python at this point there is no other dynamic language that is a clear winner today. But if too much time passes before these issues are solved I'm sure another language will step up to the plate. At this point my expectations is that Node could likely make a move. It already has the speed due to the Java Script JITs, it already has a great story for packaging and deployment, and its growth is exploding on the server side due to all the money being poured into it. What it strongly lacks today is the connection to C/legacy code, numerical/scientific modules and of course it also does not have a solution to the data copying overhead it also has with the JVM. Any way, this is just my 2 cents on what is currently holding Python back from taking off in this space. On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo wrote: > > On Mar 23, 2016 21:49, "Armin Rigo" wrote: > > > > Hi John, > > > > On 23 March 2016 at 19:16, John Camara wrote: > > > I would like to suggest one more topic for the workshop. I see a big > need > > > for a library (jffi) similar to cffi but that provides a bridge to Java > > > instead of C code. The ability to seamlessly work with native Java > data/code > > > would offer a huge improvement (...) > > > > Isn't it what JPype does? 
Can you describe how it isn't suitable for > > your needs? > > There is also PyJNIus: > > https://pyjnius.readthedocs.org/en/latest/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Thu Mar 24 08:56:53 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 24 Mar 2016 14:56:53 +0200 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year In-Reply-To: References: Message-ID: Hi John Thanks for explaining the current situation of the ecosystem. I'm not quite sure what your intention is. PyPy (and CPython) is very easy to embed through any C-level API, especially with the latest additions to cffi embedding. If someone feels like doing the work to share stuff that way (as I presume a lot of data presented in JVM can be represented as some pointer and shape how to access it), then he's obviously more than free to do so, I'm even willing to help with that. Now this seems like a medium-to-big size project that additionally will require quite a bit of community will to endorse. Are you willing to volunteer to work on such a project and dedicate a lot of time to it? If not, then there is no way you can convince us to volunteer our own time to do it - it's just too big and quite a bit far out of our usual areas of interest. If there is some commercial interest (and I think there might be) in pushing python and especially pypy further in that area, we might want to have a better story for numpy first, but then feel free to send those corporate interest people my way, we can maybe organize something. If you want us to do community service to push Python solutions in the area I have very little clue about however, I would like to politely decline. Cheers, fijal On Thu, Mar 24, 2016 at 2:22 PM, John Camara wrote: > Besides JPype and PyJNIus there is also https://www.py4j.org/. I haven't > heard of JPype being used in any recent projects so I assuming it is > outdated by now. PyJNIus gets used but I tend to only see it used on > Android projects. The Py4J project gets used often in numerical/scientific > projects mainly due to it use in PySpark. The problem with all these > libraries is that they don't have a way to share large amounts of memory > between the JVM and Python VMs and so large chunks of data have to be > copied/serialized when going between the 2 VMs. > > Spark is the de facto standard in clustering computing at this point in > time. At a high level Spark executes code that is distributed throughout a > cluster so that the code being executed is as close as possible to where the > data lives so as to minimize transferring of large amounts of data. The > code that needs to be executed are packaged up into units called Resilient > Distributed Dataset (RDD). RDDs are lazy evaluated and are essential graphs > of the operations that need to be performed on the data. They are capable > of reading data from many types of sources, outputting to multiple types of > sources, containing the code that needs to be executed, and are also > responsible to caching or keeping results in memory for future RDDs that > maybe executed. > > If you write all your code in Java or Scala, its execution will be performed > in JVMs distributed in the cluster. On the other hand, Spark does not limit > its use to only Java based languages so Python can be used. In the case of > Python the PySpark library is used. When Python is used, the PySpark > library can be used to define the RDDs that will be executed under the JVM. 
> In this scenario, only if required, the final results of the calculations > will end up being passed to Python. I say only if necessary as its possible > the end results may just be left in memory or to create an output such as an > hdfs file in hadoop and does not need to be transferred to Python. Under > this scenario the code is written in Python but effectively all the "real" > work is performed under the JVM. > > Often someone writing Python is also going to want to perform some of the > operations under Python. This can be done as the RDDs that are created can > contain both operations that get performed under the JVM as well as Python > (and of course other languages are supported). When Python is involved > Spark will start up Python VMs on the required nodes so that the Python > portions of the work can be performed. The Python VMs can either be > CPython, PyPy or even a mix of both CPython and PyPy. The downside to using > non Java languages is the overhead of passing data between the JVM and the > Python VM as the memory is not shared between the processes but instead > copied/serialized between them. > > Because this data is copied between the 2 VMs, anyone who writes Python code > for this environment always has to be conscious of the data being copied > between the processes so as to not let the amount of the extra overhead > become a large burden. Quite often the goal will be to first perform the > bulk of the operations under the JVM and then hopefully only a smaller > subset of the data will have to be processed under Python. If this can be > done then the overhead can be minimized and then there is essential no down > sides to using Python in the pipeline of operations. > > If your unfortunate and need to perform some of the processing early in the > pipline under Python and worse yet if there is a need to go back and forth > many times between Python and Java the overhead of coping huge amounts of > data can significantly slow things down which essentially puts Python at a > disadvantage to Java. > > If it was possible to change the model of execution such that it was > possible to embed the Python VM in the JVM or vice versa and that the memory > could be shared between the 2 VMs the downside of using Python in this > environment would be eliminated or at the very least minimized to the point > where it is no longer an issue. Thus the need for a jffi library. > > There is a strong desire by many to use dynamic languages in these clustered > environments and Python is likely in the best position to become the > language of choice due to its ability to work with C based libraries and of > course its syntax. The issues that hold Python back at this point is the > serialization overhead, not so great state of packaging, and not having both > the speed of the JIT and complete access to numpy/scipy ecosystem. > > Luckily for Python at this point there is no other dynamic language that is > a clear winner today. But if too much time passes before these issues are > solved I'm sure another language will step up to the plate. At this point > my expectations is that Node could likely make a move. It already has the > speed due to the Java Script JITs, it already has a great story for > packaging and deployment, and its growth is exploding on the server side due > to all the money being poured into it. 
What it strongly lacks today is the > connection to C/legacy code, numerical/scientific modules and of course it > also does not have a solution to the data copying overhead it also has with > the JVM. > > Any way, this is just my 2 cents on what is currently holding Python back > from taking off in this space. > > On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo wrote: >> >> >> On Mar 23, 2016 21:49, "Armin Rigo" wrote: >> > >> > Hi John, >> > >> > On 23 March 2016 at 19:16, John Camara wrote: >> > > I would like to suggest one more topic for the workshop. I see a big >> > > need >> > > for a library (jffi) similar to cffi but that provides a bridge to >> > > Java >> > > instead of C code. The ability to seamlessly work with native Java >> > > data/code >> > > would offer a huge improvement (...) >> > >> > Isn't it what JPype does? Can you describe how it isn't suitable for >> > your needs? >> >> There is also PyJNIus: >> >> https://pyjnius.readthedocs.org/en/latest/ > > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From john.m.camara at gmail.com Thu Mar 24 11:23:54 2016 From: john.m.camara at gmail.com (John Camara) Date: Thu, 24 Mar 2016 11:23:54 -0400 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year In-Reply-To: References: Message-ID: Hi Fijal, I understand where your coming from and not trying to convince you to work on it. Just mainly trying to point out a need that may not be obvious to this community. I don't spend much time on big data and analytics so I don't have a lot of time to devote to this task. That could change in the future so you never know I may end up getting involved with this. At the end of the day I think it is the PSF, which needs to do an honest assessment of the current state of Python and in programming in general, so that they can help direct the future of Python. I think with an honest assessment it should be clear that it is absolutely necessary that a dynamic language have a JIT. Otherwise, a language like Node would not be growing so quickly on the server side. An honest assessment would conclude that Python needs to play a major role in big data and analytics as we don't want this to be another area where Python misses the boat. As with all languages other than JavaScript we missed playing an important role on web front end. More recently we missed out on mobile. I don't think it is good for us to miss out on big data. It would be a shame since we had such a strong scientific community which initially gave us a huge advantage over other communities. Missing out on big data might also be the driver that moves the scientific community in a different direction which would be a big loss to Python. I personally don't see any particular companies or industries that are willing to fund the tasks needed to solve these issues. It's not to say there are no more funds for Python projects its just likely no one company will be willing to fund these kinds of projects on their own. It really needs the PSF to coordinate these efforts but they seamed to be more focus on trying to make Python 3 a success instead of improving the overall health of the community. I believe that Python is in pretty good shape in being able to solve these issues but it just needs some funding and focus to get there. Hopefully the workshop will be successful and help create some focus. 
John On Thu, Mar 24, 2016 at 8:56 AM, Maciej Fijalkowski wrote: > Hi John > > Thanks for explaining the current situation of the ecosystem. I'm not > quite sure what your intention is. PyPy (and CPython) is very easy to > embed through any C-level API, especially with the latest additions to > cffi embedding. If someone feels like doing the work to share stuff > that way (as I presume a lot of data presented in JVM can be > represented as some pointer and shape how to access it), then he's > obviously more than free to do so, I'm even willing to help with that. > Now this seems like a medium-to-big size project that additionally > will require quite a bit of community will to endorse. Are you willing > to volunteer to work on such a project and dedicate a lot of time to > it? If not, then there is no way you can convince us to volunteer our > own time to do it - it's just too big and quite a bit far out of our > usual areas of interest. If there is some commercial interest (and I > think there might be) in pushing python and especially pypy further in > that area, we might want to have a better story for numpy first, but > then feel free to send those corporate interest people my way, we can > maybe organize something. If you want us to do community service to > push Python solutions in the area I have very little clue about > however, I would like to politely decline. > > Cheers, > fijal > > On Thu, Mar 24, 2016 at 2:22 PM, John Camara > wrote: > > Besides JPype and PyJNIus there is also https://www.py4j.org/. I > haven't > > heard of JPype being used in any recent projects so I assuming it is > > outdated by now. PyJNIus gets used but I tend to only see it used on > > Android projects. The Py4J project gets used often in > numerical/scientific > > projects mainly due to it use in PySpark. The problem with all these > > libraries is that they don't have a way to share large amounts of memory > > between the JVM and Python VMs and so large chunks of data have to be > > copied/serialized when going between the 2 VMs. > > > > Spark is the de facto standard in clustering computing at this point in > > time. At a high level Spark executes code that is distributed > throughout a > > cluster so that the code being executed is as close as possible to where > the > > data lives so as to minimize transferring of large amounts of data. The > > code that needs to be executed are packaged up into units called > Resilient > > Distributed Dataset (RDD). RDDs are lazy evaluated and are essential > graphs > > of the operations that need to be performed on the data. They are > capable > > of reading data from many types of sources, outputting to multiple types > of > > sources, containing the code that needs to be executed, and are also > > responsible to caching or keeping results in memory for future RDDs that > > maybe executed. > > > > If you write all your code in Java or Scala, its execution will be > performed > > in JVMs distributed in the cluster. On the other hand, Spark does not > limit > > its use to only Java based languages so Python can be used. In the case > of > > Python the PySpark library is used. When Python is used, the PySpark > > library can be used to define the RDDs that will be executed under the > JVM. > > In this scenario, only if required, the final results of the calculations > > will end up being passed to Python. 
I say only if necessary as its > possible > > the end results may just be left in memory or to create an output such > as an > > hdfs file in hadoop and does not need to be transferred to Python. Under > > this scenario the code is written in Python but effectively all the > "real" > > work is performed under the JVM. > > > > Often someone writing Python is also going to want to perform some of the > > operations under Python. This can be done as the RDDs that are created > can > > contain both operations that get performed under the JVM as well as > Python > > (and of course other languages are supported). When Python is involved > > Spark will start up Python VMs on the required nodes so that the Python > > portions of the work can be performed. The Python VMs can either be > > CPython, PyPy or even a mix of both CPython and PyPy. The downside to > using > > non Java languages is the overhead of passing data between the JVM and > the > > Python VM as the memory is not shared between the processes but instead > > copied/serialized between them. > > > > Because this data is copied between the 2 VMs, anyone who writes Python > code > > for this environment always has to be conscious of the data being copied > > between the processes so as to not let the amount of the extra overhead > > become a large burden. Quite often the goal will be to first perform the > > bulk of the operations under the JVM and then hopefully only a smaller > > subset of the data will have to be processed under Python. If this can > be > > done then the overhead can be minimized and then there is essential no > down > > sides to using Python in the pipeline of operations. > > > > If your unfortunate and need to perform some of the processing early in > the > > pipline under Python and worse yet if there is a need to go back and > forth > > many times between Python and Java the overhead of coping huge amounts of > > data can significantly slow things down which essentially puts Python at > a > > disadvantage to Java. > > > > If it was possible to change the model of execution such that it was > > possible to embed the Python VM in the JVM or vice versa and that the > memory > > could be shared between the 2 VMs the downside of using Python in this > > environment would be eliminated or at the very least minimized to the > point > > where it is no longer an issue. Thus the need for a jffi library. > > > > There is a strong desire by many to use dynamic languages in these > clustered > > environments and Python is likely in the best position to become the > > language of choice due to its ability to work with C based libraries and > of > > course its syntax. The issues that hold Python back at this point is the > > serialization overhead, not so great state of packaging, and not having > both > > the speed of the JIT and complete access to numpy/scipy ecosystem. > > > > Luckily for Python at this point there is no other dynamic language that > is > > a clear winner today. But if too much time passes before these issues > are > > solved I'm sure another language will step up to the plate. At this > point > > my expectations is that Node could likely make a move. It already has > the > > speed due to the Java Script JITs, it already has a great story for > > packaging and deployment, and its growth is exploding on the server side > due > > to all the money being poured into it. 
> > What it strongly lacks today is the connection to C/legacy code and
> > numerical/scientific modules, and of course it also does not have a
> > solution to the data copying overhead it also has with the JVM.
> >
> > Anyway, this is just my 2 cents on what is currently holding Python back
> > from taking off in this space.
> >
> > On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo wrote:
> >>
> >> On Mar 23, 2016 21:49, "Armin Rigo" wrote:
> >> >
> >> > Hi John,
> >> >
> >> > On 23 March 2016 at 19:16, John Camara wrote:
> >> > > I would like to suggest one more topic for the workshop. I see a big
> >> > > need for a library (jffi) similar to cffi but that provides a bridge
> >> > > to Java instead of C code. The ability to seamlessly work with native
> >> > > Java data/code would offer a huge improvement (...)
> >> >
> >> > Isn't it what JPype does? Can you describe how it isn't suitable for
> >> > your needs?
> >>
> >> There is also PyJNIus:
> >>
> >> https://pyjnius.readthedocs.org/en/latest/
> >
> >
> > _______________________________________________
> > pypy-dev mailing list
> > pypy-dev at python.org
> > https://mail.python.org/mailman/listinfo/pypy-dev
> >
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From fijall at gmail.com  Thu Mar 24 11:32:05 2016
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Thu, 24 Mar 2016 17:32:05 +0200
Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year
In-Reply-To: 
References: 
Message-ID: 

Ok fine, but we're not the recipients of such a message.

Please lobby PSF for having a JIT, we all support that :-)

On Thu, Mar 24, 2016 at 5:23 PM, John Camara wrote:
> Hi Fijal,
>
> I understand where you're coming from and am not trying to convince you to
> work on it. Just mainly trying to point out a need that may not be obvious
> to this community. I don't spend much time on big data and analytics so I
> don't have a lot of time to devote to this task. That could change in the
> future so you never know I may end up getting involved with this.
>
> At the end of the day I think it is the PSF, which needs to do an honest
> assessment of the current state of Python and of programming in general, so
> that they can help direct the future of Python. I think with an honest
> assessment it should be clear that it is absolutely necessary that a
> dynamic language have a JIT. Otherwise, a language like Node would not be
> growing so quickly on the server side. An honest assessment would conclude
> that Python needs to play a major role in big data and analytics as we
> don't want this to be another area where Python misses the boat. As with
> all languages other than JavaScript we missed playing an important role on
> the web front end. More recently we missed out on mobile. I don't think it
> is good for us to miss out on big data. It would be a shame since we had
> such a strong scientific community which initially gave us a huge advantage
> over other communities. Missing out on big data might also be the driver
> that moves the scientific community in a different direction, which would
> be a big loss to Python.
>
> I personally don't see any particular companies or industries that are
> willing to fund the tasks needed to solve these issues. It's not to say
> there are no more funds for Python projects, it's just likely no one
> company will be willing to fund these kinds of projects on their own.
It really > needs the PSF to coordinate these efforts but they seamed to be more focus > on trying to make Python 3 a success instead of improving the overall health > of the community. > > I believe that Python is in pretty good shape in being able to solve these > issues but it just needs some funding and focus to get there. > > Hopefully the workshop will be successful and help create some focus. > > John > > On Thu, Mar 24, 2016 at 8:56 AM, Maciej Fijalkowski > wrote: >> >> Hi John >> >> Thanks for explaining the current situation of the ecosystem. I'm not >> quite sure what your intention is. PyPy (and CPython) is very easy to >> embed through any C-level API, especially with the latest additions to >> cffi embedding. If someone feels like doing the work to share stuff >> that way (as I presume a lot of data presented in JVM can be >> represented as some pointer and shape how to access it), then he's >> obviously more than free to do so, I'm even willing to help with that. >> Now this seems like a medium-to-big size project that additionally >> will require quite a bit of community will to endorse. Are you willing >> to volunteer to work on such a project and dedicate a lot of time to >> it? If not, then there is no way you can convince us to volunteer our >> own time to do it - it's just too big and quite a bit far out of our >> usual areas of interest. If there is some commercial interest (and I >> think there might be) in pushing python and especially pypy further in >> that area, we might want to have a better story for numpy first, but >> then feel free to send those corporate interest people my way, we can >> maybe organize something. If you want us to do community service to >> push Python solutions in the area I have very little clue about >> however, I would like to politely decline. >> >> Cheers, >> fijal >> >> On Thu, Mar 24, 2016 at 2:22 PM, John Camara >> wrote: >> > Besides JPype and PyJNIus there is also https://www.py4j.org/. I >> > haven't >> > heard of JPype being used in any recent projects so I assuming it is >> > outdated by now. PyJNIus gets used but I tend to only see it used on >> > Android projects. The Py4J project gets used often in >> > numerical/scientific >> > projects mainly due to it use in PySpark. The problem with all these >> > libraries is that they don't have a way to share large amounts of memory >> > between the JVM and Python VMs and so large chunks of data have to be >> > copied/serialized when going between the 2 VMs. >> > >> > Spark is the de facto standard in clustering computing at this point in >> > time. At a high level Spark executes code that is distributed >> > throughout a >> > cluster so that the code being executed is as close as possible to where >> > the >> > data lives so as to minimize transferring of large amounts of data. The >> > code that needs to be executed are packaged up into units called >> > Resilient >> > Distributed Dataset (RDD). RDDs are lazy evaluated and are essential >> > graphs >> > of the operations that need to be performed on the data. They are >> > capable >> > of reading data from many types of sources, outputting to multiple types >> > of >> > sources, containing the code that needs to be executed, and are also >> > responsible to caching or keeping results in memory for future RDDs that >> > maybe executed. >> > >> > If you write all your code in Java or Scala, its execution will be >> > performed >> > in JVMs distributed in the cluster. 
On the other hand, Spark does not >> > limit >> > its use to only Java based languages so Python can be used. In the case >> > of >> > Python the PySpark library is used. When Python is used, the PySpark >> > library can be used to define the RDDs that will be executed under the >> > JVM. >> > In this scenario, only if required, the final results of the >> > calculations >> > will end up being passed to Python. I say only if necessary as its >> > possible >> > the end results may just be left in memory or to create an output such >> > as an >> > hdfs file in hadoop and does not need to be transferred to Python. Under >> > this scenario the code is written in Python but effectively all the >> > "real" >> > work is performed under the JVM. >> > >> > Often someone writing Python is also going to want to perform some of >> > the >> > operations under Python. This can be done as the RDDs that are created >> > can >> > contain both operations that get performed under the JVM as well as >> > Python >> > (and of course other languages are supported). When Python is involved >> > Spark will start up Python VMs on the required nodes so that the Python >> > portions of the work can be performed. The Python VMs can either be >> > CPython, PyPy or even a mix of both CPython and PyPy. The downside to >> > using >> > non Java languages is the overhead of passing data between the JVM and >> > the >> > Python VM as the memory is not shared between the processes but instead >> > copied/serialized between them. >> > >> > Because this data is copied between the 2 VMs, anyone who writes Python >> > code >> > for this environment always has to be conscious of the data being copied >> > between the processes so as to not let the amount of the extra overhead >> > become a large burden. Quite often the goal will be to first perform >> > the >> > bulk of the operations under the JVM and then hopefully only a smaller >> > subset of the data will have to be processed under Python. If this can >> > be >> > done then the overhead can be minimized and then there is essential no >> > down >> > sides to using Python in the pipeline of operations. >> > >> > If your unfortunate and need to perform some of the processing early in >> > the >> > pipline under Python and worse yet if there is a need to go back and >> > forth >> > many times between Python and Java the overhead of coping huge amounts >> > of >> > data can significantly slow things down which essentially puts Python at >> > a >> > disadvantage to Java. >> > >> > If it was possible to change the model of execution such that it was >> > possible to embed the Python VM in the JVM or vice versa and that the >> > memory >> > could be shared between the 2 VMs the downside of using Python in this >> > environment would be eliminated or at the very least minimized to the >> > point >> > where it is no longer an issue. Thus the need for a jffi library. >> > >> > There is a strong desire by many to use dynamic languages in these >> > clustered >> > environments and Python is likely in the best position to become the >> > language of choice due to its ability to work with C based libraries and >> > of >> > course its syntax. The issues that hold Python back at this point is >> > the >> > serialization overhead, not so great state of packaging, and not having >> > both >> > the speed of the JIT and complete access to numpy/scipy ecosystem. >> > >> > Luckily for Python at this point there is no other dynamic language that >> > is >> > a clear winner today. 
>> > But if too much time passes before these issues are solved I'm sure
>> > another language will step up to the plate. At this point my expectation
>> > is that Node could likely make a move. It already has the speed due to
>> > the JavaScript JITs, it already has a great story for packaging and
>> > deployment, and its growth is exploding on the server side due to all
>> > the money being poured into it. What it strongly lacks today is the
>> > connection to C/legacy code and numerical/scientific modules, and of
>> > course it also does not have a solution to the data copying overhead it
>> > also has with the JVM.
>> >
>> > Anyway, this is just my 2 cents on what is currently holding Python back
>> > from taking off in this space.
>> >
>> > On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo wrote:
>> >>
>> >> On Mar 23, 2016 21:49, "Armin Rigo" wrote:
>> >> >
>> >> > Hi John,
>> >> >
>> >> > On 23 March 2016 at 19:16, John Camara wrote:
>> >> > > I would like to suggest one more topic for the workshop. I see a
>> >> > > big need for a library (jffi) similar to cffi but that provides a
>> >> > > bridge to Java instead of C code. The ability to seamlessly work
>> >> > > with native Java data/code would offer a huge improvement (...)
>> >> >
>> >> > Isn't it what JPype does? Can you describe how it isn't suitable for
>> >> > your needs?
>> >>
>> >> There is also PyJNIus:
>> >>
>> >> https://pyjnius.readthedocs.org/en/latest/
>> >
>> >
>> > _______________________________________________
>> > pypy-dev mailing list
>> > pypy-dev at python.org
>> > https://mail.python.org/mailman/listinfo/pypy-dev
>> >
>
>

From arigo at tunes.org  Thu Mar 24 12:20:57 2016
From: arigo at tunes.org (Armin Rigo)
Date: Thu, 24 Mar 2016 17:20:57 +0100
Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year
In-Reply-To: 
References: 
Message-ID: 

Hi John,

On 24 March 2016 at 13:22, John Camara wrote:
> (...) Thus the need for a jffi library.

When I hear "a jffi library" I'm thinking about a new library with a
new API. I think what you would really like instead is to keep the
existing libraries, but adapt them internally to allow tighter
execution of the Python and Java VMs.

I may be completely wrong about that, but you're also talking to the
wrong guys in the first place :-)


A bientôt,

Armin.

From dje.gcc at gmail.com  Thu Mar 24 12:31:46 2016
From: dje.gcc at gmail.com (David Edelsohn)
Date: Thu, 24 Mar 2016 12:31:46 -0400
Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year
In-Reply-To: 
References: 
Message-ID: 

Maciej,

How about a little more useful response of "we'll help you find the
right audience for this discussion and collaborate with you to make
the case."?

- David

On Thu, Mar 24, 2016 at 11:32 AM, Maciej Fijalkowski wrote:
> Ok fine, but we're not the recipients of such a message.
>
> Please lobby PSF for having a JIT, we all support that :-)
>
> On Thu, Mar 24, 2016 at 5:23 PM, John Camara wrote:
>> Hi Fijal,
>>
>> I understand where you're coming from and am not trying to convince you
>> to work on it. Just mainly trying to point out a need that may not be
>> obvious to this community. I don't spend much time on big data and
>> analytics so I don't have a lot of time to devote to this task. That
>> could change in the future so you never know I may end up getting
>> involved with this.
>> >> At the end of the day I think it is the PSF, which needs to do an honest >> assessment of the current state of Python and in programming in general, so >> that they can help direct the future of Python. I think with an honest >> assessment it should be clear that it is absolutely necessary that a dynamic >> language have a JIT. Otherwise, a language like Node would not be growing so >> quickly on the server side. An honest assessment would conclude that Python >> needs to play a major role in big data and analytics as we don't want this >> to be another area where Python misses the boat. As with all languages >> other than JavaScript we missed playing an important role on web front end. >> More recently we missed out on mobile. I don't think it is good for us to >> miss out on big data. It would be a shame since we had such a strong >> scientific community which initially gave us a huge advantage over other >> communities. Missing out on big data might also be the driver that moves >> the scientific community in a different direction which would be a big loss >> to Python. >> >> I personally don't see any particular companies or industries that are >> willing to fund the tasks needed to solve these issues. It's not to say >> there are no more funds for Python projects its just likely no one company >> will be willing to fund these kinds of projects on their own. It really >> needs the PSF to coordinate these efforts but they seamed to be more focus >> on trying to make Python 3 a success instead of improving the overall health >> of the community. >> >> I believe that Python is in pretty good shape in being able to solve these >> issues but it just needs some funding and focus to get there. >> >> Hopefully the workshop will be successful and help create some focus. >> >> John >> >> On Thu, Mar 24, 2016 at 8:56 AM, Maciej Fijalkowski >> wrote: >>> >>> Hi John >>> >>> Thanks for explaining the current situation of the ecosystem. I'm not >>> quite sure what your intention is. PyPy (and CPython) is very easy to >>> embed through any C-level API, especially with the latest additions to >>> cffi embedding. If someone feels like doing the work to share stuff >>> that way (as I presume a lot of data presented in JVM can be >>> represented as some pointer and shape how to access it), then he's >>> obviously more than free to do so, I'm even willing to help with that. >>> Now this seems like a medium-to-big size project that additionally >>> will require quite a bit of community will to endorse. Are you willing >>> to volunteer to work on such a project and dedicate a lot of time to >>> it? If not, then there is no way you can convince us to volunteer our >>> own time to do it - it's just too big and quite a bit far out of our >>> usual areas of interest. If there is some commercial interest (and I >>> think there might be) in pushing python and especially pypy further in >>> that area, we might want to have a better story for numpy first, but >>> then feel free to send those corporate interest people my way, we can >>> maybe organize something. If you want us to do community service to >>> push Python solutions in the area I have very little clue about >>> however, I would like to politely decline. >>> >>> Cheers, >>> fijal >>> >>> On Thu, Mar 24, 2016 at 2:22 PM, John Camara >>> wrote: >>> > Besides JPype and PyJNIus there is also https://www.py4j.org/. I >>> > haven't >>> > heard of JPype being used in any recent projects so I assuming it is >>> > outdated by now. 
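For reference, the bridge style all three of those libraries offer looks
roughly like the canonical Py4J example below. It assumes a Py4J
GatewayServer is already running inside the JVM; every call and every return
value is serialized over a local socket, which is the copying problem
described next:

    from py4j.java_gateway import JavaGateway

    gateway = JavaGateway()                  # connects to a JVM-side GatewayServer
    random = gateway.jvm.java.util.Random()  # instantiates java.util.Random in the JVM
    print(random.nextInt(10))                # the result travels back over the socket

    # Larger structures are copied the same way, call by call, so none of
    # these bridges help with sharing big datasets between the two VMs.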
PyJNIus gets used but I tend to only see it used on >>> > Android projects. The Py4J project gets used often in >>> > numerical/scientific >>> > projects mainly due to it use in PySpark. The problem with all these >>> > libraries is that they don't have a way to share large amounts of memory >>> > between the JVM and Python VMs and so large chunks of data have to be >>> > copied/serialized when going between the 2 VMs. >>> > >>> > Spark is the de facto standard in clustering computing at this point in >>> > time. At a high level Spark executes code that is distributed >>> > throughout a >>> > cluster so that the code being executed is as close as possible to where >>> > the >>> > data lives so as to minimize transferring of large amounts of data. The >>> > code that needs to be executed are packaged up into units called >>> > Resilient >>> > Distributed Dataset (RDD). RDDs are lazy evaluated and are essential >>> > graphs >>> > of the operations that need to be performed on the data. They are >>> > capable >>> > of reading data from many types of sources, outputting to multiple types >>> > of >>> > sources, containing the code that needs to be executed, and are also >>> > responsible to caching or keeping results in memory for future RDDs that >>> > maybe executed. >>> > >>> > If you write all your code in Java or Scala, its execution will be >>> > performed >>> > in JVMs distributed in the cluster. On the other hand, Spark does not >>> > limit >>> > its use to only Java based languages so Python can be used. In the case >>> > of >>> > Python the PySpark library is used. When Python is used, the PySpark >>> > library can be used to define the RDDs that will be executed under the >>> > JVM. >>> > In this scenario, only if required, the final results of the >>> > calculations >>> > will end up being passed to Python. I say only if necessary as its >>> > possible >>> > the end results may just be left in memory or to create an output such >>> > as an >>> > hdfs file in hadoop and does not need to be transferred to Python. Under >>> > this scenario the code is written in Python but effectively all the >>> > "real" >>> > work is performed under the JVM. >>> > >>> > Often someone writing Python is also going to want to perform some of >>> > the >>> > operations under Python. This can be done as the RDDs that are created >>> > can >>> > contain both operations that get performed under the JVM as well as >>> > Python >>> > (and of course other languages are supported). When Python is involved >>> > Spark will start up Python VMs on the required nodes so that the Python >>> > portions of the work can be performed. The Python VMs can either be >>> > CPython, PyPy or even a mix of both CPython and PyPy. The downside to >>> > using >>> > non Java languages is the overhead of passing data between the JVM and >>> > the >>> > Python VM as the memory is not shared between the processes but instead >>> > copied/serialized between them. >>> > >>> > Because this data is copied between the 2 VMs, anyone who writes Python >>> > code >>> > for this environment always has to be conscious of the data being copied >>> > between the processes so as to not let the amount of the extra overhead >>> > become a large burden. Quite often the goal will be to first perform >>> > the >>> > bulk of the operations under the JVM and then hopefully only a smaller >>> > subset of the data will have to be processed under Python. 
If this can >>> > be >>> > done then the overhead can be minimized and then there is essential no >>> > down >>> > sides to using Python in the pipeline of operations. >>> > >>> > If your unfortunate and need to perform some of the processing early in >>> > the >>> > pipline under Python and worse yet if there is a need to go back and >>> > forth >>> > many times between Python and Java the overhead of coping huge amounts >>> > of >>> > data can significantly slow things down which essentially puts Python at >>> > a >>> > disadvantage to Java. >>> > >>> > If it was possible to change the model of execution such that it was >>> > possible to embed the Python VM in the JVM or vice versa and that the >>> > memory >>> > could be shared between the 2 VMs the downside of using Python in this >>> > environment would be eliminated or at the very least minimized to the >>> > point >>> > where it is no longer an issue. Thus the need for a jffi library. >>> > >>> > There is a strong desire by many to use dynamic languages in these >>> > clustered >>> > environments and Python is likely in the best position to become the >>> > language of choice due to its ability to work with C based libraries and >>> > of >>> > course its syntax. The issues that hold Python back at this point is >>> > the >>> > serialization overhead, not so great state of packaging, and not having >>> > both >>> > the speed of the JIT and complete access to numpy/scipy ecosystem. >>> > >>> > Luckily for Python at this point there is no other dynamic language that >>> > is >>> > a clear winner today. But if too much time passes before these issues >>> > are >>> > solved I'm sure another language will step up to the plate. At this >>> > point >>> > my expectations is that Node could likely make a move. It already has >>> > the >>> > speed due to the Java Script JITs, it already has a great story for >>> > packaging and deployment, and its growth is exploding on the server side >>> > due >>> > to all the money being poured into it. What it strongly lacks today is >>> > the >>> > connection to C/legacy code, numerical/scientific modules and of course >>> > it >>> > also does not have a solution to the data copying overhead it also has >>> > with >>> > the JVM. >>> > >>> > Any way, this is just my 2 cents on what is currently holding Python >>> > back >>> > from taking off in this space. >>> > >>> > On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo >>> > wrote: >>> >> >>> >> >>> >> On Mar 23, 2016 21:49, "Armin Rigo" wrote: >>> >> > >>> >> > Hi John, >>> >> > >>> >> > On 23 March 2016 at 19:16, John Camara >>> >> > wrote: >>> >> > > I would like to suggest one more topic for the workshop. I see a >>> >> > > big >>> >> > > need >>> >> > > for a library (jffi) similar to cffi but that provides a bridge to >>> >> > > Java >>> >> > > instead of C code. The ability to seamlessly work with native Java >>> >> > > data/code >>> >> > > would offer a huge improvement (...) >>> >> > >>> >> > Isn't it what JPype does? Can you describe how it isn't suitable for >>> >> > your needs? 
>>> >> There is also PyJNIus:
>>> >>
>>> >> https://pyjnius.readthedocs.org/en/latest/
>>> >
>>> >
>>> > _______________________________________________
>>> > pypy-dev mailing list
>>> > pypy-dev at python.org
>>> > https://mail.python.org/mailman/listinfo/pypy-dev
>>> >
>>
>
> _______________________________________________
> pypy-dev mailing list
> pypy-dev at python.org
> https://mail.python.org/mailman/listinfo/pypy-dev

From john.m.camara at gmail.com  Thu Mar 24 13:11:31 2016
From: john.m.camara at gmail.com (John Camara)
Date: Thu, 24 Mar 2016 13:11:31 -0400
Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year
In-Reply-To: 
References: 
Message-ID: 

Hi Armin,

At a minimum tighter execution is required as well as sharing memory. But
on the other hand you have raised the bar so high with cffi, having a clean
and unbloated interface, that it would be nice if a library with a similar
spirit existed for Java. Having support in PyPy's JIT to remove all the
marshalling types would be a big plus on top of the shared memory, and some
integration between the 2 GCs would likely be required.

Maybe the best approach would be a combination of existing libraries and a
new interface that allows for sharing of memory. Maybe similar to numpy
arrays with a better API that avoids the pitfalls of numpy relying on
CPython semantics/implementation details. After all the only thing that
needs to be eliminated is the copying/serialization of large data
arrays/structures.

John

On Thu, Mar 24, 2016 at 12:20 PM, Armin Rigo wrote:
> Hi John,
>
> On 24 March 2016 at 13:22, John Camara wrote:
> > (...) Thus the need for a jffi library.
>
> When I hear "a jffi library" I'm thinking about a new library with a
> new API. I think what you would really like instead is to keep the
> existing libraries, but adapt them internally to allow tighter
> execution of the Python and Java VMs.
>
> I may be completely wrong about that, but you're also talking to the
> wrong guys in the first place :-)
>
>
> A bientôt,
>
> Armin.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From john.m.camara at gmail.com  Thu Mar 24 15:24:14 2016
From: john.m.camara at gmail.com (John Camara)
Date: Thu, 24 Mar 2016 15:24:14 -0400
Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year
In-Reply-To: 
References: 
Message-ID: 

It turns out there is some work in progress in the Spark project to share
its memory with non-JVM programs. See
https://issues.apache.org/jira/browse/SPARK-10399. Once this is completed
it should be fairly trivial to expose it to Python and then maybe JIT
integration could be discussed at that time. This is a huge step forward
over sharing Java objects. From the title of the ticket it appears it would
be a C++ interface, but looking at the pull request it looks like it will
be a C interface.

In the end the blocker may just come down to PyPy having complete support
for Numpy. Without Numpy the success of this would be somewhat limited
based on user expectations, and without PyPy it may be too slow for many
applications.
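If that C interface materializes, wiring it up from Python could look
roughly like the cffi sketch below. The function names, library name and
flat float64 layout are invented here purely for illustration (the actual
SPARK-10399 API was not settled at the time); the point is only that the
off-heap block gets mapped into Python rather than copied:

    import numpy as np
    from cffi import FFI

    ffi = FFI()
    # Hypothetical C interface; the real SPARK-10399 one may differ.
    ffi.cdef("""
        void*  spark_block_ptr(const char* block_id);
        size_t spark_block_len(const char* block_id);
    """)
    lib = ffi.dlopen("libspark_offheap.so")   # made-up library name

    ptr = lib.spark_block_ptr(b"rdd_42_0")    # made-up block id
    n = lib.spark_block_len(b"rdd_42_0")

    # ffi.buffer exposes the off-heap bytes to Python without copying, and
    # numpy views them in place, so the JVM and the Python VM share one copy.
    arr = np.frombuffer(ffi.buffer(ptr, n), dtype=np.float64)
    print(arr.mean())

The cffi half of this works the same way on CPython and PyPy; the numpy half
is exactly the support gap mentioned above.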
On Thu, Mar 24, 2016 at 1:11 PM, John Camara wrote:
> Hi Armin,
>
> At a minimum tighter execution is required as well as sharing memory. But
> on the other hand you have raised the bar so high with cffi, having a clean
> and unbloated interface, that it would be nice if a library with a similar
> spirit existed for Java. Having support in PyPy's JIT to remove all the
> marshalling types would be a big plus on top of the shared memory, and some
> integration between the 2 GCs would likely be required.
>
> Maybe the best approach would be a combination of existing libraries and a
> new interface that allows for sharing of memory. Maybe similar to numpy
> arrays with a better API that avoids the pitfalls of numpy relying on
> CPython semantics/implementation details. After all the only thing that
> needs to be eliminated is the copying/serialization of large data
> arrays/structures.
>
> John
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From fijall at gmail.com  Thu Mar 24 15:48:48 2016
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Thu, 24 Mar 2016 21:48:48 +0200
Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year
In-Reply-To: 
References: 
Message-ID: 

Hi David

I'm sorry, it was not supposed to come across as rude. It seems that the
blocker here is full numpy support, which we're working on right now; we
can come back to that discussion once that's ready.

On Thu, Mar 24, 2016 at 6:31 PM, David Edelsohn wrote:
> Maciej,
>
> How about a little more useful response of "we'll help you find the
> right audience for this discussion and collaborate with you to make
> the case."?
>
> - David
>
> On Thu, Mar 24, 2016 at 11:32 AM, Maciej Fijalkowski wrote:
>> Ok fine, but we're not the recipients of such a message.
>>
>> Please lobby PSF for having a JIT, we all support that :-)
>>
>> On Thu, Mar 24, 2016 at 5:23 PM, John Camara wrote:
>>> Hi Fijal,
>>>
>>> I understand where you're coming from and am not trying to convince you
>>> to work on it. Just mainly trying to point out a need that may not be
>>> obvious to this community. I don't spend much time on big data and
>>> analytics so I don't have a lot of time to devote to this task. That
>>> could change in the future so you never know I may end up getting
>>> involved with this.
>>>
>>> At the end of the day I think it is the PSF, which needs to do an honest
>>> assessment of the current state of Python and of programming in general,
>>> so that they can help direct the future of Python. I think with an honest
>>> assessment it should be clear that it is absolutely necessary that a
>>> dynamic language have a JIT. Otherwise, a language like Node would not be
>>> growing so quickly on the server side. An honest assessment would
>>> conclude that Python needs to play a major role in big data and analytics
>>> as we don't want this to be another area where Python misses the boat. As
>>> with all languages other than JavaScript we missed playing an important
>>> role on the web front end. More recently we missed out on mobile. I don't
>>> think it is good for us to miss out on big data. It would be a shame
>>> since we had such a strong scientific community which initially gave us a
>>> huge advantage over other communities. Missing out on big data might also
>>> be the driver that moves the scientific community in a different
>>> direction, which would be a big loss to Python.
>>>
>>> I personally don't see any particular companies or industries that are
>>> willing to fund the tasks needed to solve these issues. It's not to say
>>> there are no more funds for Python projects, it's just likely no one
>>> company will be willing to fund these kinds of projects on their own.
It really >>> needs the PSF to coordinate these efforts but they seamed to be more focus >>> on trying to make Python 3 a success instead of improving the overall health >>> of the community. >>> >>> I believe that Python is in pretty good shape in being able to solve these >>> issues but it just needs some funding and focus to get there. >>> >>> Hopefully the workshop will be successful and help create some focus. >>> >>> John >>> >>> On Thu, Mar 24, 2016 at 8:56 AM, Maciej Fijalkowski >>> wrote: >>>> >>>> Hi John >>>> >>>> Thanks for explaining the current situation of the ecosystem. I'm not >>>> quite sure what your intention is. PyPy (and CPython) is very easy to >>>> embed through any C-level API, especially with the latest additions to >>>> cffi embedding. If someone feels like doing the work to share stuff >>>> that way (as I presume a lot of data presented in JVM can be >>>> represented as some pointer and shape how to access it), then he's >>>> obviously more than free to do so, I'm even willing to help with that. >>>> Now this seems like a medium-to-big size project that additionally >>>> will require quite a bit of community will to endorse. Are you willing >>>> to volunteer to work on such a project and dedicate a lot of time to >>>> it? If not, then there is no way you can convince us to volunteer our >>>> own time to do it - it's just too big and quite a bit far out of our >>>> usual areas of interest. If there is some commercial interest (and I >>>> think there might be) in pushing python and especially pypy further in >>>> that area, we might want to have a better story for numpy first, but >>>> then feel free to send those corporate interest people my way, we can >>>> maybe organize something. If you want us to do community service to >>>> push Python solutions in the area I have very little clue about >>>> however, I would like to politely decline. >>>> >>>> Cheers, >>>> fijal >>>> >>>> On Thu, Mar 24, 2016 at 2:22 PM, John Camara >>>> wrote: >>>> > Besides JPype and PyJNIus there is also https://www.py4j.org/. I >>>> > haven't >>>> > heard of JPype being used in any recent projects so I assuming it is >>>> > outdated by now. PyJNIus gets used but I tend to only see it used on >>>> > Android projects. The Py4J project gets used often in >>>> > numerical/scientific >>>> > projects mainly due to it use in PySpark. The problem with all these >>>> > libraries is that they don't have a way to share large amounts of memory >>>> > between the JVM and Python VMs and so large chunks of data have to be >>>> > copied/serialized when going between the 2 VMs. >>>> > >>>> > Spark is the de facto standard in clustering computing at this point in >>>> > time. At a high level Spark executes code that is distributed >>>> > throughout a >>>> > cluster so that the code being executed is as close as possible to where >>>> > the >>>> > data lives so as to minimize transferring of large amounts of data. The >>>> > code that needs to be executed are packaged up into units called >>>> > Resilient >>>> > Distributed Dataset (RDD). RDDs are lazy evaluated and are essential >>>> > graphs >>>> > of the operations that need to be performed on the data. They are >>>> > capable >>>> > of reading data from many types of sources, outputting to multiple types >>>> > of >>>> > sources, containing the code that needs to be executed, and are also >>>> > responsible to caching or keeping results in memory for future RDDs that >>>> > maybe executed. 
>>>> > >>>> > If you write all your code in Java or Scala, its execution will be >>>> > performed >>>> > in JVMs distributed in the cluster. On the other hand, Spark does not >>>> > limit >>>> > its use to only Java based languages so Python can be used. In the case >>>> > of >>>> > Python the PySpark library is used. When Python is used, the PySpark >>>> > library can be used to define the RDDs that will be executed under the >>>> > JVM. >>>> > In this scenario, only if required, the final results of the >>>> > calculations >>>> > will end up being passed to Python. I say only if necessary as its >>>> > possible >>>> > the end results may just be left in memory or to create an output such >>>> > as an >>>> > hdfs file in hadoop and does not need to be transferred to Python. Under >>>> > this scenario the code is written in Python but effectively all the >>>> > "real" >>>> > work is performed under the JVM. >>>> > >>>> > Often someone writing Python is also going to want to perform some of >>>> > the >>>> > operations under Python. This can be done as the RDDs that are created >>>> > can >>>> > contain both operations that get performed under the JVM as well as >>>> > Python >>>> > (and of course other languages are supported). When Python is involved >>>> > Spark will start up Python VMs on the required nodes so that the Python >>>> > portions of the work can be performed. The Python VMs can either be >>>> > CPython, PyPy or even a mix of both CPython and PyPy. The downside to >>>> > using >>>> > non Java languages is the overhead of passing data between the JVM and >>>> > the >>>> > Python VM as the memory is not shared between the processes but instead >>>> > copied/serialized between them. >>>> > >>>> > Because this data is copied between the 2 VMs, anyone who writes Python >>>> > code >>>> > for this environment always has to be conscious of the data being copied >>>> > between the processes so as to not let the amount of the extra overhead >>>> > become a large burden. Quite often the goal will be to first perform >>>> > the >>>> > bulk of the operations under the JVM and then hopefully only a smaller >>>> > subset of the data will have to be processed under Python. If this can >>>> > be >>>> > done then the overhead can be minimized and then there is essential no >>>> > down >>>> > sides to using Python in the pipeline of operations. >>>> > >>>> > If your unfortunate and need to perform some of the processing early in >>>> > the >>>> > pipline under Python and worse yet if there is a need to go back and >>>> > forth >>>> > many times between Python and Java the overhead of coping huge amounts >>>> > of >>>> > data can significantly slow things down which essentially puts Python at >>>> > a >>>> > disadvantage to Java. >>>> > >>>> > If it was possible to change the model of execution such that it was >>>> > possible to embed the Python VM in the JVM or vice versa and that the >>>> > memory >>>> > could be shared between the 2 VMs the downside of using Python in this >>>> > environment would be eliminated or at the very least minimized to the >>>> > point >>>> > where it is no longer an issue. Thus the need for a jffi library. >>>> > >>>> > There is a strong desire by many to use dynamic languages in these >>>> > clustered >>>> > environments and Python is likely in the best position to become the >>>> > language of choice due to its ability to work with C based libraries and >>>> > of >>>> > course its syntax. 
The issues that hold Python back at this point is >>>> > the >>>> > serialization overhead, not so great state of packaging, and not having >>>> > both >>>> > the speed of the JIT and complete access to numpy/scipy ecosystem. >>>> > >>>> > Luckily for Python at this point there is no other dynamic language that >>>> > is >>>> > a clear winner today. But if too much time passes before these issues >>>> > are >>>> > solved I'm sure another language will step up to the plate. At this >>>> > point >>>> > my expectations is that Node could likely make a move. It already has >>>> > the >>>> > speed due to the Java Script JITs, it already has a great story for >>>> > packaging and deployment, and its growth is exploding on the server side >>>> > due >>>> > to all the money being poured into it. What it strongly lacks today is >>>> > the >>>> > connection to C/legacy code, numerical/scientific modules and of course >>>> > it >>>> > also does not have a solution to the data copying overhead it also has >>>> > with >>>> > the JVM. >>>> > >>>> > Any way, this is just my 2 cents on what is currently holding Python >>>> > back >>>> > from taking off in this space. >>>> > >>>> > On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo >>>> > wrote: >>>> >> >>>> >> >>>> >> On Mar 23, 2016 21:49, "Armin Rigo" wrote: >>>> >> > >>>> >> > Hi John, >>>> >> > >>>> >> > On 23 March 2016 at 19:16, John Camara >>>> >> > wrote: >>>> >> > > I would like to suggest one more topic for the workshop. I see a >>>> >> > > big >>>> >> > > need >>>> >> > > for a library (jffi) similar to cffi but that provides a bridge to >>>> >> > > Java >>>> >> > > instead of C code. The ability to seamlessly work with native Java >>>> >> > > data/code >>>> >> > > would offer a huge improvement (...) >>>> >> > >>>> >> > Isn't it what JPype does? Can you describe how it isn't suitable for >>>> >> > your needs? >>>> >> >>>> >> There is also PyJNIus: >>>> >> >>>> >> https://pyjnius.readthedocs.org/en/latest/ >>>> > >>>> > >>>> > >>>> > _______________________________________________ >>>> > pypy-dev mailing list >>>> > pypy-dev at python.org >>>> > https://mail.python.org/mailman/listinfo/pypy-dev >>>> > >>> >>> >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev