From planrichi at gmail.com Wed Mar 2 08:27:56 2016 From: planrichi at gmail.com (Richard Plangger) Date: Wed, 2 Mar 2016 14:27:56 +0100 Subject: [pypy-dev] GSoC 2016 Message-ID: <56D6EA5C.4050408@gmail.com> Hi, I was wondering who applied as a sub org to python last year? The registration for new sub orgs is open until March 7th. (https://wiki.python.org/moin/SummerOfCode/2016#Sub-orgs) As we discussed on the sprint I will try to attract some students tomorrow at the university in Vienna. Of course I'm also willing to mentor if there is a good proposal. Cheers, Richard -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 819 bytes Desc: OpenPGP digital signature URL: From fijall at gmail.com Wed Mar 2 09:09:00 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Wed, 2 Mar 2016 15:09:00 +0100 Subject: [pypy-dev] GSoC 2016 In-Reply-To: <56D6EA5C.4050408@gmail.com> References: <56D6EA5C.4050408@gmail.com> Message-ID: Hi Richard As discussed on the sprint I applied (but have yet to receive a confirmation) On Wed, Mar 2, 2016 at 2:27 PM, Richard Plangger wrote: > Hi, > > I was wondering who applied as a sub org to python last year? > The registration for new sub orgs is open until March 7th. > (https://wiki.python.org/moin/SummerOfCode/2016#Sub-orgs) > > As we discussed on the sprint I will try to attract some students > tomorrow at the university in Vienna. Of course I'm also willing to > mentor if there is a good proposal. > > Cheers, > Richard > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From edd at theunixzoo.co.uk Thu Mar 3 12:16:10 2016 From: edd at theunixzoo.co.uk (Edd Barrett) Date: Thu, 3 Mar 2016 17:16:10 +0000 Subject: [pypy-dev] CFP: ICOOOLPS'16: Workshop on Implementation, Compilation, Optimization of OO Languages, Programs and Systems Message-ID: <20160303171610.GG14101@wilfred.home> May be of interest to some of the members on this list: Call for Papers: ICOOOLPS?16 ============================ 11th Workshop on Implementation, Compilation, Optimization of OO Languages, Programs and Systems Co-located with ECOOP July 18, 2016, Rome, Italy URL: http://2016.ecoop.org/track/ICOOOLPS-2016 Twitter: @ICOOOLPS The ICOOOLPS workshop series brings together researchers and practitioners working in the field of language implementation and optimization. The goal of the workshop is to discuss emerging problems and research directions as well as new solutions to classic performance challenges. The topics of interest for the workshop include techniques for the implementation and optimization of a wide range of languages including but not limited to object-oriented ones. Furthermore, meta-compilation techniques or language-agnostic approaches are welcome, too. A non-exclusive list of topics follows: - implementation and optimization of fundamental languages features (from automatic memory management to zero-overhead metaprogramming) - runtime systems technology (libraries, virtual machines) - static, adaptive, and speculative optimizations and compiler techniques - meta-compilation techniques and language-agnostic approaches for the efficient implementation of languages - compilers (intermediate representations, offline and online optimizations,...) 
- empirical studies on language usage, benchmark design, and benchmarking methodology - resource-sensitive systems (real-time, low power, mobile, cloud) - studies on design choices and tradeoffs (dynamic vs. static compilation, heuristics vs. programmer input,...) - tooling support, debuggability and observability of languages as well as their implementations ### Workshop Format and Submissions This workshop welcomes the presentation and discussion of new ideas and emerging problems that give a chance for interaction and exchange. More mature work is welcome as part of a mini-conference format, too. We aim to interleave interactive brainstorming and demonstration sessions between the formal presentations to foster an active exchange of ideas. The workshop papers will be published either in the ACM DL or in the Dagstuhl LIPIcs ECOOP Workshop proceedings. Until further notice, please use the ACM SIGPLAN template with a 10pt font size: http://www.sigplan.org/Resources/Author/ - position and work-in-progress paper: 1-4 pages - technical paper: max. 10 pages - demos and posters: 1-page abstract For the submission, please use the HotCRP system: http://ssw.jku.at/icooolps/ ### Important Dates - abstract submission: April 11, 2016 - paper submission: April 15, 2016 - notification: May 13, 2016 - all deadlines: Anywhere on Earth (AoE), i.e., GMT/UTC?12:00 hour - workshop: July 18th, 2016 ### Program Committee Edd Barrett, King?s College London, UK Clement Bera, Inria Lille, France Maxime Chevalier-Boisvert, Universit? de Montr?al, Canada Tim Felgentreff, Hasso Plattner Institute, Germany Roland Ducournau, LIRMM, Universit? de Montpellier, France Elisa Gonzalez Boix, Vrije Universiteit Brussel, Belgium David Gregg, Trinity College Dublin, Ireland Matthias Grimmer, Johannes Kepler University Linz, Austria Michael Haupt, Oracle, Germany Richard Jones, University of Kent, UK Tomas Kalibera, Northeastern University, USA Hidehiko Masuhara, Tokyo Institute of Technology, Japan Tiark Rompf, Purdue University, USA Jennifer B. Sartor, Ghent University, Belgium Sam Tobin-Hochstadt, Indiana University, USA ### Workshop Organizers Stefan Marr, Johannes Kepler University Linz, Austria Eric Jul, University of Oslo, Norway For questions or concerns, please mail to stefan.marr at jku.at or contact us via https://twitter.com/icooolps. -- Best Regards Edd Barrett http://www.theunixzoo.co.uk From piotr.jerzy.jurkiewicz at gmail.com Fri Mar 4 19:48:41 2016 From: piotr.jerzy.jurkiewicz at gmail.com (Piotr Jurkiewicz) Date: Sat, 5 Mar 2016 01:48:41 +0100 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage Message-ID: <56DA2CE9.5070409@gmail.com> Hi PyPy devs, my name is Piotr Jurkiewicz and I am a first-year PhD student at the AGH University of Science and Technology, Krak?w, Poland. I am writing this email to make sure that PyPy is going to participate in GSoC 2016, since I am interested in one of the proposed projects: Optimized Unicode Representation Below is a list of my ideas and plan for the project. (I use Python 2 nomenclature, that is unicode strings are `unicode` objects and bytes strings are `str` objects.) 1. Store all unicode objects contents internally as UTF-8. This would reduce size of stored contents and allow external libraries, which expect UTF-8, to process contents directly in the memory (for example using various regexp libraries to search unicode string). 2. Unify interning caches for str and unicode. 
This would allow unicode objects and corresponding utf8-encoded-str objects to share the same interned buffer. For example, the unicode object u'koń' would share its interned buffer with the str 'ko\xc5\x84'.

This would make unicode.encode('utf-8') basically a no-op. As UTF-8 becomes the dominant encoding for any data exchange, including the web (86%) [1], more and more data coming out of Python scripts needs to be UTF-8 encoded. Therefore, it is important to make this operation as cheap as possible.

It would speed up str.decode('utf-8') significantly too, although it wouldn't make it a no-op: the string would still need to be checked for UTF-8 validity when transforming it into a unicode object. But we can get rid of the additional allocation, of copying the string contents, and of storing it twice, in CONST_STR_CACHE and CONST_UNICODE_CACHE.

3. Indexing of codepoint positions, which would allow O(1) random access and slicing.

The idea is simple: alongside the contents of each interned unicode object, store an array of unsigned integers. These integers are the positions (in bytes), counting from the beginning of the buffer, at which each successive 64-codepoint-long 'page' starts.

Random access would be as follows:

    page_num, byte_in_page = divmod(codepoint_pos, 64)
    page_start_byte = index[page_num]
    exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
    return buffer[exact_byte]

Using 64-codepoint-long pages, as in the example above, would allow O(1) random access, with constant terms of:

- one cache access in the case of ASCII-only texts (indexes for such unicode objects will not be created and maintained)
- three cache accesses in the case of texts consisting of ASCII mixed with two-byte characters (Latin, Greek, Cyrillic, Hebrew, Arabic alphabets)
- four or five cache accesses in the case of texts consisting mostly of three- and four-byte characters

(all of the above assuming 64-byte CPU cache lines)

The memory overhead associated with storing the index array would be in the range 0 - 6.25% (or 0 - 12.5% if unicode objects longer than 2^32 codepoints are allowed), assuming that the index array consists of integers of the smallest type which can store buffer_bytes_len - 1.

4. Fast codepoint counting/seeking with a branchless algorithm [2].

When a unicode object is interned, we are sure that it is a correct UTF-8 string. Therefore, there is no need for correctness checking when seeking, so a branchless algorithm can be used.

[1]: http://w3techs.com/technologies/details/en-utf8/all/all
[2]: http://blogs.perl.org/users/nick_wellnhofer/2015/04/branchless-utf-8-length.html

All of these changes can be introduced one at a time, which would make it easier to track performance changes and to debug eventual errors.

After completing the project I plan to write a paper describing this indexing-based method for speeding up random access to unicode strings, as it has the potential to be used in other language interpreters which have immutable and/or interned unicode strings. Note that a similar index can be created for graphemes as well, so the method can also be used in languages which provide a grapheme-based interface (like Perl 6).

Please share your thoughts about these ideas.
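To make point 3 above concrete, here is a minimal pure-Python sketch of the proposed page index. This is an illustration only, assuming a 64-codepoint page size; build_index and codepoint_at are made-up names for this sketch, not an existing PyPy API, and an RPython version inside the interpreter would look different.

    # Illustrative model of the proposed UTF-8 page index -- not PyPy code.
    PAGE = 64

    def build_index(buf):
        # buf is a UTF-8 encoded byte string; record the byte offset at which
        # every PAGE-th codepoint starts (i.e. the start of each page).
        index = []
        count = 0
        for pos, byte in enumerate(bytearray(buf)):
            if byte & 0xC0 != 0x80:          # a lead byte, not a continuation
                if count % PAGE == 0:
                    index.append(pos)
                count += 1
        return index

    def codepoint_at(buf, index, n):
        # O(1) page lookup followed by a forward walk of at most PAGE - 1
        # codepoints inside the page.
        data = bytearray(buf)
        page, left = divmod(n, PAGE)
        pos = index[page]
        while left:
            pos += 1
            if data[pos] & 0xC0 != 0x80:     # stepped onto the next lead byte
                left -= 1
        end = pos + 1
        while end < len(data) and data[end] & 0xC0 == 0x80:
            end += 1
        return buf[pos:end].decode('utf-8')

    text = u'ko\u0144 abc \u017c\u00f3\u0142w ' * 40
    utf8 = text.encode('utf-8')
    idx = build_index(utf8)
    assert codepoint_at(utf8, idx, 100) == text[100]

The same lead-byte test (byte & 0xC0 != 0x80) is what the branchless counting of point 4 builds on; the per-page walk is the part that could be replaced by packed sub-offsets or SIMD-style summing, as discussed later in this thread.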
Cheers,
Piotr

From arigo at tunes.org Sat Mar 5 03:09:59 2016
From: arigo at tunes.org (Armin Rigo)
Date: Sat, 5 Mar 2016 09:09:59 +0100
Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
In-Reply-To: <56DA2CE9.5070409@gmail.com>
References: <56DA2CE9.5070409@gmail.com>
Message-ID:

Hi Piotr,

Thanks for giving some serious thoughts to the utf8-stored unicode string proposal!

On 5 March 2016 at 01:48, Piotr Jurkiewicz wrote:
> Random access would be as follows:
>
>     page_num, byte_in_page = divmod(codepoint_pos, 64)
>     page_start_byte = index[page_num]
>     exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
>     return buffer[exact_byte]

This is the part I'm least sure about: seek_forward() needs to be a loop over 0 to 63 codepoints. True, each loop can be branchless, and very short---let's say 4 instructions. But it still makes a total of up to 252 instructions (plus the checks to know if we must go on). These instructions are all or almost all dependent on the previous one: you must have finished computing the length of one sequence to even begin computing the length of the next one. Maybe it's faster to use a more "XMM-izable" algorithm which counts 0 for each byte in 0x80-0xBF and 1 otherwise, and makes the sum.

There are also variants, e.g. adding a second array of words similar to 'index', but where each word is 8 packed bytes giving 8 starting points inside the page (each in range 0-252). This would reduce the walk to 0-7 codepoints.

I'm +1 on your proposal. The whole thing is definitely worth a try.

A bientôt,

Armin.

From matti.picus at gmail.com Sat Mar 5 16:17:47 2016
From: matti.picus at gmail.com (Matti Picus)
Date: Sat, 5 Mar 2016 23:17:47 +0200
Subject: [pypy-dev] Release 5.0.0
Message-ID: <56DB4CFB.6020007@gmail.com>

Pre-release bundles are up on the buildbot, http://buildbot.pypy.org/nightly/release-5.x please test them out. There are still a few last touches pending, but it would be nice to have some preliminary indication whether the bundles work in real-life workloads and whether bugs that we claim to have fixed since 4.0.1 actually do not reappear. Also the release notice is up at https://bitbucket.org/pypy/pypy/src/default/pypy/doc/release-5.0.0.rst Any help with it would be appreciated
Matti

From yury at shurup.com Sat Mar 5 16:33:31 2016
From: yury at shurup.com (Yury V. Zaytsev)
Date: Sat, 5 Mar 2016 22:33:31 +0100 (CET)
Subject: [pypy-dev] Release 5.0.0
In-Reply-To: <56DB4CFB.6020007@gmail.com>
References: <56DB4CFB.6020007@gmail.com>
Message-ID:

On Sat, 5 Mar 2016, Matti Picus wrote:
> Pre-release bundles are up on the buildbot,
> http://buildbot.pypy.org/nightly/release-5.x please test them out.

Hi Matti,

So did you figure out the mysterious memory consumption issues that we have experienced while trying to upgrade the Windows builder to a more recent version of PyPy? Do you think it would make sense to retry the upgrade after PyPy 5.0.0 is out?

--
Sincerely yours,
Yury V. Zaytsev

From tinchester at gmail.com Sat Mar 5 17:20:49 2016
From: tinchester at gmail.com (Tin Tvrtković)
Date: Sat, 5 Mar 2016 23:20:49 +0100
Subject: [pypy-dev] Making Pyrasite work with PyPy
Message-ID: <56DB5BC1.90601@gmail.com>

Hello, in case you haven't heard of it, Pyrasite (https://github.com/lmacken/pyrasite) is a tool for injecting code into running Python processes. Personally I have found it invaluable for forensics on services running in production and have successfully solved memory leaks, connection leaks and deadlocks with it.
One of the payloads provided will open a remote REPL right in a running process, without the process having *any* preparation logic in it. I think this is extremely powerful and makes Python catch and up even surpass Java (which has automatic stack trace dumping on SIGQUIT and useful tools like JConsole and VisualVM that can connect to running processes, again by default with no setup in the process) for these kinds of things. Anyway, Pyrasite uses gdb under the hood; gdb will attach to a running process and inject the following: gdb_cmds = [ 'PyGILState_Ensure()', 'PyRun_SimpleString("' 'import sys; sys.path.insert(0, \\"%s\\"); ' 'sys.path.insert(0, \\"%s\\"); ' 'exec(open(\\"%s\\").read())")' % (os.path.dirname(filename), os.path.abspath(os.path.join(os.path.dirname(__file__), '..')), filename), 'PyGILState_Release($1)', ] If I change the Py* functions to PyPy* (PyRun_SimpleString to PyPyRun_SimpleString), this seems to work just fine on PyPy too. This is great, and now I'd like to contribute back to Pyrasite and get PyPy support in there. It'd be great if Pyrasite could automatically detect if the underlying process is CPython or PyPy, so since my experience working on the C level is very basic, I'm asking you, the PyPy devs, if there's a good way of detecting a process is PyPy given its PID and gdb's ability of attaching to a process and doing gdb things. Worst case scenario, gdb supports "info functions", which is how I found the PyPy functions in the first place, but is there a better way? I apologize if this is off-topic for PyPy-dev. From fijall at gmail.com Sun Mar 6 02:03:32 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Sun, 6 Mar 2016 09:03:32 +0200 Subject: [pypy-dev] Making Pyrasite work with PyPy In-Reply-To: <56DB5BC1.90601@gmail.com> References: <56DB5BC1.90601@gmail.com> Message-ID: Hi Tin This is very much on topic for pypy-dev. One obvious solution would be to check for the existance of symbols in gdb (if there is a symbol called PyPyRun_SimpleString, then obviously you're running on PyPy). I'm not sure how to express it under gdb, but there must be a way On Sun, Mar 6, 2016 at 12:20 AM, Tin Tvrtkovi? wrote: > Hello, > > in case you haven't heard of it, Pyrasite > (https://github.com/lmacken/pyrasite) is a tool for injecting code into > running Python processes. Personally I have found it invaluable for > forensics on services running in production and have successfully solved > memory leaks, connection leaks and deadlocks with it. One of the > payloads provided will open a remote REPL right in a running process, > without the process having *any* preparation logic in it. I think this > is extremely powerful and makes Python catch and up even surpass Java > (which has automatic stack trace dumping on SIGQUIT and useful tools > like JConsole and VisualVM that can connect to running processes, again > by default with no setup in the process) for these kinds of things. > > Anyway, Pyrasite uses gdb under the hood; gdb will attach to a running > process and inject the following: > > gdb_cmds = [ > 'PyGILState_Ensure()', > 'PyRun_SimpleString("' > 'import sys; sys.path.insert(0, \\"%s\\"); ' > 'sys.path.insert(0, \\"%s\\"); ' > 'exec(open(\\"%s\\").read())")' % > (os.path.dirname(filename), > os.path.abspath(os.path.join(os.path.dirname(__file__), > '..')), > filename), > 'PyGILState_Release($1)', > ] > > If I change the Py* functions to PyPy* (PyRun_SimpleString to > PyPyRun_SimpleString), this seems to work just fine on PyPy too. 
> > This is great, and now I'd like to contribute back to Pyrasite and get > PyPy support in there. It'd be great if Pyrasite could automatically > detect if the underlying process is CPython or PyPy, so since my > experience working on the C level is very basic, I'm asking you, the > PyPy devs, if there's a good way of detecting a process is PyPy given > its PID and gdb's ability of attaching to a process and doing gdb > things. Worst case scenario, gdb supports "info functions", which is how > I found the PyPy functions in the first place, but is there a better way? > > I apologize if this is off-topic for PyPy-dev. > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From yury at shurup.com Sun Mar 6 05:04:10 2016 From: yury at shurup.com (Yury V. Zaytsev) Date: Sun, 6 Mar 2016 11:04:10 +0100 (CET) Subject: [pypy-dev] Release 5.0.0 In-Reply-To: References: <56DB4CFB.6020007@gmail.com> Message-ID: On Sat, 5 Mar 2016, Yury V. Zaytsev wrote: > On Sat, 5 Mar 2016, Matti Picus wrote: > > So did you figure out the mysterious memory consumption issues that we have > experienced while trying to upgrade the Windows builder to a more recent > version of PyPy? Do you think it would make sense to retry the upgrade after > PyPy 5.0.0 is out? So, it looks like with PyPy 5.0.0 the problem is exactly the same as with the previous version. The translation goes through (and possibily faster / uses less memory, I didn't check), but the compilation bails out with a `MemoryError` at `buffer.append(fh.read())`: http://buildbot.pypy.org/builders/pypy-c-jit-win-x86-32/builds/2266/steps/translate/logs/stdio That's definitively not my fault, I've done my `editbin /largeaddressaware` dance and confirmed its effects with `dumpbin /headers`. In the mean time, I rolled back to PyPy 2.5.1 on the build slave. Oh wait, I meant to say build follower. Sorry about this. -- Sincerely yours, Yury V. Zaytsev From matti.picus at gmail.com Sun Mar 6 16:01:30 2016 From: matti.picus at gmail.com (Matti Picus) Date: Sun, 6 Mar 2016 23:01:30 +0200 Subject: [pypy-dev] Release 5.0.0 In-Reply-To: References: <56DB4CFB.6020007@gmail.com> Message-ID: <56DC9AAA.1080003@gmail.com> An HTML attachment was scrubbed... URL: From fijall at gmail.com Sun Mar 6 16:18:23 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Sun, 6 Mar 2016 23:18:23 +0200 Subject: [pypy-dev] Release 5.0.0 In-Reply-To: <56DC9AAA.1080003@gmail.com> References: <56DB4CFB.6020007@gmail.com> <56DC9AAA.1080003@gmail.com> Message-ID: It uses subprocess, but you need to quit pypy (so run this with --source and then make separately) for memory to be reclaimed On Sun, Mar 6, 2016 at 11:01 PM, Matti Picus wrote: > > > On 06/03/16 12:04, Yury V. Zaytsev wrote: > > On Sat, 5 Mar 2016, Yury V. Zaytsev wrote: > > So, it looks like with PyPy 5.0.0 the problem is exactly the same as with > the previous version. The translation goes through (and possibily faster / > uses less memory, I didn't check), but the compilation bails out with a > `MemoryError` at `buffer.append(fh.read())`: > > http://buildbot.pypy.org/builders/pypy-c-jit-win-x86-32/builds/2266/steps/translate/logs/stdio > > In the mean time, I rolled back to PyPy 2.5.1 on the build slave. Oh wait, I > meant to say build follower. Sorry about this. > > I watched the compile part of translation in a system monitor on a local VM. 
> Using the pypy 5.0 release, during compilation there is a single pypy.exe
> process requiring about 2.8GB of memory. At some point, toward the end of
> compiling the 1000+ source files (perhaps during link?) memory consumption
> jumps way up, trying to access at least another GB of memory, at which point
> the virtual machine complains and the pypy.exe crashes. Any ideas? I thought
> the compile step uses multiprocessing to run in a separate process, but it
> seems I am wrong.
> Matti
>
> _______________________________________________
> pypy-dev mailing list
> pypy-dev at python.org
> https://mail.python.org/mailman/listinfo/pypy-dev
>

From hubo at jiedaibao.com Mon Mar 7 02:58:14 2016
From: hubo at jiedaibao.com (hubo)
Date: Mon, 07 Mar 2016 15:58:14 +0800
Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
In-Reply-To:
References: <56DA2CE9.5070409@gmail.com>
Message-ID: <56DD3493.8020800@jiedaibao.com>

I think it is not reasonable to use UTF-8 to represent the unicode string type.

1. Less storage - this is not always true. It is only true for strings with a lot of ASCII characters. In Asia, most strings in local languages (Japanese, Chinese, Korean) consist of non-ASCII characters, so they may consume more storage than in UTF-16. To make things worse, while it always consumes 2*N bytes for an N-character string in UTF-16, it is difficult to estimate the size of an N-character string in UTF-8 (it may be anywhere from N bytes to 3 * N bytes). (UTF-16 also has two-word characters, but len() reports 2 for these characters; I think it is not harmful to treat them as two characters.)

2. There would be very complicated logic for size calculation and slicing. For UTF-16, every character is represented with a 16-bit integer, so it is convenient for size calculation and slicing. But a character in UTF-8 occupies a variable number of bytes, so either we call mb_* string functions instead (which are slow by nature) or we use special logic like storing the indices of characters in another array (which introduces the cost of extra addressing).

3. When displaying with repr(), non-ASCII characters are displayed in \uXXXX format. If the internal storage for unicode is UTF-8, the only way to be compatible with this format is to convert it back to UTF-16.

It may be wiser to let programmers decide which encoding they would like to use. If they want to process UTF-8 strings without paying a conversion cost, they should use "bytes". When correct size calculation and slicing of non-ASCII characters are a concern, it may be better to use "unicode".

2016-03-07
hubo
From: Armin Rigo
Sent: 2016-03-05 16:09
Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
To: "Piotr Jurkiewicz"
Cc: "PyPy Developer Mailing List"

Hi Piotr,

Thanks for giving some serious thoughts to the utf8-stored unicode string proposal!

On 5 March 2016 at 01:48, Piotr Jurkiewicz wrote:
> Random access would be as follows:
>
>     page_num, byte_in_page = divmod(codepoint_pos, 64)
>     page_start_byte = index[page_num]
>     exact_byte = seek_forward(buffer[page_start_byte], byte_in_page)
>     return buffer[exact_byte]

This is the part I'm least sure about: seek_forward() needs to be a loop over 0 to 63 codepoints. True, each loop can be branchless, and very short---let's say 4 instructions. But it still makes a total of up to 252 instructions (plus the checks to know if we must go on).
These instructions are all or almost all dependent on the previous one: you must have finished computing the length of one sequence to even being computing the length of the next one. Maybe it's faster to use a more "XMM-izable" algorithm which counts 0 for each byte in 0x80-0xBF and 1 otherwise, and makes the sum. There are also variants, e.g. adding a second array of words similar to 'index', but where each word is 8 packed bytes giving 8 starting points inside the page (each in range 0-252). This would reduce the walk to 0-7 codepoints. I'm +1 on your proposal. The whole thing is definitely worth a try. A bient?t, Armin. _______________________________________________ pypy-dev mailing list pypy-dev at python.org https://mail.python.org/mailman/listinfo/pypy-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Mon Mar 7 03:46:23 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 7 Mar 2016 10:46:23 +0200 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: <56DD3493.8020800@jiedaibao.com> References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> Message-ID: Hi hubo. I think you're slightly confusing two things. UTF-16 is a variable-length encoding that has two-word characters that *has to* return "1" for len() of those. UCS-2 seems closer to what you described (which is a fixed-width encoding), but can't encode all the unicode characters and as such is unsuitable for a modern unicode representation. I'll discard UCS-2 as unsuitable and were we to use UTF-16, then the slicing and size calculations still has to be as complicated as for UTF-8. Complicated logic in repr() - those are not usually performance critical parts of your program and it's ok to have some complications there. It's true that UTF-16 can be less efficient than UTF-8 for certain languages, however both are more memory efficient than what we currently use (UCS4). There are however some problems - even if you work exclusively in, say, korean, for example web servers still have to deal with some parts that are ascii (html markup, css etc.) while handling text in korean. In those cases UTF8 vs UTF16 is more muddled and the exact details depend a lot. We also need to consider the fact that we ship one canonical PyPy to everybody - people using different languages and different encodings. Overall, UTF8 seems like definitely a better alternative than UCS4 (also for asian languages), which is what we are using now and I would be inclined to leave UTF16 as an option to see if it performs better for certain benchmarks. Best regards, Maciej Fijalkowski On Mon, Mar 7, 2016 at 9:58 AM, hubo wrote: > I think it is not reasonable to use UTF-8 to represent the unicode string > type. > > > 1. Less storage - this is not always true. It is only true for strings with > a lot of ASCII characters. In Asia, most strings in local languages > (Japanese, Chinese, Korean) are non-ASCII characters, they may consume more > storage than in UTF-16. To make things worse, while it always consumes 2*N > bytes for a N-characters string in UTF-16, it is difficult to estimate the > size of a N-characters string in UTF-8 (may be N bytes to 3 * N bytes) > (UTF-16 also has two-word characters, but len() reports 2 for these > characters, I think it is not harmful to treat them as two characters) > > 2. There would be very complicate logics for size calculating and slicing. 
> For UTF-16, every character is represented with a 16-bit integer, so it is > convient for size calculating and slicing. But character in UTF-8 consumes > variant bytes, so either we call mb_* string functions instead (which is > slow in nature) or we use special logic like storing indices of characters > in another array (which introduces cost for extra addressings). > > 3. When displaying with repr(), non-ASCII characters are displayed with > \uXXXX format. If the internal storage for unicode is UTF-8, the only way to > be compatible with this format is to convert it back to UTF-16. > > It may be wiser to let programmers deside which encoding they would like to > use. If they want to process UTF-8 strings without performance cost on > converting, they should use "bytes". When correct size calculating and > slicing of non-ASCII characters are concerned it may be better to use > "unicode". > > 2016-03-07 > ________________________________ > hubo > ________________________________ > > ????Armin Rigo > ?????2016-03-05 16:09 > ???Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage > ????"Piotr Jurkiewicz" > ???"PyPy Developer Mailing List" > > Hi Piotr, > > Thanks for giving some serious thoughts to the utf8-stored unicode > string proposal! > > On 5 March 2016 at 01:48, Piotr Jurkiewicz > wrote: >> Random access would be as follows: >> >> page_num, byte_in_page = divmod(codepoint_pos, 64) >> page_start_byte = index[page_num] >> exact_byte = seek_forward(buffer[page_start_byte], byte_in_page) >> return buffer[exact_byte] > > This is the part I'm least sure about: seek_forward() needs to be a > loop over 0 to 63 codepoints. True, each loop can be branchless, and > very short---let's say 4 instructions. But it still makes a total of > up to 252 instructions (plus the checks to know if we must go on). > These instructions are all or almost all dependent on the previous > one: you must have finished computing the length of one sequence to > even being computing the length of the next one. Maybe it's faster to > use a more "XMM-izable" algorithm which counts 0 for each byte in > 0x80-0xBF and 1 otherwise, and makes the sum. > > There are also variants, e.g. adding a second array of words similar > to 'index', but where each word is 8 packed bytes giving 8 starting > points inside the page (each in range 0-252). This would reduce the > walk to 0-7 codepoints. > > I'm +1 on your proposal. The whole thing is definitely worth a try. > > > A bient?t, > > Armin. > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From cfbolz at gmx.de Mon Mar 7 03:48:53 2016 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Mon, 7 Mar 2016 09:48:53 +0100 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: <56DD3493.8020800@jiedaibao.com> References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> Message-ID: <56DD4075.6090201@gmx.de> Hi, On 07/03/16 08:58, hubo wrote: > I think it is not reasonable to use UTF-8 to represent the unicode > string type. > 1. Less storage - this is not always true. It is only true for strings > with a lot of ASCII characters. 
In Asia, most strings in local languages > (Japanese, Chinese, Korean) are non-ASCII characters, they may consume > more storage than in UTF-16. To make things worse, while it always > consumes 2*N bytes for a N-characters string in UTF-16, it is difficult > to estimate the size of a N-characters string in UTF-8 (may be N bytes > to 3 * N bytes) > (UTF-16 also has two-word characters, but len() reports 2 for these > characters, I think it is not harmful to treat them as two characters) Note that in PyPy unicode strings use UTF-32 as the internal representation for all platforms, so the space saving would be larger. Note also that currently almost all I/O operations on many platforms do a conversion from UTF-8 to UTF-32 and back, which involves a copy and is costly. > 2. There would be very complicate logics for size calculating and > slicing. For UTF-16, every character is represented with a 16-bit > integer, so it is convient for size calculating and slicing. But > character in UTF-8 consumes variant bytes, so either we call mb_* string > functions instead (which is slow in nature) or we use special logic like > storing indices of characters in another array (which introduces cost > for extra addressings). This is true, some engineering would have to go into this part of the representation. > 3. When displaying with repr(), non-ASCII characters are displayed with > \uXXXX format. If the internal storage for unicode is UTF-8, the only > way to be compatible with this format is to convert it back to UTF-16. > It may be wiser to let programmers deside which encoding they would like > to use. If they want to process UTF-8 strings without performance cost > on converting, they should use "bytes". When correct size calculating > and slicing of non-ASCII characters are concerned it may be better to > use "unicode". I think repr is allowed to be a somewhat slow operation. Cheers, Carl Friedrich From yury at shurup.com Mon Mar 7 03:55:50 2016 From: yury at shurup.com (Yury V. Zaytsev) Date: Mon, 7 Mar 2016 09:55:50 +0100 (CET) Subject: [pypy-dev] Release 5.0.0 In-Reply-To: References: <56DB4CFB.6020007@gmail.com> <56DC9AAA.1080003@gmail.com> Message-ID: On Sun, 6 Mar 2016, Maciej Fijalkowski wrote: > It uses subprocess, but you need to quit pypy (so run this with --source > and then make separately) for memory to be reclaimed Do you think that pre-forking a process for compilation right at the beginning of the translation when PyPy hasn't consumed much memory yet would be a viable solution? I think if this is practical, it would be a much user friendlier solution as compared to two-step process (translation + compilation). If memory serves me well, this is one of the strategies that subprocess in Python 3 is using to improve on memory consumption. -- Sincerely yours, Yury V. Zaytsev From fijall at gmail.com Mon Mar 7 04:16:42 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 7 Mar 2016 11:16:42 +0200 Subject: [pypy-dev] Release 5.0.0 In-Reply-To: References: <56DB4CFB.6020007@gmail.com> <56DC9AAA.1080003@gmail.com> Message-ID: I have no idea how memory management works on windows (I doubt this will solve it), but this is how we do that on linux On Mon, Mar 7, 2016 at 10:55 AM, Yury V. 
Zaytsev wrote:
> On Sun, 6 Mar 2016, Maciej Fijalkowski wrote:
>
>> It uses subprocess, but you need to quit pypy (so run this with --source
>> and then make separately) for memory to be reclaimed
>
> Do you think that pre-forking a process for compilation right at the
> beginning of the translation when PyPy hasn't consumed much memory yet would
> be a viable solution?
>
> I think if this is practical, it would be a much user friendlier solution as
> compared to two-step process (translation + compilation). If memory serves
> me well, this is one of the strategies that subprocess in Python 3 is using
> to improve on memory consumption.
>
> --
> Sincerely yours,
> Yury V. Zaytsev

From hubo at jiedaibao.com Mon Mar 7 04:21:17 2016
From: hubo at jiedaibao.com (hubo)
Date: Mon, 07 Mar 2016 17:21:17 +0800
Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
In-Reply-To:
References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com>
Message-ID: <56DD480B.2070709@jiedaibao.com>

Yes, there are two-word characters in UTF-16, as I mentioned. But len() in CPython returns 2 for these characters (even if they are correctly processed in repr()):

>>> len(u'\ud805\udc09')
2
>>> u'\ud805\udc09'
u'\U00011409'

(Python 3.x seems to have removed the display processing)

Maybe it is better to be compatible with CPython in these situations. Since two-word characters are really rare in Unicode strings, programmers may not know of their existence and may allocate exactly 2 * len(s) bytes for storing a unicode string. It will crash the program or create security problems if len() returns 1 for these characters, even if that is the correct result according to the Unicode standard.

UTF-8 might be very useful in XML or Web processing, which is quite important in Python programming nowadays. But I think it is more important to let programmers "understand" the mechanism. In C/C++, it is quite common to use char[] for ASCII (or ANSI) characters and wchar_t for unicode (actually UTF-16, or UCS-2) characters, so it may be surprising if unicode is actually "UTF-8" in PyPy. Web programmers who use CPython may already be familiar with the differences between bytes (or str in Python2) and unicode (or str in Python3); it is less likely for them to design their programs around implementation details specific to PyPy.

2016-03-07
hubo
From: Maciej Fijalkowski
Sent: 2016-03-07 16:46
Subject: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage
To: "hubo"
Cc: "Armin Rigo","Piotr Jurkiewicz","PyPy Developer Mailing List"

Hi hubo.

I think you're slightly confusing two things.

UTF-16 is a variable-length encoding that has two-word characters that *has to* return "1" for len() of those. UCS-2 seems closer to what you described (which is a fixed-width encoding), but can't encode all the unicode characters and as such is unsuitable for a modern unicode representation.

I'll discard UCS-2 as unsuitable and were we to use UTF-16, then the slicing and size calculations still has to be as complicated as for UTF-8.

Complicated logic in repr() - those are not usually performance critical parts of your program and it's ok to have some complications there.

It's true that UTF-16 can be less efficient than UTF-8 for certain languages, however both are more memory efficient than what we currently use (UCS4). There are however some problems - even if you work exclusively in, say, korean, for example web servers still have to deal with some parts that are ascii (html markup, css etc.) while handling text in korean.
In those cases UTF8 vs UTF16 is more muddled and the exact details depend a lot. We also need to consider the fact that we ship one canonical PyPy to everybody - people using different languages and different encodings. Overall, UTF8 seems like definitely a better alternative than UCS4 (also for asian languages), which is what we are using now and I would be inclined to leave UTF16 as an option to see if it performs better for certain benchmarks. Best regards, Maciej Fijalkowski On Mon, Mar 7, 2016 at 9:58 AM, hubo wrote: > I think it is not reasonable to use UTF-8 to represent the unicode string > type. > > > 1. Less storage - this is not always true. It is only true for strings with > a lot of ASCII characters. In Asia, most strings in local languages > (Japanese, Chinese, Korean) are non-ASCII characters, they may consume more > storage than in UTF-16. To make things worse, while it always consumes 2*N > bytes for a N-characters string in UTF-16, it is difficult to estimate the > size of a N-characters string in UTF-8 (may be N bytes to 3 * N bytes) > (UTF-16 also has two-word characters, but len() reports 2 for these > characters, I think it is not harmful to treat them as two characters) > > 2. There would be very complicate logics for size calculating and slicing. > For UTF-16, every character is represented with a 16-bit integer, so it is > convient for size calculating and slicing. But character in UTF-8 consumes > variant bytes, so either we call mb_* string functions instead (which is > slow in nature) or we use special logic like storing indices of characters > in another array (which introduces cost for extra addressings). > > 3. When displaying with repr(), non-ASCII characters are displayed with > \uXXXX format. If the internal storage for unicode is UTF-8, the only way to > be compatible with this format is to convert it back to UTF-16. > > It may be wiser to let programmers deside which encoding they would like to > use. If they want to process UTF-8 strings without performance cost on > converting, they should use "bytes". When correct size calculating and > slicing of non-ASCII characters are concerned it may be better to use > "unicode". > > 2016-03-07 > ________________________________ > hubo > ________________________________ > > ????Armin Rigo > ?????2016-03-05 16:09 > ???Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage > ????"Piotr Jurkiewicz" > ???"PyPy Developer Mailing List" > > Hi Piotr, > > Thanks for giving some serious thoughts to the utf8-stored unicode > string proposal! > > On 5 March 2016 at 01:48, Piotr Jurkiewicz > wrote: >> Random access would be as follows: >> >> page_num, byte_in_page = divmod(codepoint_pos, 64) >> page_start_byte = index[page_num] >> exact_byte = seek_forward(buffer[page_start_byte], byte_in_page) >> return buffer[exact_byte] > > This is the part I'm least sure about: seek_forward() needs to be a > loop over 0 to 63 codepoints. True, each loop can be branchless, and > very short---let's say 4 instructions. But it still makes a total of > up to 252 instructions (plus the checks to know if we must go on). > These instructions are all or almost all dependent on the previous > one: you must have finished computing the length of one sequence to > even being computing the length of the next one. Maybe it's faster to > use a more "XMM-izable" algorithm which counts 0 for each byte in > 0x80-0xBF and 1 otherwise, and makes the sum. > > There are also variants, e.g. 
adding a second array of words similar > to 'index', but where each word is 8 packed bytes giving 8 starting > points inside the page (each in range 0-252). This would reduce the > walk to 0-7 codepoints. > > I'm +1 on your proposal. The whole thing is definitely worth a try. > > > A bient?t, > > Armin. > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Mon Mar 7 04:31:10 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 7 Mar 2016 11:31:10 +0200 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: <56DD480B.2070709@jiedaibao.com> References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> <56DD480B.2070709@jiedaibao.com> Message-ID: I think you're misunderstanding what we're proposing. We're proposing utf8 representation completely hidden from the user, where everything behaves just like cpython unicode (the len() example you're showing is a narrow unicode build I presume?) On Mon, Mar 7, 2016 at 11:21 AM, hubo wrote: > Yes, there are two-words characters in UTF-16, as I mentioned. But len() in > CPython returns 2 for these characters (even if they are correctly processed > in repr()): > >>>> len(u'\ud805\udc09') > 2 >>>> u'\ud805\udc09' > u'\U00011409' > > (Python 3.x seems to have removed the display processing) > > Maybe it is better to be compatible with CPython in these situations. Since > two-words characters are really rare in Unicode strings, programmers may not > know their existence and allocate exactly 2 * len(s) bytes for storing an > unicode string. It will crash the program or create security problems if > len() return 1 for these characters even if it is the correct result > according to Unicode standard. > > UTF-8 might be very useful in XML or Web processing, which is quite > important in Python programming nowadays. But I think it is more important > to let programmers "understand" the machanism. In C/C++, it is quite common > to use char[] for ASCII (or ANSI) characters and wchar_t for unicode > (actually UTF-16, or UCS-2) characters, so it may be suprising if unicode is > actually "UTF-8" in PyPy. Web programmers who uses CPython may already be > familiar with the differences between bytes (or str in Python2) and unicode > (or str in Python3), it is less likely for them to design their programs > based on special implementations of PyPy. > > 2016-03-07 > ________________________________ > hubo > ________________________________ > > ????Maciej Fijalkowski > ?????2016-03-07 16:46 > ???Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage > ????"hubo" > ???"Armin Rigo","Piotr > Jurkiewicz","PyPy Developer Mailing > List" > > Hi hubo. > > I think you're slightly confusing two things. > > UTF-16 is a variable-length encoding that has two-word characters that > *has to* return "1" for len() of those. UCS-2 seems closer to what you > described (which is a fixed-width encoding), but can't encode all the > unicode characters and as such is unsuitable for a modern unicode > representation. > > I'll discard UCS-2 as unsuitable and were we to use UTF-16, then the > slicing and size calculations still has to be as complicated as for > UTF-8. 
> > Complicated logic in repr() - those are not usually performance > critical parts of your program and it's ok to have some complications > there. > > It's true that UTF-16 can be less efficient than UTF-8 for certain > languages, however both are more memory efficient than what we > currently use (UCS4). There are however some problems - even if you > work exclusively in, say, korean, for example web servers still have > to deal with some parts that are ascii (html markup, css etc.) while > handling text in korean. In those cases UTF8 vs UTF16 is more muddled > and the exact details depend a lot. We also need to consider the fact > that we ship one canonical PyPy to everybody - people using different > languages and different encodings. > > Overall, UTF8 seems like definitely a better alternative than UCS4 > (also for asian languages), which is what we are using now and I would > be inclined to leave UTF16 as an option to see if it performs better > for certain benchmarks. > > Best regards, > Maciej Fijalkowski > > On Mon, Mar 7, 2016 at 9:58 AM, hubo wrote: >> I think it is not reasonable to use UTF-8 to represent the unicode string >> type. >> >> >> 1. Less storage - this is not always true. It is only true for strings >> with >> a lot of ASCII characters. In Asia, most strings in local languages >> (Japanese, Chinese, Korean) are non-ASCII characters, they may consume >> more >> storage than in UTF-16. To make things worse, while it always consumes 2*N >> bytes for a N-characters string in UTF-16, it is difficult to estimate the >> size of a N-characters string in UTF-8 (may be N bytes to 3 * N bytes) >> (UTF-16 also has two-word characters, but len() reports 2 for these >> characters, I think it is not harmful to treat them as two characters) >> >> 2. There would be very complicate logics for size calculating and slicing. >> For UTF-16, every character is represented with a 16-bit integer, so it is >> convient for size calculating and slicing. But character in UTF-8 consumes >> variant bytes, so either we call mb_* string functions instead (which is >> slow in nature) or we use special logic like storing indices of characters >> in another array (which introduces cost for extra addressings). >> >> 3. When displaying with repr(), non-ASCII characters are displayed with >> \uXXXX format. If the internal storage for unicode is UTF-8, the only way >> to >> be compatible with this format is to convert it back to UTF-16. >> >> It may be wiser to let programmers deside which encoding they would like >> to >> use. If they want to process UTF-8 strings without performance cost on >> converting, they should use "bytes". When correct size calculating and >> slicing of non-ASCII characters are concerned it may be better to use >> "unicode". >> >> 2016-03-07 >> ________________________________ >> hubo >> ________________________________ >> >> ????Armin Rigo >> ?????2016-03-05 16:09 >> ???Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage >> ????"Piotr Jurkiewicz" >> ???"PyPy Developer Mailing List" >> >> Hi Piotr, >> >> Thanks for giving some serious thoughts to the utf8-stored unicode >> string proposal! 
>> >> On 5 March 2016 at 01:48, Piotr Jurkiewicz >> wrote: >>> Random access would be as follows: >>> >>> page_num, byte_in_page = divmod(codepoint_pos, 64) >>> page_start_byte = index[page_num] >>> exact_byte = seek_forward(buffer[page_start_byte], byte_in_page) >>> return buffer[exact_byte] >> >> This is the part I'm least sure about: seek_forward() needs to be a >> loop over 0 to 63 codepoints. True, each loop can be branchless, and >> very short---let's say 4 instructions. But it still makes a total of >> up to 252 instructions (plus the checks to know if we must go on). >> These instructions are all or almost all dependent on the previous >> one: you must have finished computing the length of one sequence to >> even being computing the length of the next one. Maybe it's faster to >> use a more "XMM-izable" algorithm which counts 0 for each byte in >> 0x80-0xBF and 1 otherwise, and makes the sum. >> >> There are also variants, e.g. adding a second array of words similar >> to 'index', but where each word is 8 packed bytes giving 8 starting >> points inside the page (each in range 0-252). This would reduce the >> walk to 0-7 codepoints. >> >> I'm +1 on your proposal. The whole thing is definitely worth a try. >> >> >> A bient?t, >> >> Armin. >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev >> >> >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev >> From fijall at gmail.com Mon Mar 7 04:33:19 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 7 Mar 2016 11:33:19 +0200 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: <56DA2CE9.5070409@gmail.com> References: <56DA2CE9.5070409@gmail.com> Message-ID: Hi Piotr. Any chance to have a chat with you about the proposal on a more real-time communication medium like IRC or GChat? (it's #pypy on IRC and use my mail for gchat) On Sat, Mar 5, 2016 at 2:48 AM, Piotr Jurkiewicz wrote: > Hi PyPy devs, > > my name is Piotr Jurkiewicz and I am a first-year PhD student at > the AGH University of Science and Technology, Krak?w, Poland. > > I am writing this email to make sure that PyPy is going to > participate in GSoC 2016, since I am interested in one of the > proposed projects: Optimized Unicode Representation > > Below is a list of my ideas and plan for the project. > > (I use Python 2 nomenclature, that is unicode strings are > `unicode` objects and bytes strings are `str` objects.) > > 1. Store all unicode objects contents internally as UTF-8. > > This would reduce size of stored contents and allow external > libraries, which expect UTF-8, to process contents directly in the > memory (for example using various regexp libraries to search unicode > string). > > 2. Unify interning caches for str and unicode. > > This would allow unicode objects and corresponding > utf8-encoded-str objects to share the same interned buffer. > > For example unicode object u'ko?' would share interned buffer > with str 'ko\xc5\x84'. > > This would make unicode.encode('utf-8') basically no op. As UTF-8 > becomes dominant encoding for any data exchange, including web (86%) > [1], more and more data coming out from Python scripts needs to be > UTF-8 encoded. Therefore, it is important to make this operation as > cheap as possible. 
> > It would speed up str.decode('utf-8') significantly too, although it > wouldn't make it no op. String still would need to be checked if it > is a correct UTF-8 string when transforming to unicode object. But > we can get rid of additional allocation, copying string contents and > storing it twice, in CONST_STR_CACHE and CONST_UNICODE_CACHE. > > 3. Indexing of codepoints positions, what would allow O(1) random > access and slicing. > > The idea is simple: alongside contents of each interned unicode > object, store an array of unsigned integers. These integers will > be positions (in bytes), counting from the beginning of the buffer, > at which each next 64-codepoint-long 'pages' start. > > Random access would be as follows: > > page_num, byte_in_page = divmod(codepoint_pos, 64) > page_start_byte = index[page_num] > exact_byte = seek_forward(buffer[page_start_byte], byte_in_page) > return buffer[exact_byte] > > Using 64-byte long pages, like in the example above, would allow > O(1) random access, with constant terms of: > > - one cache access in cases of only-ASCII texts (indexes for such > unicode objects will not be created and maintained) > - three cache accesses in cases of texts consisting of ASCII mixed > with two-byte characters (Latin, Greek, Cyrillic, Hebrew, Arabic > alphabets) > - four or five cache accesses in cases of texts consisting mostly of > three- and four- byte characters > > (all above assuming 64-byte long CPU cache lines) > > Memory overhead associated with storing index array would be in > range 0 - 6.25%. (or 0 - 12.5% if unicode objects longer than 2^32 > codepoints will be allowed) > > (assuming that the index array consists of integers of smallest > possible type which can store buffer_bytes_len - 1) > > 4. Fast codepoints counting/seeking with branchless algorithm [2]. > > When unicode object is interned, we are sure that it is a correct > UTF-8 string. Therefore, there is no need for correctness checking > when seeking, so a branchless algorithm can be used. > > [1]: http://w3techs.com/technologies/details/en-utf8/all/all > [2]: > http://blogs.perl.org/users/nick_wellnhofer/2015/04/branchless-utf-8-length.html > > All of these changes can be introduced one at a time, what would > improve tracking of performance changes and debugging of eventual > errors. > > After completing the project I plan to write a paper describing > speedup method of random access unicode access based on indexing, as > this method has a potential for being used in other language > interpreters which have immutable and/or interned unicode strings. > Note that similar index can be created for graphemes as well, so > this method can be used in languages which provide grapheme-based > interface (like Perl 6). > > Please share your thoughts about these ideas. 
> > Cheers, > Piotr > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From hubo at jiedaibao.com Mon Mar 7 04:45:51 2016 From: hubo at jiedaibao.com (hubo) Date: Mon, 07 Mar 2016 17:45:51 +0800 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> <56DD480B.2070709@jiedaibao.com> Message-ID: <56DD4DCB.3070407@jiedaibao.com> Yes, it seems CPython 2.7 in Windows uses UTF-16, so: >>> '\ud805\udc09' '\\ud805\\udc09' >>> u'\ud805\udc09' u'\U00011409' >>> u'\ud805\udc09' == u'\U00011409' True >>> len(u'\U00011409') 2 In Linux CPython 2.7: >>> u'\U00011409' u'\U00011409' >>> len(u'\U00011409') 1 >>> u'\ud805\udc09' u'\ud805\udc09' >>> len(u'\ud805\udc09') 2 >>> u'\ud805\udc09' == u'\U00011409' False >>> u'\ud805\udc09'.encode('utf-8') '\xf0\x91\x90\x89' >>> u'\U00011409'.encode('utf-8') '\xf0\x91\x90\x89' >>> u'\ud805\udc09'.encode('utf-8') == u'\U00011409'.encode('utf-8') True 2016-03-07 hubo ????Maciej Fijalkowski ?????2016-03-07 17:31 ???Re: Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage ????"hubo" ???"Armin Rigo","Piotr Jurkiewicz","PyPy Developer Mailing List" I think you're misunderstanding what we're proposing. We're proposing utf8 representation completely hidden from the user, where everything behaves just like cpython unicode (the len() example you're showing is a narrow unicode build I presume?) On Mon, Mar 7, 2016 at 11:21 AM, hubo wrote: > Yes, there are two-words characters in UTF-16, as I mentioned. But len() in > CPython returns 2 for these characters (even if they are correctly processed > in repr()): > >>>> len(u'\ud805\udc09') > 2 >>>> u'\ud805\udc09' > u'\U00011409' > > (Python 3.x seems to have removed the display processing) > > Maybe it is better to be compatible with CPython in these situations. Since > two-words characters are really rare in Unicode strings, programmers may not > know their existence and allocate exactly 2 * len(s) bytes for storing an > unicode string. It will crash the program or create security problems if > len() return 1 for these characters even if it is the correct result > according to Unicode standard. > > UTF-8 might be very useful in XML or Web processing, which is quite > important in Python programming nowadays. But I think it is more important > to let programmers "understand" the machanism. In C/C++, it is quite common > to use char[] for ASCII (or ANSI) characters and wchar_t for unicode > (actually UTF-16, or UCS-2) characters, so it may be suprising if unicode is > actually "UTF-8" in PyPy. Web programmers who uses CPython may already be > familiar with the differences between bytes (or str in Python2) and unicode > (or str in Python3), it is less likely for them to design their programs > based on special implementations of PyPy. > > 2016-03-07 > ________________________________ > hubo > ________________________________ > > ????Maciej Fijalkowski > ?????2016-03-07 16:46 > ???Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage > ????"hubo" > ???"Armin Rigo","Piotr > Jurkiewicz","PyPy Developer Mailing > List" > > Hi hubo. > > I think you're slightly confusing two things. > > UTF-16 is a variable-length encoding that has two-word characters that > *has to* return "1" for len() of those. 
UCS-2 seems closer to what you > described (which is a fixed-width encoding), but can't encode all the > unicode characters and as such is unsuitable for a modern unicode > representation. > > I'll discard UCS-2 as unsuitable and were we to use UTF-16, then the > slicing and size calculations still has to be as complicated as for > UTF-8. > > Complicated logic in repr() - those are not usually performance > critical parts of your program and it's ok to have some complications > there. > > It's true that UTF-16 can be less efficient than UTF-8 for certain > languages, however both are more memory efficient than what we > currently use (UCS4). There are however some problems - even if you > work exclusively in, say, korean, for example web servers still have > to deal with some parts that are ascii (html markup, css etc.) while > handling text in korean. In those cases UTF8 vs UTF16 is more muddled > and the exact details depend a lot. We also need to consider the fact > that we ship one canonical PyPy to everybody - people using different > languages and different encodings. > > Overall, UTF8 seems like definitely a better alternative than UCS4 > (also for asian languages), which is what we are using now and I would > be inclined to leave UTF16 as an option to see if it performs better > for certain benchmarks. > > Best regards, > Maciej Fijalkowski > > On Mon, Mar 7, 2016 at 9:58 AM, hubo wrote: >> I think it is not reasonable to use UTF-8 to represent the unicode string >> type. >> >> >> 1. Less storage - this is not always true. It is only true for strings >> with >> a lot of ASCII characters. In Asia, most strings in local languages >> (Japanese, Chinese, Korean) are non-ASCII characters, they may consume >> more >> storage than in UTF-16. To make things worse, while it always consumes 2*N >> bytes for a N-characters string in UTF-16, it is difficult to estimate the >> size of a N-characters string in UTF-8 (may be N bytes to 3 * N bytes) >> (UTF-16 also has two-word characters, but len() reports 2 for these >> characters, I think it is not harmful to treat them as two characters) >> >> 2. There would be very complicate logics for size calculating and slicing. >> For UTF-16, every character is represented with a 16-bit integer, so it is >> convient for size calculating and slicing. But character in UTF-8 consumes >> variant bytes, so either we call mb_* string functions instead (which is >> slow in nature) or we use special logic like storing indices of characters >> in another array (which introduces cost for extra addressings). >> >> 3. When displaying with repr(), non-ASCII characters are displayed with >> \uXXXX format. If the internal storage for unicode is UTF-8, the only way >> to >> be compatible with this format is to convert it back to UTF-16. >> >> It may be wiser to let programmers deside which encoding they would like >> to >> use. If they want to process UTF-8 strings without performance cost on >> converting, they should use "bytes". When correct size calculating and >> slicing of non-ASCII characters are concerned it may be better to use >> "unicode". >> >> 2016-03-07 >> ________________________________ >> hubo >> ________________________________ >> >> ????Armin Rigo >> ?????2016-03-05 16:09 >> ???Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage >> ????"Piotr Jurkiewicz" >> ???"PyPy Developer Mailing List" >> >> Hi Piotr, >> >> Thanks for giving some serious thoughts to the utf8-stored unicode >> string proposal! 
>> >> On 5 March 2016 at 01:48, Piotr Jurkiewicz >> wrote: >>> Random access would be as follows: >>> >>> page_num, byte_in_page = divmod(codepoint_pos, 64) >>> page_start_byte = index[page_num] >>> exact_byte = seek_forward(buffer[page_start_byte], byte_in_page) >>> return buffer[exact_byte] >> >> This is the part I'm least sure about: seek_forward() needs to be a >> loop over 0 to 63 codepoints. True, each loop can be branchless, and >> very short---let's say 4 instructions. But it still makes a total of >> up to 252 instructions (plus the checks to know if we must go on). >> These instructions are all or almost all dependent on the previous >> one: you must have finished computing the length of one sequence to >> even being computing the length of the next one. Maybe it's faster to >> use a more "XMM-izable" algorithm which counts 0 for each byte in >> 0x80-0xBF and 1 otherwise, and makes the sum. >> >> There are also variants, e.g. adding a second array of words similar >> to 'index', but where each word is 8 packed bytes giving 8 starting >> points inside the page (each in range 0-252). This would reduce the >> walk to 0-7 codepoints. >> >> I'm +1 on your proposal. The whole thing is definitely worth a try. >> >> >> A bient?t, >> >> Armin. >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev >> >> >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From tritium-list at sdamon.com Mon Mar 7 05:27:38 2016 From: tritium-list at sdamon.com (Alexander Walters) Date: Mon, 7 Mar 2016 05:27:38 -0500 Subject: [pypy-dev] Release 5.0.0 In-Reply-To: References: <56DB4CFB.6020007@gmail.com> <56DC9AAA.1080003@gmail.com> Message-ID: <56DD579A.8080603@sdamon.com> Forking is not an option on windows (it lacks fork.) On 3/7/2016 04:16, Maciej Fijalkowski wrote: > I have no idea how memory management works on windows (I doubt this > will solve it), but this is how we do that on linux > > On Mon, Mar 7, 2016 at 10:55 AM, Yury V. Zaytsev wrote: >> On Sun, 6 Mar 2016, Maciej Fijalkowski wrote: >> >>> It uses subprocess, but you need to quit pypy (so run this with --source >>> and then make separately) for memory to be reclaimed >> >> Do you think that pre-forking a process for compilation right at the >> beginning of the translation when PyPy hasn't consumed much memory yet would >> be a viable solution? >> >> I think if this is practical, it would be a much user friendlier solution as >> compared to two-step process (translation + compilation). If memory serves >> me well, this is one of the strategies that subprocess in Python 3 is using >> to improve on memory consumption. >> >> >> -- >> Sincerely yours, >> Yury V. 
Zaytsev > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From steve at pearwood.info Mon Mar 7 06:45:45 2016 From: steve at pearwood.info (Steven D'Aprano) Date: Mon, 7 Mar 2016 22:45:45 +1100 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> <56DD480B.2070709@jiedaibao.com> Message-ID: <20160307114545.GZ12028@ando.pearwood.info> On Mon, Mar 07, 2016 at 11:31:10AM +0200, Maciej Fijalkowski wrote: > I think you're misunderstanding what we're proposing. > > We're proposing utf8 representation completely hidden from the user, > where everything behaves just like cpython unicode (the len() example > you're showing is a narrow unicode build I presume?) Yes, CPython narrow builds don't handle Unicode code points in the supplementary planes well: they wrongly return len(2) for code points with a 4-byte UTF-16 representation: steve at runes:~$ python2.6 -c "print len(u'\U0010FFFF')" # wide build 1 steve at runes:~$ python2.7 -c "print len(u'\U0010FFFF')" # narrow build 2 That is no longer the case since Python 3.3, when the "flexible string representation" was introduced. https://www.python.org/dev/peps/pep-0393/ I think that it would be a very valuable experiment for PyPy to investigate moving to a UTF-8 internal representation. -- Steve From hubo at jiedaibao.com Mon Mar 7 07:49:24 2016 From: hubo at jiedaibao.com (hubo) Date: Mon, 07 Mar 2016 20:49:24 +0800 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: <20160307114545.GZ12028@ando.pearwood.info> References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> <56DD480B.2070709@jiedaibao.com> <20160307114545.GZ12028@ando.pearwood.info> Message-ID: <56DD78D1.30309@jiedaibao.com> Thanks for the link! It is interesting that in Python3.5, still >>> len(u'\ud805\udc09') 2 >>> u'\ud805\udc09' == u'\U00011409' False I think in Python 3.x, u'\ud805\udc09' is not another format of u'\U00011409', it is just an illegal unicode string. It also raises UnicodeEncodeError if you try to encode it into UTF-8. The problem is that it is legal to define and use these strings. If PyPy uses UTF-8 or UTF-16 as the internal storage format, I don't think it is possible to keep these details same as CPython, but it should be acceptable. Thanks again for the discussion. Unicode is really complicated. 2016-03-07 hubo ????Steven D'Aprano ?????2016-03-07 19:45 ???Re: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage ????"pypy-dev" ??? On Mon, Mar 07, 2016 at 11:31:10AM +0200, Maciej Fijalkowski wrote: > I think you're misunderstanding what we're proposing. > > We're proposing utf8 representation completely hidden from the user, > where everything behaves just like cpython unicode (the len() example > you're showing is a narrow unicode build I presume?) Yes, CPython narrow builds don't handle Unicode code points in the supplementary planes well: they wrongly return len(2) for code points with a 4-byte UTF-16 representation: steve at runes:~$ python2.6 -c "print len(u'\U0010FFFF')" # wide build 1 steve at runes:~$ python2.7 -c "print len(u'\U0010FFFF')" # narrow build 2 That is no longer the case since Python 3.3, when the "flexible string representation" was introduced. 
https://www.python.org/dev/peps/pep-0393/ I think that it would be a very valuable experiment for PyPy to investigate moving to a UTF-8 internal representation. -- Steve _______________________________________________ pypy-dev mailing list pypy-dev at python.org https://mail.python.org/mailman/listinfo/pypy-dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From matti.picus at gmail.com Tue Mar 8 06:49:47 2016 From: matti.picus at gmail.com (matti picus) Date: Tue, 8 Mar 2016 13:49:47 +0200 Subject: [pypy-dev] release seems ready Message-ID: It seems we have a release, version ad5a4e55fa8e. Is there a reason to wait? buildbots http://buildbot.pypy.org/summary?branch=release-5.x release notice http://doc.pypy.org/en/latest/release-5.0.0.html Hopefully we can release 5.1 once s360-x lands on default Matti -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Tue Mar 8 08:41:36 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 8 Mar 2016 15:41:36 +0200 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: yay! can we call it rc1? if noone objects we'll make rc1 the release say in 24 or 48h On Tue, Mar 8, 2016 at 1:49 PM, matti picus wrote: > It seems we have a release, version ad5a4e55fa8e. Is there a reason to wait? > buildbots http://buildbot.pypy.org/summary?branch=release-5.x > release notice http://doc.pypy.org/en/latest/release-5.0.0.html > > Hopefully we can release 5.1 once s360-x lands on default > Matti > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From matti.picus at gmail.com Tue Mar 8 09:15:34 2016 From: matti.picus at gmail.com (matti picus) Date: Tue, 8 Mar 2016 16:15:34 +0200 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: We could package it and upload as rc1, but version_info will not have rc1 unless we rerun the builds. Confusing. I prefer to apologize if we get it wrong and release a 5.0.1 bugfix Matti On Tuesday, 8 March 2016, Maciej Fijalkowski wrote: > yay! > > can we call it rc1? if noone objects we'll make rc1 the release say in 24 > or 48h > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From arigo at tunes.org Tue Mar 8 09:26:10 2016 From: arigo at tunes.org (Armin Rigo) Date: Tue, 8 Mar 2016 15:26:10 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: Hi Matti, On 8 March 2016 at 15:15, matti picus wrote: > We could package it and upload as rc1, but version_info will not have rc1 > unless we rerun the builds. Confusing. > I prefer to apologize if we get it wrong and release a 5.0.1 bugfix +1. Go ahead as far as I'm concerned. About the release notice: "As a result, lxml with its cython compiled component passes all tests on PyPy" is not clear until the next official lxml is released. The current lxml 3.5.0 still contains a partially buggy workaround that tries to make it work on previous versions of cpyext. The trunk version at https://github.com/lxml/lxml has got this code removed, and that's the version that works. I'll make the ppc releases once the other releases are out. 
A bient?t, Armin From phyo.arkarlwin at gmail.com Tue Mar 8 09:22:56 2016 From: phyo.arkarlwin at gmail.com (Phyo Arkar) Date: Tue, 08 Mar 2016 14:22:56 +0000 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: I am going to test it out , quite interesting release. On Tue, Mar 8, 2016 at 8:12 PM Maciej Fijalkowski wrote: > yay! > > can we call it rc1? if noone objects we'll make rc1 the release say in 24 > or 48h > > On Tue, Mar 8, 2016 at 1:49 PM, matti picus wrote: > > It seems we have a release, version ad5a4e55fa8e. Is there a reason to > wait? > > buildbots http://buildbot.pypy.org/summary?branch=release-5.x > > release notice http://doc.pypy.org/en/latest/release-5.0.0.html > > > > Hopefully we can release 5.1 once s360-x lands on default > > Matti > > > > _______________________________________________ > > pypy-dev mailing list > > pypy-dev at python.org > > https://mail.python.org/mailman/listinfo/pypy-dev > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Tue Mar 8 09:36:01 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 8 Mar 2016 16:36:01 +0200 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: I'm ok with making it official 5.0. We can always do 5.0.1 if there are problems On Tue, Mar 8, 2016 at 4:15 PM, matti picus wrote: > We could package it and upload as rc1, but version_info will not have rc1 > unless we rerun the builds. Confusing. > I prefer to apologize if we get it wrong and release a 5.0.1 bugfix > Matti > > On Tuesday, 8 March 2016, Maciej Fijalkowski wrote: >> >> yay! >> >> can we call it rc1? if noone objects we'll make rc1 the release say in 24 >> or 48h >> >> > From fijall at gmail.com Tue Mar 8 09:42:05 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 8 Mar 2016 16:42:05 +0200 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: btw, should we mention packages.pypy.org? On Tue, Mar 8, 2016 at 4:36 PM, Maciej Fijalkowski wrote: > I'm ok with making it official 5.0. We can always do 5.0.1 if there are problems > > On Tue, Mar 8, 2016 at 4:15 PM, matti picus wrote: >> We could package it and upload as rc1, but version_info will not have rc1 >> unless we rerun the builds. Confusing. >> I prefer to apologize if we get it wrong and release a 5.0.1 bugfix >> Matti >> >> On Tuesday, 8 March 2016, Maciej Fijalkowski wrote: >>> >>> yay! >>> >>> can we call it rc1? if noone objects we'll make rc1 the release say in 24 >>> or 48h >>> >>> >> From arigo at tunes.org Tue Mar 8 09:46:26 2016 From: arigo at tunes.org (Armin Rigo) Date: Tue, 8 Mar 2016 15:46:26 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: Hi, On 8 March 2016 at 15:42, Maciej Fijalkowski wrote: > btw, should we mention packages.pypy.org? I would do so but only under two conditions: * it reports a post-cpyext-fixes result: which packages run or don't run now, ideally on the current "release 5.0" branch, but at least after the merge of the cpyext-gc-support-2 branch * we quickly review and fix the few manual comments, notably lxml's (we no longer recommend lxml-cffi). 
A bient?t, Armin From fijall at gmail.com Tue Mar 8 09:52:56 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 8 Mar 2016 16:52:56 +0200 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: Cool, I'm happy to do the suggested fixes. We rerun it every release usually, changes by hand are done earlier. Should I start a run on the current release branch? On Tue, Mar 8, 2016 at 4:46 PM, Armin Rigo wrote: > Hi, > > On 8 March 2016 at 15:42, Maciej Fijalkowski wrote: >> btw, should we mention packages.pypy.org? > > I would do so but only under two conditions: > > * it reports a post-cpyext-fixes result: which packages run or don't > run now, ideally on the current "release 5.0" branch, but at least > after the merge of the cpyext-gc-support-2 branch > > * we quickly review and fix the few manual comments, notably lxml's > (we no longer recommend lxml-cffi). > > > A bient?t, > > Armin From fijall at gmail.com Tue Mar 8 09:53:19 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 8 Mar 2016 16:53:19 +0200 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: in other words, it shows the last release of pypy, not "trunk" On Tue, Mar 8, 2016 at 4:52 PM, Maciej Fijalkowski wrote: > Cool, I'm happy to do the suggested fixes. > > We rerun it every release usually, changes by hand are done earlier. > Should I start a run on the current release branch? > > On Tue, Mar 8, 2016 at 4:46 PM, Armin Rigo wrote: >> Hi, >> >> On 8 March 2016 at 15:42, Maciej Fijalkowski wrote: >>> btw, should we mention packages.pypy.org? >> >> I would do so but only under two conditions: >> >> * it reports a post-cpyext-fixes result: which packages run or don't >> run now, ideally on the current "release 5.0" branch, but at least >> after the merge of the cpyext-gc-support-2 branch >> >> * we quickly review and fix the few manual comments, notably lxml's >> (we no longer recommend lxml-cffi). >> >> >> A bient?t, >> >> Armin From fijall at gmail.com Tue Mar 8 10:11:56 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 8 Mar 2016 17:11:56 +0200 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: ugh, btw, it seems someone broke embedding (as advertised, probably the cffi embedding still works) On Tue, Mar 8, 2016 at 4:53 PM, Maciej Fijalkowski wrote: > in other words, it shows the last release of pypy, not "trunk" > > On Tue, Mar 8, 2016 at 4:52 PM, Maciej Fijalkowski wrote: >> Cool, I'm happy to do the suggested fixes. >> >> We rerun it every release usually, changes by hand are done earlier. >> Should I start a run on the current release branch? >> >> On Tue, Mar 8, 2016 at 4:46 PM, Armin Rigo wrote: >>> Hi, >>> >>> On 8 March 2016 at 15:42, Maciej Fijalkowski wrote: >>>> btw, should we mention packages.pypy.org? >>> >>> I would do so but only under two conditions: >>> >>> * it reports a post-cpyext-fixes result: which packages run or don't >>> run now, ideally on the current "release 5.0" branch, but at least >>> after the merge of the cpyext-gc-support-2 branch >>> >>> * we quickly review and fix the few manual comments, notably lxml's >>> (we no longer recommend lxml-cffi). 
>>> >>> >>> A bient?t, >>> >>> Armin From arigo at tunes.org Tue Mar 8 10:16:21 2016 From: arigo at tunes.org (Armin Rigo) Date: Tue, 8 Mar 2016 16:16:21 +0100 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: <56DD78D1.30309@jiedaibao.com> References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> <56DD480B.2070709@jiedaibao.com> <20160307114545.GZ12028@ando.pearwood.info> <56DD78D1.30309@jiedaibao.com> Message-ID: Hi hubo, On 7 March 2016 at 13:49, hubo wrote: > I think in Python 3.x, u'\ud805\udc09' is not another format of > u'\U00011409', it is just an illegal unicode string. It also raises > UnicodeEncodeError if you try to encode it into UTF-8. The problem is that > it is legal to define and use these strings. If PyPy uses UTF-8 or UTF-16 as > the internal storage format, I don't think it is possible to keep these > details same as CPython, but it should be acceptable. We're good at keeping obscure details the same as CPython. It's only a matter of adding the correct checks on top of the encode() and decode() methods, independently of the underlying representation. In this case, because we can consider the length-1 unicode string u'\ud805', then we have to internally represent it somehow, and the natural way would be to represent it as the 3 bytes '\xed\xa0\x85'. So for u'\ud805\udc09' we use 6 bytes. Strictly speaking, we're thus not using utf-8 internally, but "utf-8-without-extra-consistency-checks". In Python 2, u'\ud805\udc09'.decode('utf-8') returns '\xf0\x91\x90\x89', i.e. a single code point of 4 bytes. This means that calling ``decode('utf-8')`` has to check for surrogates, and do something more complicated on Python 2.x (or complain on Python 3.x). In other words, neither ``decode('utf-8')`` nor ``encode('utf-8')`` can be no-ops. Decoding and encoding need to check the data, and might actually need to make a copy in corner cases, but not in the vast majority of cases. This is all focused on the web and generally Linux approach of "utf-8 everywhere". For Windows, the story is more complicated. CPython 2.x uses UTF-16, like the Windows API. However, the recent CPython 3.x moved anyway towards a variable-encoding model of UCS-4 (==UTF-32). If you are on a recent CPython 3.x and build a unicode object with a large codepoint, and then call the Windows API with it, it will need anyway to convert it to UTF-16 dynamically, as far as I can tell---i.e. convert from UCS-4 to UTF-16. In the proposal that is discussed here, it would instead have to convert from utf-8-without-extra-consistency-checks to UTF-16 in that situation. There are definitely trade-offs to explore, but I doubt that we can fully explore these trade-offs without actually trying it out. A bient?t, Armin. From robin.kruppe at gmail.com Tue Mar 8 11:10:57 2016 From: robin.kruppe at gmail.com (Robin Kruppe) Date: Tue, 8 Mar 2016 17:10:57 +0100 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> <56DD480B.2070709@jiedaibao.com> <20160307114545.GZ12028@ando.pearwood.info> <56DD78D1.30309@jiedaibao.com> Message-ID: Hi all, I just wanted to mention that several other language implementors have faced the same problem of dealing with "UTF-16" containing lone surrogate code points and representing it in "UTF-8", and they have come up with essentially the same solution. 
Users include the Racket, Scheme 48, and Rust languages (all three only for I/O on Windows) and the Servo browser engine (for the sake of JavaScript). Recently Simon Sapin of Mozilla has spec'd this trick in exhausting detail, christening it WTF-8: https://simonsapin.github.io/wtf-8/ While everything described there may be pretty obvious (for those immersed in the guts of Unicode), I wanted to raise awareness that this has a name and other users. Cheers, Robin On 8 March 2016 at 16:16, Armin Rigo wrote: > Hi hubo, > > On 7 March 2016 at 13:49, hubo wrote: > > I think in Python 3.x, u'\ud805\udc09' is not another format of > > u'\U00011409', it is just an illegal unicode string. It also raises > > UnicodeEncodeError if you try to encode it into UTF-8. The problem is > that > > it is legal to define and use these strings. If PyPy uses UTF-8 or > UTF-16 as > > the internal storage format, I don't think it is possible to keep these > > details same as CPython, but it should be acceptable. > > We're good at keeping obscure details the same as CPython. It's only > a matter of adding the correct checks on top of the encode() and > decode() methods, independently of the underlying representation. > > In this case, because we can consider the length-1 unicode string > u'\ud805', then we have to internally represent it somehow, and the > natural way would be to represent it as the 3 bytes '\xed\xa0\x85'. > So for u'\ud805\udc09' we use 6 bytes. Strictly speaking, we're thus > not using utf-8 internally, but > "utf-8-without-extra-consistency-checks". In Python 2, > u'\ud805\udc09'.decode('utf-8') returns '\xf0\x91\x90\x89', i.e. a > single code point of 4 bytes. This means that calling > ``decode('utf-8')`` has to check for surrogates, and do something more > complicated on Python 2.x (or complain on Python 3.x). In other > words, neither ``decode('utf-8')`` nor ``encode('utf-8')`` can be > no-ops. Decoding and encoding need to check the data, and might > actually need to make a copy in corner cases, but not in the vast > majority of cases. > > This is all focused on the web and generally Linux approach of "utf-8 > everywhere". For Windows, the story is more complicated. CPython 2.x > uses UTF-16, like the Windows API. However, the recent CPython 3.x > moved anyway towards a variable-encoding model of UCS-4 (==UTF-32). > If you are on a recent CPython 3.x and build a unicode object with a > large codepoint, and then call the Windows API with it, it will need > anyway to convert it to UTF-16 dynamically, as far as I can > tell---i.e. convert from UCS-4 to UTF-16. In the proposal that is > discussed here, it would instead have to convert from > utf-8-without-extra-consistency-checks to UTF-16 in that situation. > > There are definitely trade-offs to explore, but I doubt that we can > fully explore these trade-offs without actually trying it out. > > > A bient?t, > > Armin. > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From arigo at tunes.org Tue Mar 8 11:30:12 2016 From: arigo at tunes.org (Armin Rigo) Date: Tue, 8 Mar 2016 17:30:12 +0100 Subject: [pypy-dev] Interest in GSoC project: UTF-8 internal unicode storage In-Reply-To: References: <56DA2CE9.5070409@gmail.com> <56DD3493.8020800@jiedaibao.com> <56DD480B.2070709@jiedaibao.com> <20160307114545.GZ12028@ando.pearwood.info> <56DD78D1.30309@jiedaibao.com> Message-ID: Hi Robin, On 8 March 2016 at 17:10, Robin Kruppe wrote: > I just wanted to mention that several other language implementors have faced > ... > While everything described there may be pretty obvious (for those immersed > in the guts of Unicode), I wanted to raise awareness that this has a name > and other users. Thanks! We'd be using the "generalized UTF-8" from https://simonsapin.github.io/wtf-8/, in principle. We'd not be using WTF-8 because it considers that u'\ud805\udc09' == u'\U00011409', whereas CPython does not, generally. A bient?t, Armin. From djkonro35 at gmail.com Wed Mar 9 08:12:03 2016 From: djkonro35 at gmail.com (Djimeli Konrad) Date: Wed, 9 Mar 2016 14:12:03 +0100 Subject: [pypy-dev] Interest in contributing to PYPY Message-ID: Hello, My name is Djimeli Konrad a second year computer science student from the University of Buea, Cameroon. I am proficient in c, c++, javascript and python. I would like to contribute to PYPY for the Google Summer of Code 2016. I am interested in working on the project "Improving the jitviewer". I have previous experience developing Django/Python applications ( https://github.com/MCQuizzer/mcquizzer/graphs/contributors ), VRML-STL parser hosted on github ( https://github.com/djkonro/vrml-stl ) and other project ( https://github.com/djkonro ). I would like to work on this project within and beyond GSoC and as I have always sought for such a project ever since I learned python and web application development.I would like to get some pointer to some starting point that could give me a better understanding of the project. Thanks Konrad -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Thu Mar 10 01:41:03 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 10 Mar 2016 08:41:03 +0200 Subject: [pypy-dev] Interest in contributing to PYPY In-Reply-To: References: Message-ID: Hi! Good to hear from you :-) Any chance you can pop in to IRC, so we can discuss the project? Alternatively you can catch me on gmail on this address Best regards, Maciej Fijalkowski On Wed, Mar 9, 2016 at 3:12 PM, Djimeli Konrad wrote: > Hello, > > My name is Djimeli Konrad a second year computer science student from the > University of Buea, Cameroon. I am proficient in c, c++, javascript and > python. I would like to contribute to PYPY for the Google Summer of Code > 2016. I am interested in working on the project "Improving the jitviewer". I > have previous experience developing Django/Python applications ( > https://github.com/MCQuizzer/mcquizzer/graphs/contributors ), VRML-STL > parser hosted on github ( https://github.com/djkonro/vrml-stl ) and other > project ( https://github.com/djkonro ). I would like to work on this > project within and beyond GSoC and as I have always sought for such a > project ever since I learned python and web application development.I would > like to get some pointer to some starting point that could give me a better > understanding of the project. 
> > Thanks > Konrad > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From ishankhare07 at gmail.com Thu Mar 10 11:31:36 2016 From: ishankhare07 at gmail.com (Ishan Khare) Date: Thu, 10 Mar 2016 16:31:36 +0000 Subject: [pypy-dev] Contribute in GSOC Message-ID: Hi, I am a newcomer to contributing to pypy, but I'm fairly good in python & c. I would like to contribute to PyPy. Are all ideas listed in Potential project list eligible for GSOC. Where should I probably get started? Regards, Ishan -------------- next part -------------- An HTML attachment was scrubbed... URL: From cfbolz at gmx.de Fri Mar 11 09:09:31 2016 From: cfbolz at gmx.de (Carl Friedrich Bolz) Date: Fri, 11 Mar 2016 15:09:31 +0100 Subject: [pypy-dev] Call for Papers Programming Experience 2016 Message-ID: <56E2D19B.3030505@gmx.de> Call for Papers *** Programming Experience 2016 (PX/16) Workshop *** July 18 (Mon), 2016 Co-located with ECOOP 2016 in Rome 2016.ecoop.org/track/PX-2016 programming-experience.org/px16 === Abstract === Imagine a software development task. Some sort of requirements and specification including performance goals and perhaps a platform and programming language. A group of developers head into a vast workroom. The Programming Experience Workshop is about what happens in that room when one or a couple of programmers sit down in front of computers and produce code, especially when it's exploratory programming. Do they create text that is transformed into running behavior (the old way), or do they operate on behavior directly ("liveness"); are they exploring the live domain to understand the true nature of the requirements; are they like authors creating new worlds; does visualization matter; is the experience immediate, immersive, vivid and continuous; do fluency, literacy, and learning matter; do they build tools, meta-tools; are they creating languages to express new concepts quickly and easily; and curiously, is joy relevant to the experience? Correctness, performance, standard tools, foundations, and text-as-program are important traditional research areas, but the experience of programming and how to improve and evolve it are the focus of this workshop. === Submissions === Submissions are solicited for Programming Experience 2016 (PX/16). The thrust of the workshop is to explore the human experience of programming?what it feels like to program, or more accurately, what it should feel like. The technical topics include exploratory programming, live programming, authoring, representation of active content, visualization, navigation, modularity mechanisms, immediacy, literacy, fluency, learning, tool building, and language engineering. Submissions by academics, professional programmers, and non-professional programmer are welcome. Submissions can be in any form and format, including but not limited to papers, presentations, demos, videos, panels, debates, essays, writers' workshops, and art. Presentation slots will be between 30 minutes and one hour, depending on quality, form, and relevance to the workshop. Submissions directed toward publication should be so marked, and the program committee will engage in peer review for all such papers. Video publication will be arranged. All artifacts are to be submitted via EasyChair (https://easychair.org/conferences/?conf=px16). 
Papers and essays must be written in English, provided as PDF documents, and follow the ACM SIGPLAN Conference Format (10 point font, Times New Roman font family, numeric citation style, http://www.sigplan.org/Resources/Author/). There is no page limit on submitted papers and essays. It is, however, the responsibility of the authors to keep the reviewers interested and motivated to read the paper. Reviewers are under no obligation to read all or even a substantial portion of a paper or essay if they do not find the initial part of it interesting. === Format === Paper presentations, presentations without papers, live demonstrations, performances, videos, panel discussions, debates, writers' workshops, art galleries, dramatic readings. === Review === Papers and essays labeled as publications will undergo standard peer review; other submissions will be reviewed for relevance and quality; shepherding will be available. === Important dates === Submissions: April 15, 2016 (anywhere in the world) Notifications: May 13, 2016 PX/16: July 18, 2016 === Publication === Papers and essays accepted through peer review will be published as part of ACM's Digital Library; video publication on Vimeo or other streaming site; other publication on the PX workshop website. === Organizers === Robert Hirschfeld, Hasso Plattner Institute, University of Potsdam, Germany Richard P. Gabriel, Dreamsongs and IBM Almaden Research Center, United States Hidehiko Masuhara, Mathematical and Computing Science, Tokyo Institute of Technology, Japan === Program committee === Carl Friedrich Bolz, King's College London, United Kingdom Gilad Bracha, Google, United States Andrew Bragdon, Twitter, United States Jonathan Edwards, CDG Labs, United States Jun Kato, National Institute of Advanced Industrial Science and Technology, Japan Cristina Videira Lopes, University of California at Irvine, United States Yoshiki Ohshima, Viewpoints Research Institute, United States Michael Perscheid, SAP Innovation Center, Germany Guido Salvaneschi, TU Darmstadt, Germany Marcel Taeumel, Hasso Plattner Institute, University of Potsdam, Germany Alessandro Warth, SAP Labs, United States From nkumar736 at gmail.com Fri Mar 11 16:58:30 2016 From: nkumar736 at gmail.com (Naveen Kumar) Date: Sat, 12 Mar 2016 03:28:30 +0530 Subject: [pypy-dev] GSoC 2016 Message-ID: Hello, I'm Naveen Kumar, an Information Science Engineering student from Bangalore, India. I got to know about PyPy from a book that I started studying the book "Expert Python Programming" by Tarek Ziad? and I was totally Intrigued. I take this opportunity to be a part of the community and contribute actively. As for me, I've been using Python from the past 8 months and I built a Blog using Flask (following the footsteps of Miguel Grinberg) with features like a Music Player. Other than that, I do not have much of an experience. Again, I'd love to be a part of the community and I'd like to be guided on how to go about it. Thanks, Naveen (nkumar736 at gmail.com) -------------- next part -------------- An HTML attachment was scrubbed... URL: From pabi.lenka at gmail.com Sat Mar 12 05:52:43 2016 From: pabi.lenka at gmail.com (Pabitra Lenka) Date: Sat, 12 Mar 2016 16:22:43 +0530 Subject: [pypy-dev] TO GET STARTED Message-ID: Greetings Developers, I am a newbie.I would like to contribute to your organization.Can anyone get me started.? 
-- Cheers, Pabitra Lenka Department of Information Technology Class of 2018 IIIT Bhubaneswar From djkonro35 at gmail.com Mon Mar 14 04:46:31 2016 From: djkonro35 at gmail.com (Djimeli Konrad) Date: Mon, 14 Mar 2016 09:46:31 +0100 Subject: [pypy-dev] Fwd: Interest in contributing to PYPY In-Reply-To: References: Message-ID: Hello, As discoursed on IRC, I am trying to develop a parser for Jitviewer, that is not dependent on rpython for my first patch. I am new to Pypy and I would like to get some help/pointer that would help me accomplish this task. Mainly resources on how log files are generated. I would also like to get more details on what improvements are to be done with respect Jitviewer, for GSOC 2016, as application are about to start. So far in trying to generate a log file, I have tried the following commands; PYPYLOG=jit-backend:/home/konro/jitviewer/logfile pypy ../source.py (to generate the log file) and I got the following output http://pastebin.com/xv7nS1i2 But when I try to view the file with Jitviewer, I get errors http://pastebin.com/LFBB12sj Please I need some help to identify what I am doing wrong. Thanks Konrad From nzinov at gmail.com Mon Mar 14 14:15:45 2016 From: nzinov at gmail.com (=?UTF-8?B?0J3QuNC60L7Qu9Cw0Lkg0JfQuNC90L7Qsg==?=) Date: Mon, 14 Mar 2016 18:15:45 +0000 Subject: [pypy-dev] Copy-on-write list slicing as GSoC project Message-ID: Hello dear PyPy developers, My name is Nikolay Zinov. I am a sophomore student at Moscow Institute of Physics and Technology. I am very interested in contributing to PyPy as a GSoC project. I found implementing copy-on-write list slicing particularly interesting for me. Below go my ideas. Note, that at some places I see different possible choices so I need feedback. 1. What we want to get is *myslice = mylist[a:b]* only cause data copying if *myslice* or *mylist* are mutated. 2. This can be implemented by creating a special list strategy. When getslice operation is performed, the original list is switched to that strategy and a new list with shared storage is created. Storage layout is a tuple of reference counter and the underlying RPython list. This storage would be shared between several W_ListObject instances. A field containing slice object representing would be added to the W_ListObject. List operations are implemented as follows: non modifying ops perform indices conversion and proxy the call to the underlying strategy; modifying ops cause new list creation with normal strategy. If a slice of a slice is taken we can calculate the resulting slice of the original list. 3. Some drawbacks of this solution. a) Additional field (slice object) added to W_ListObject. Another option would be to make this value a part of the storage. However, this value is unique for the slice while other data are shared. Therefore, it would require an additional level of indirection with the W_ListObject pointing to some header which in its turn points to shared data. b) If the original list is modified it is copied and not the (probably smaller) slice. The solution would be quite complicated with the original list storing references to all its slices. The good thing is that this scenario (create a slice -> modify the original list) is quite rare (or it would be if not for the next problem). c) Copy-on-write is inefficient in a GC'd environment. Abandoned slice can take a while to be freed and till then it will block modifying operations on the original list. 
I see no good solution for this problem other than keeping the
reference counter in the slice instance, which is probably not a good
idea.

4. With regard to the last problem it is interesting to consider
omitting the reference counter on the shared data and copying always.
It would save another level of indirection, and would have little
impact on performance if the slices are not freed anyway.

5. Benchmarks should be done to find the cutoff length at which this
strategy gives a performance benefit over blind copying.

Please give me your feedback on this idea and the feasibility of its
becoming a GSoC project.

Cheers,
Nikolay Zinov
nzinov at gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From m at magnusmorton.com  Mon Mar 14 22:32:45 2016
From: m at magnusmorton.com (Magnus Morton)
Date: Tue, 15 Mar 2016 02:32:45 +0000
Subject: [pypy-dev] setting attribute of JitHookInterface instance
Message-ID: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com>

Hi,

I'm attempting to use the JitHookInterface to implement something like
the PyPy JIT hooks in pycket. However, I'm struggling to do anything
other than print information to stdout. From what I understand in pypy,
the pypyjit.hooks.pypy_hooks object is instantiated, and then after the
ObjSpace is initialised, it is assigned to pypy_hooks.space in
setup_after_space_initialization. In my case, when I assign anything to
an attribute of my JitHookInterface instance, translation blows up with

[translation:ERROR] MissingRTypeAttribute: on_abort
[translation:ERROR] .. (rpython.jit.metainterp.pyjitpl:2224)MetaInterp.aborted_tracing
[translation:ERROR] .. block at 59 with 2 exits(v1678)
[translation:ERROR] .. v1680 = getattr(v1679, ('on_abort'))

If any pycket people are reading this, what I'm trying to do at the
moment is give a JitHookInterface instance access to the module table
somehow. Copying the pypy JIT hooks approach is not strictly necessary -
I'd be happy with being able to update anything from within a
JitHookInterface callback which could then be accessed by application
level code.

Obviously, my understanding of what's going on here is lacking somewhat.
If anyone could point me in the correct general direction, I'd be very
grateful.

Best regards,
Magnus

From arigo at tunes.org  Tue Mar 15 06:48:01 2016
From: arigo at tunes.org (Armin Rigo)
Date: Tue, 15 Mar 2016 11:48:01 +0100
Subject: [pypy-dev] setting attribute of JitHookInterface instance
In-Reply-To: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com>
References: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com>
Message-ID:

Hi Magnus,

On 15 March 2016 at 03:32, Magnus Morton wrote:
> [translation:ERROR] MissingRTypeAttribute: on_abort
> [translation:ERROR] .. (rpython.jit.metainterp.pyjitpl:2224)MetaInterp.aborted_tracing
> [translation:ERROR] .. block at 59 with 2 exits(v1678)
> [translation:ERROR] .. v1680 = getattr(v1679, ('on_abort'))

This says that 'on_abort' is not found. Are you sure you have, like
pypy/module/pypyjit/hooks.py, written a JitHookInterface subclass
which provides all the same 'on_*' methods?


A bientôt,

Armin.
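For illustration, the shape of the subclass Armin is describing might look roughly like the sketch below. The import path is real, but the class name is invented and the on_abort signature shown is an assumption from around this era; the exact set of callbacks and their signatures should be copied from rpython/rlib/jit.py and pypy/module/pypyjit/hooks.py, not from here.

    from rpython.rlib.jit import JitHookInterface

    class PycketJitHooks(JitHookInterface):   # invented name
        # every callback the base class defines needs a body here; the
        # signature below is approximate -- copy the real one from
        # rpython/rlib/jit.py
        def on_abort(self, reason, jitdriver, greenkey, greenkey_repr,
                     logops, operations):
            pass
        # ... and likewise for every other callback the base class lists

    pycket_hooks = PycketJitHooks()           # single prebuilt instance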
From m at magnusmorton.com Tue Mar 15 10:45:54 2016 From: m at magnusmorton.com (Magnus Morton) Date: Tue, 15 Mar 2016 14:45:54 +0000 Subject: [pypy-dev] setting attribute of JitHookInterface instance In-Reply-To: References: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com> Message-ID: <495E75D7-C643-4B62-BD1B-AEB47757964D@magnusmorton.com> Hi Amin, Yes, it has all the methods defined. If I take out the assignment, but still define a JitPolicy with the hooks, it translates fine. Cheers, Magnus > On 15 Mar 2016, at 10:48, Armin Rigo wrote: > > Hi Magnus, > > On 15 March 2016 at 03:32, Magnus Morton wrote: >> [translation:ERROR] MissingRTypeAttribute: on_abort >> [translation:ERROR] .. (rpython.jit.metainterp.pyjitpl:2224)MetaInterp.aborted_tracing >> [translation:ERROR] .. block at 59 with 2 exits(v1678) >> [translation:ERROR] .. v1680 = getattr(v1679, ('on_abort')) > > This says that 'on_abort' is not found. Are you sure you have, like > pypy/module/pypyjit/hooks.py, written a JitHookInterface subclass > which provides all the same 'on_*' methods? > > > A bient?t, > > Armin. From arigo at tunes.org Tue Mar 15 11:32:14 2016 From: arigo at tunes.org (Armin Rigo) Date: Tue, 15 Mar 2016 16:32:14 +0100 Subject: [pypy-dev] setting attribute of JitHookInterface instance In-Reply-To: <495E75D7-C643-4B62-BD1B-AEB47757964D@magnusmorton.com> References: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com> <495E75D7-C643-4B62-BD1B-AEB47757964D@magnusmorton.com> Message-ID: Hi Magnus, On 15 March 2016 at 15:45, Magnus Morton wrote: > Yes, it has all the methods defined. If I take out the assignment, but still define a JitPolicy with the hooks, it translates fine. Can't help, I would need to reproduce the problem first. Please give step-by-step instructions about how to reach that error. Armin From m at magnusmorton.com Tue Mar 15 20:37:14 2016 From: m at magnusmorton.com (Magnus Morton) Date: Wed, 16 Mar 2016 00:37:14 +0000 Subject: [pypy-dev] setting attribute of JitHookInterface instance In-Reply-To: References: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com> <495E75D7-C643-4B62-BD1B-AEB47757964D@magnusmorton.com> Message-ID: <865995FE-2618-4927-A944-D7C047480603@magnusmorton.com> Hi Armin, You can recreate it in PyPy by putting the following two lines pretty much anywhere in interpreter level code other than the setup_after_space_initialization methods from pypy.module.pypyjit.hooks import pypy_hooks pypy_hooks.foo = ?foo? What I can?t understand is what is special about the setup_after_space_initialization methods that makes it work there. Cheers, Magnus > On 15 Mar 2016, at 15:32, Armin Rigo wrote: > > Hi Magnus, > > On 15 March 2016 at 15:45, Magnus Morton wrote: >> Yes, it has all the methods defined. If I take out the assignment, but still define a JitPolicy with the hooks, it translates fine. > > Can't help, I would need to reproduce the problem first. Please give > step-by-step instructions about how to reach that error. 
>
> Armin

From arigo at tunes.org  Wed Mar 16 04:45:56 2016
From: arigo at tunes.org (Armin Rigo)
Date: Wed, 16 Mar 2016 09:45:56 +0100
Subject: [pypy-dev] setting attribute of JitHookInterface instance
In-Reply-To: <865995FE-2618-4927-A944-D7C047480603@magnusmorton.com>
References: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com>
 <495E75D7-C643-4B62-BD1B-AEB47757964D@magnusmorton.com>
 <865995FE-2618-4927-A944-D7C047480603@magnusmorton.com>
Message-ID:

Hi Magnus,

On 16 March 2016 at 01:37, Magnus Morton wrote:
> You can recreate it in PyPy by putting the following two lines pretty
> much anywhere in interpreter level code other than the
> setup_after_space_initialization methods
>
> from pypy.module.pypyjit.hooks import pypy_hooks
> pypy_hooks.foo = 'foo'
>
> What I can't understand is what is special about the
> setup_after_space_initialization methods that makes it work there.

Reproduced and figured it out. Added some docs in eda9fd6a0601:

+ # WARNING: You should make a single prebuilt instance of a subclass
+ # of this class. You can, before translation, initialize some
+ # attributes on this instance, and then read or change these
+ # attributes inside the methods of the subclass. But this prebuilt
+ # instance *must not* be seen during the normal annotation/rtyping
+ # of the program! A line like ``pypy_hooks.foo = ...`` must not
+ # appear inside your interpreter's RPython code.

In PyPy, setup_after_space_initialization() is not RPython (which means
it is executed before translation).


A bientôt,

Armin.

From m at magnusmorton.com  Wed Mar 16 07:34:55 2016
From: m at magnusmorton.com (Magnus Morton)
Date: Wed, 16 Mar 2016 11:34:55 +0000
Subject: [pypy-dev] setting attribute of JitHookInterface instance
In-Reply-To:
References: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com>
 <495E75D7-C643-4B62-BD1B-AEB47757964D@magnusmorton.com>
 <865995FE-2618-4927-A944-D7C047480603@magnusmorton.com>
Message-ID:

Hi Armin,

Thanks for looking into this. Is this pre-translation code a general
thing possible with any RPython based compiler, or is it very PyPy
specific?

Cheers,
Magnus

> On 16 Mar 2016, at 08:45, Armin Rigo wrote:
>
> Hi Magnus,
>
> On 16 March 2016 at 01:37, Magnus Morton wrote:
>> You can recreate it in PyPy by putting the following two lines pretty
>> much anywhere in interpreter level code other than the
>> setup_after_space_initialization methods
>>
>> from pypy.module.pypyjit.hooks import pypy_hooks
>> pypy_hooks.foo = 'foo'
>>
>> What I can't understand is what is special about the
>> setup_after_space_initialization methods that makes it work there.
>
> Reproduced and figured it out. Added some docs in eda9fd6a0601:
>
> + # WARNING: You should make a single prebuilt instance of a subclass
> + # of this class. You can, before translation, initialize some
> + # attributes on this instance, and then read or change these
> + # attributes inside the methods of the subclass. But this prebuilt
> + # instance *must not* be seen during the normal annotation/rtyping
> + # of the program! A line like ``pypy_hooks.foo = ...`` must not
> + # appear inside your interpreter's RPython code.
>
> In PyPy, setup_after_space_initialization() is not RPython (which means
> it is executed before translation).
>
>
> A bientôt,
>
> Armin.
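To make the warning above concrete, here is a small sketch of the intended pattern. All names are invented and MyHooks only stands in for a real JitHookInterface subclass; the real hook signature is deliberately elided.

    class MyHooks(object):            # stands in for a JitHookInterface subclass
        def on_abort(self, *args):    # a hook callback; real signature elided
            # inside the methods of the subclass it is fine to read or
            # change attributes that were initialised before translation
            self.abort_count += 1

    my_hooks = MyHooks()              # the single prebuilt instance
    my_hooks.abort_count = 0          # fine: runs at import time, before translation

    # What must *not* happen is the prebuilt instance being seen during the
    # normal annotation/rtyping of the program: a line like
    # `my_hooks.foo = 'foo'` inside the interpreter's RPython code is what
    # produces errors such as the MissingRTypeAttribute seen earlier.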
From fijall at gmail.com Wed Mar 16 07:59:52 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Wed, 16 Mar 2016 13:59:52 +0200 Subject: [pypy-dev] setting attribute of JitHookInterface instance In-Reply-To: References: <13C58648-F51F-46B5-A14F-C5E23CE5ACA2@magnusmorton.com> <495E75D7-C643-4B62-BD1B-AEB47757964D@magnusmorton.com> <865995FE-2618-4927-A944-D7C047480603@magnusmorton.com> Message-ID: It's general. You can do whatever you like before runtime (during import time for example) as long as the presented world to rpython is static enough - in other words Python is a meta-programming language for RPython On Wed, Mar 16, 2016 at 1:34 PM, Magnus Morton wrote: > Hi Armin, > > Thanks for looking into this. Is this pre-translation code a general thing possible with any RPython based compiler, or is it very PyPy specific? > > Cheers, > Magnus > >> On 16 Mar 2016, at 08:45, Armin Rigo wrote: >> >> Hi Magnus, >> >> On 16 March 2016 at 01:37, Magnus Morton wrote: >>> You can recreate it in PyPy by putting the following two lines pretty much anywhere in interpreter level code other than the setup_after_space_initialization methods >>> >>> from pypy.module.pypyjit.hooks import pypy_hooks >>> pypy_hooks.foo = ?foo? >>> >>> What I can?t understand is what is special about the setup_after_space_initialization methods that makes it work there. >> >> Reproduced and figured it out. Added some docs in eda9fd6a0601: >> >> + # WARNING: You should make a single prebuilt instance of a subclass >> + # of this class. You can, before translation, initialize some >> + # attributes on this instance, and then read or change these >> + # attributes inside the methods of the subclass. But this prebuilt >> + # instance *must not* be seen during the normal annotation/rtyping >> + # of the program! A line like ``pypy_hooks.foo = ...`` must not >> + # appear inside your interpreter's RPython code. >> >> In PyPy, setup_after_space_initialization() is not RPython (which means >> it is executed before translation). >> >> >> A bient?t, >> >> Armin. > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From mount.sarah at gmail.com Wed Mar 16 12:30:04 2016 From: mount.sarah at gmail.com (Sarah Mount) Date: Wed, 16 Mar 2016 16:30:04 +0000 Subject: [pypy-dev] Software benchmarking workshop, April 20, King's College London Message-ID: Dear all, PyPy developers in the UK may be interested in this event on the topic of software benchmarking. Registration will remain open until April 6th. If you have any questions please feel free to email me directly (off list). Best Practices in Software Benchmarking 2016 (#bench16) Wednesday April 20 2016 King's College London http://soft-dev.org/events/bench16/ For computer scientists and software engineers, benchmarking (evaluating the running time of a piece of software, or the performance of a piece of hardware) is a common method for evaluating new techniques. However, there is little agreement on how benchmarking should be carried out, how to control for confounding variables, how to analyse latency data, or how to aid the repeatability of experiments. This free workshop will be a venue for computer scientists and research software engineers to discuss their current best practices and future directions. 
For further information and free registration please visit: http://soft-dev.org/events/bench16/ Confirmed Speakers: Jan Vitek (Northeastern University) Joe Parker (The Jodrell Laboratory, Royal Botanic Gardens) Simon Taylor (University of Lancaster) Tomas Kalibera (Northeastern University) James Davenport (University of Bath) Edd Barrett (King's College London) Jeremy Bennett (Embecosm) Organizers: Sarah Mount & Laurence Tratt (King's College London) From lists at sonnenglanz.net Wed Mar 16 12:32:05 2016 From: lists at sonnenglanz.net (Pim van der Eijk (Lists)) Date: Wed, 16 Mar 2016 17:32:05 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: Message-ID: <56E98A85.6000503@sonnenglanz.net> Did the lxml project indicate they will provide a new release soon that incorporates these fixes? I tried to build the latest development code from source, but run into many issues (lxml build server down, source package missing the pre-generated C code etc. etc.), and customer company policy wouldn't allow using a development version in production anyway. The lxml 3.5.0 does not install with pypy-5.0.0 (it used to with pypy-4.0.1, though it was too buggy to be useful), and the lxml-cffi no longer installs. On 08-03-16 15:26, Armin Rigo wrote: > Hi Matti, > > On 8 March 2016 at 15:15, matti picus wrote: >> We could package it and upload as rc1, but version_info will not have rc1 >> unless we rerun the builds. Confusing. >> I prefer to apologize if we get it wrong and release a 5.0.1 bugfix > +1. Go ahead as far as I'm concerned. > > About the release notice: "As a result, lxml with its cython compiled > component passes all tests on PyPy" is not clear until the next > official lxml is released. The current lxml 3.5.0 still contains a > partially buggy workaround that tries to make it work on previous > versions of cpyext. The trunk version at https://github.com/lxml/lxml > has got this code removed, and that's the version that works. > > I'll make the ppc releases once the other releases are out. > > > A bient?t, > > Armin > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From sshakur.shamss at gmail.com Wed Mar 16 12:56:29 2016 From: sshakur.shamss at gmail.com (Shakur Shams) Date: Wed, 16 Mar 2016 22:56:29 +0600 Subject: [pypy-dev] GSoC 2016: Interested to work on the idea :Make bytearray type fast" Message-ID: Hi, I am Shakur Shams Mullick. I would like to participate in GSoC 2016 with PyPy. I have gone through the ideas list and would like to work on the idea to improve bytearray to perform fast ( http://doc.pypy.org/en/latest/project-ideas.html#make-bytearray-type-fast). I would like to work on this but I don't have any prior experience with PyPy. I have work experience as a professional python developer at a startup for about a year and I recently submitted a patch for cpython (not merged yet) and reported a bug. Previously I worked on util-linux also. Because I do not have prior knowledge of PyPy, I am not exactly sure how to implement this idea. That is why I would like to discuss this idea and would like someone to mentor me. Looking forward to your input. Thank you. Best regards, Shakur Shams Mullick -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From arigo at tunes.org Wed Mar 16 13:07:31 2016 From: arigo at tunes.org (Armin Rigo) Date: Wed, 16 Mar 2016 18:07:31 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: <56E98A85.6000503@sonnenglanz.net> References: <56E98A85.6000503@sonnenglanz.net> Message-ID: Hi Pim, On 16 March 2016 at 17:32, Pim van der Eijk (Lists) wrote: > Did the lxml project indicate they will provide a new release soon that > incorporates these fixes? You'll have to ask on the lxml mailing list. Armin From lists at sonnenglanz.net Thu Mar 17 11:13:16 2016 From: lists at sonnenglanz.net (Pim van der Eijk (Lists)) Date: Thu, 17 Mar 2016 16:13:16 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: <56E98A85.6000503@sonnenglanz.net> Message-ID: <56EAC98C.2040905@sonnenglanz.net> There is a new lxml release as of today, unfortunately there is an issue: https://bitbucket.org/pypy/pypy/issues/2260/pypy-500-dumps-core-with-lxml-360 On 16-03-16 18:07, Armin Rigo wrote: > Hi Pim, > > On 16 March 2016 at 17:32, Pim van der Eijk (Lists) > wrote: >> Did the lxml project indicate they will provide a new release soon that >> incorporates these fixes? > You'll have to ask on the lxml mailing list. > > Armin From arigo at tunes.org Thu Mar 17 12:27:56 2016 From: arigo at tunes.org (Armin Rigo) Date: Thu, 17 Mar 2016 17:27:56 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: <56EAC98C.2040905@sonnenglanz.net> References: <56E98A85.6000503@sonnenglanz.net> <56EAC98C.2040905@sonnenglanz.net> Message-ID: Hi, On 17 March 2016 at 16:13, Pim van der Eijk (Lists) wrote: > There is a new lxml release as of today, unfortunately there is an issue: > https://bitbucket.org/pypy/pypy/issues/2260/pypy-500-dumps-core-with-lxml-360 Yes, it's what we get when both sides (lxml and pypy) are half-hearted about supporting the other. The lxml tests seem to pass, but that may be because they are small. Many bigger and longer-running processes seem to crash like that. I'm investigating. A bient?t, Armin. From florin.papa at intel.com Fri Mar 18 03:57:35 2016 From: florin.papa at intel.com (Papa, Florin) Date: Fri, 18 Mar 2016 07:57:35 +0000 Subject: [pypy-dev] Refcount garbage collector build error Message-ID: <3A375A669FBEFF45B6B60E689636EDCA09B8D107@IRSMSX101.ger.corp.intel.com> Hi all, This is Florin Papa from the Dynamic Scripting Languages Team at Intel Corporation. I am trying to build pypy to use the refcount garbage collector, for testing purposes. I am following the indications here [1], but the following command fails: pypy ../../rpython/bin/rpython -O2 --gc=ref targetpypystandalone with the error: [translation:ERROR] OpErrFmt: [: No module named _weakref] When I run pypy in interactive mode, "import _weakref" works fine. I encounter the same error if I try to use python to run the rpython script. Is the refcount garbage collector still supported? [1] http://doc.pypy.org/en/latest/config/translation.gc.html Regards, Florin -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From fijall at gmail.com Fri Mar 18 04:37:21 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Fri, 18 Mar 2016 10:37:21 +0200 Subject: [pypy-dev] Refcount garbage collector build error In-Reply-To: <3A375A669FBEFF45B6B60E689636EDCA09B8D107@IRSMSX101.ger.corp.intel.com> References: <3A375A669FBEFF45B6B60E689636EDCA09B8D107@IRSMSX101.ger.corp.intel.com> Message-ID: Hi Florin The refcount garbage collector is only marginally supported (as far as our tests go), it's definitely neither tested nor really supported when translated, it was always very slow for example. (and as you noticed, there is no support for weakrefs for example) On Fri, Mar 18, 2016 at 9:57 AM, Papa, Florin wrote: > Hi all, > > > > This is Florin Papa from the Dynamic Scripting Languages Team at Intel > Corporation. > > > > I am trying to build pypy to use the refcount garbage collector, for testing > purposes. I am following the indications here [1], but the following command > fails: > > > > pypy ../../rpython/bin/rpython -O2 --gc=ref targetpypystandalone > > > > with the error: > > > > [translation:ERROR] OpErrFmt: [ 0x89a68a8>: No module named _weakref] > > > > When I run pypy in interactive mode, ?import _weakref? works fine. I > encounter the same error if I try to use python to run the rpython script. > Is the refcount garbage collector still supported? > > > > [1] http://doc.pypy.org/en/latest/config/translation.gc.html > > > > Regards, > > Florin > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From arigo at tunes.org Fri Mar 18 07:13:51 2016 From: arigo at tunes.org (Armin Rigo) Date: Fri, 18 Mar 2016 12:13:51 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: <56E98A85.6000503@sonnenglanz.net> <56EAC98C.2040905@sonnenglanz.net> Message-ID: Hi again, On 17 March 2016 at 17:27, Armin Rigo wrote: > On 17 March 2016 at 16:13, Pim van der Eijk (Lists) > wrote: >> There is a new lxml release as of today, unfortunately there is an issue: >> https://bitbucket.org/pypy/pypy/issues/2260/pypy-500-dumps-core-with-lxml-360 > > Yes, it's what we get when both sides (lxml and pypy) are half-hearted > about supporting the other. The lxml tests seem to pass, but that may > be because they are small. Many bigger and longer-running processes > seem to crash like that. I'm investigating. Fixed in 0173cdbbbacc, which then seems to work with lxml even on these larger examples. I'd love some more testing before we do the 5.0.1 bugfix release. Please try with a version of PyPy on the "release-5.x" branch recent enough to contain a09a60a9c381; an Ubuntu precompiled version is here: http://buildbot.pypy.org/nightly/release-5.x/pypy-c-jit-83125-a09a60a9c381-linux64.tar.bz2 A bient?t, Armin. From lists at sonnenglanz.net Fri Mar 18 08:25:23 2016 From: lists at sonnenglanz.net (Pim van der Eijk (Lists)) Date: Fri, 18 Mar 2016 13:25:23 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: <56E98A85.6000503@sonnenglanz.net> <56EAC98C.2040905@sonnenglanz.net> Message-ID: <56EBF3B3.5030903@sonnenglanz.net> Hi, I did some tests and there are no crashes. However, compared to CPython 2.7.10 there are some serious issues: - For my test programs (the script in the issue on BitBucket is derived from one of them), PyPy is much slower. 
script A: 256 seconds in PyPy versus 78 seconds in CPython script B: 9.73 seconds in PyPy versus 2.6 in Cpython - Memory use continues to grow up to over 80% at which time where my laptop starts swapping, whereas with CPython usage is never more than 4%. - Perhaps caused by the above, there are occasional freezes of several seconds in which nothing seems to happen, although CPU usage is still 100%. Kind Regards, Pim On 18-03-16 12:13, Armin Rigo wrote: > Hi again, > > On 17 March 2016 at 17:27, Armin Rigo wrote: >> On 17 March 2016 at 16:13, Pim van der Eijk (Lists) >> wrote: >>> There is a new lxml release as of today, unfortunately there is an issue: >>> https://bitbucket.org/pypy/pypy/issues/2260/pypy-500-dumps-core-with-lxml-360 >> Yes, it's what we get when both sides (lxml and pypy) are half-hearted >> about supporting the other. The lxml tests seem to pass, but that may >> be because they are small. Many bigger and longer-running processes >> seem to crash like that. I'm investigating. > Fixed in 0173cdbbbacc, which then seems to work with lxml even on > these larger examples. I'd love some more testing before we do the > 5.0.1 bugfix release. Please try with a version of PyPy on the > "release-5.x" branch recent enough to contain a09a60a9c381; an Ubuntu > precompiled version is here: > > http://buildbot.pypy.org/nightly/release-5.x/pypy-c-jit-83125-a09a60a9c381-linux64.tar.bz2 > > > A bient?t, > > Armin. From arigo at tunes.org Fri Mar 18 09:57:15 2016 From: arigo at tunes.org (Armin Rigo) Date: Fri, 18 Mar 2016 14:57:15 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: <56EBF3B3.5030903@sonnenglanz.net> References: <56E98A85.6000503@sonnenglanz.net> <56EAC98C.2040905@sonnenglanz.net> <56EBF3B3.5030903@sonnenglanz.net> Message-ID: Hi Pim, On 18 March 2016 at 13:25, Pim van der Eijk (Lists) wrote: > - For my test programs (the script in the issue on BitBucket is derived > from one of them), PyPy is much slower. If you're comparing the speed of scripts that have a large amount of crossings of the cpyext layer (i.e. crossings between Python code and CPython C extension code), then yes, it's expected to be much slower. The speed improved a lot recently, which means it is now *much slower* instead of *very, very much slower*. It makes no sense, now or in the future, to use PyPy in the hope to speed up a script that does _only_ lxml stuff with almost no Python code running in-between. > - Memory use continues to grow up to over 80% at which time where my laptop > starts swapping, whereas with CPython usage is never more than 4%. This is more annoying. Can you give us a way to reproduce this? Armin From lists at sonnenglanz.net Fri Mar 18 10:08:10 2016 From: lists at sonnenglanz.net (Pim van der Eijk (Lists)) Date: Fri, 18 Mar 2016 15:08:10 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: <56E98A85.6000503@sonnenglanz.net> <56EAC98C.2040905@sonnenglanz.net> <56EBF3B3.5030903@sonnenglanz.net> Message-ID: <56EC0BCA.2080606@sonnenglanz.net> On 18-03-16 14:57, Armin Rigo wrote: >> - Memory use continues to grow up to over 80% at which time where my laptop >> starts swapping, whereas with CPython usage is never more than 4%. > This is more annoying. Can you give us a way to reproduce this? 
It already happens with the script I attached to the original issue, which you already have: https://bitbucket.org/pypy/pypy/issues/2260/pypy-500-dumps-core-with-lxml-360 From tobias.oberstein at tavendo.de Fri Mar 18 13:08:21 2016 From: tobias.oberstein at tavendo.de (Tobias Oberstein) Date: Fri, 18 Mar 2016 18:08:21 +0100 Subject: [pypy-dev] Crossbar.io / AutobahnPython 0.13.0 In-Reply-To: <56EC34E6.6070904@gmail.com> References: <56EC34E6.6070904@gmail.com> Message-ID: <56EC3605.7050108@tavendo.de> Hi, we've released Crossbar.io and AutobahnPython 0.13.0, running on Twisted 16.0.0 and PyPy 5.0. Get it here: Source: * https://github.com/crossbario/crossbar * https://github.com/crossbario/autobahn-python Python Packages: * https://pypi.python.org/pypi/crossbar * https://pypi.python.org/pypi/autobahn Binary Packages (recommended) * http://crossbar.io/docs/Local-Installation/ The binary packages contain a complete, self-contained, optimized Crossbar.io with everything - including PyPy 5.0, and of course based on Twisted 16.0.0! These packages are available for Ubuntu, FreeBSD and CentOS. (thanks to Hawkie, Miss Amber Brown - she made that happen;) ) Cheers, /Tobias -------------- next part -------------- A non-text attachment was scrubbed... Name: Pasted image at 2016_03_18 05_41 PM.png Type: image/png Size: 170041 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Bildschirmfoto vom 2016-03-18 17:46:03.png Type: image/png Size: 221883 bytes Desc: not available URL: From arigo at tunes.org Fri Mar 18 13:52:18 2016 From: arigo at tunes.org (Armin Rigo) Date: Fri, 18 Mar 2016 18:52:18 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: <56EC0BCA.2080606@sonnenglanz.net> References: <56E98A85.6000503@sonnenglanz.net> <56EAC98C.2040905@sonnenglanz.net> <56EBF3B3.5030903@sonnenglanz.net> <56EC0BCA.2080606@sonnenglanz.net> Message-ID: Hi Pim, On 18 March 2016 at 15:08, Pim van der Eijk (Lists) wrote: >>> - Memory use continues to grow up to over 80% at which time where my >>> laptop >>> starts swapping, whereas with CPython usage is never more than 4%. >> >> This is more annoying. Can you give us a way to reproduce this? > > It already happens with the script I attached to the original issue, which > you already have: > https://bitbucket.org/pypy/pypy/issues/2260/pypy-500-dumps-core-with-lxml-360 Ok, partially reproduced. With CPython it grows continously too, but only up to 1.2GB and then it finishes. With PyPy it grows faster up to 22GB. If I add some "gc.collect()" executed every few seconds, then PyPy only grows up to 1.7GB. I added "add_memory_pressure=True" to some chosen mallocs inside cpyext, and it seems to be enough to fix the problem. Now PyPy grows up to 1.7GB even without any gc.collect(). Yay! (changeset 9137853fd0ec, grafted to release-5.x too) A bient?t, Armin. 
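[Archive editor's note: for anyone stuck on a PyPy build that predates the add_memory_pressure fix Armin describes above, the "gc.collect() every few seconds" workaround he mentions can be applied from application code. A minimal sketch, assuming a hypothetical parse_one() helper that does the lxml/cpyext-heavy work for a single input; only the periodic collection is the point here:]

    import gc
    import time

    def process_documents(paths, parse_one):
        # parse_one is a placeholder for the lxml-heavy work on one file.
        last_collect = time.time()
        for path in paths:
            parse_one(path)
            # Without the memory-pressure hints, PyPy's GC underestimates how
            # much memory the C extension is holding, so force a major
            # collection every few seconds to keep the footprint bounded.
            if time.time() - last_collect > 5.0:
                gc.collect()
                last_collect = time.time()
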
From lists at sonnenglanz.net Sun Mar 20 04:20:59 2016 From: lists at sonnenglanz.net (Pim van der Eijk (Lists)) Date: Sun, 20 Mar 2016 09:20:59 +0100 Subject: [pypy-dev] release seems ready In-Reply-To: References: <56E98A85.6000503@sonnenglanz.net> <56EAC98C.2040905@sonnenglanz.net> <56EBF3B3.5030903@sonnenglanz.net> <56EC0BCA.2080606@sonnenglanz.net> Message-ID: <56EE5D6B.4090701@sonnenglanz.net> Hi Armin, On 18-03-16 18:52, Armin Rigo wrote: > Hi Pim, > > On 18 March 2016 at 15:08, Pim van der Eijk (Lists) > wrote: >>>> - Memory use continues to grow up to over 80% at which time where my >>>> laptop >>>> starts swapping, whereas with CPython usage is never more than 4%. >>> This is more annoying. Can you give us a way to reproduce this? >> It already happens with the script I attached to the original issue, which >> you already have: >> https://bitbucket.org/pypy/pypy/issues/2260/pypy-500-dumps-core-with-lxml-360 > Ok, partially reproduced. With CPython it grows continously too, but > only up to 1.2GB and then it finishes. With PyPy it grows faster up > to 22GB. If I add some "gc.collect()" executed every few seconds, > then PyPy only grows up to 1.7GB. > > I added "add_memory_pressure=True" to some chosen mallocs inside > cpyext, and it seems to be enough to fix the problem. Now PyPy grows > up to 1.7GB even without any gc.collect(). Yay! (changeset > 9137853fd0ec, grafted to release-5.x too) > I retested and confirm that the library works and memory use is now like CPython, which is great. It is still slower than CPython, for reasons you explained before, but that is because my test script heavily uses of lxml. In larger applications where lxml processing is a smaller part of the overall functionality, the PyPy speed-up of regular Python code could well compensate for this. Many thanks, Pim > A bient?t, > > Armin. From tinchester at gmail.com Sun Mar 20 21:43:05 2016 From: tinchester at gmail.com (=?UTF-8?Q?Tin_Tvrtkovi=C4=87?=) Date: Mon, 21 Mar 2016 01:43:05 +0000 Subject: [pypy-dev] PyPy Ubuntu PPA + a cpyext question Message-ID: Hello, first question: is the PyPy Ubuntu PPA still a maintained thing? I'm not demanding free labor here, just curious whether I should wait a little for 5.0 to show up there or change my Dockerfiles to direct download. second question: does PyPy support PyByteArray_CheckExact? I seem to have some Cython-generated code using it and PyPy seems to be refusing to import the resulting module. Cheers! -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Mon Mar 21 03:53:23 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Mon, 21 Mar 2016 09:53:23 +0200 Subject: [pypy-dev] PyPy Ubuntu PPA + a cpyext question In-Reply-To: References: Message-ID: PPA is usually updated, but as you said we can't demand deadlines PyByteArray_Check and PyByteArray_CheckExact are not implemented On Mon, Mar 21, 2016 at 3:43 AM, Tin Tvrtkovi? wrote: > Hello, > > first question: is the PyPy Ubuntu PPA still a maintained thing? I'm not > demanding free labor here, just curious whether I should wait a little for > 5.0 to show up there or change my Dockerfiles to direct download. > > second question: does PyPy support PyByteArray_CheckExact? I seem to have > some Cython-generated code using it and PyPy seems to be refusing to import > the resulting module. > > Cheers! 
> > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From tinchester at gmail.com Mon Mar 21 05:42:01 2016 From: tinchester at gmail.com (=?UTF-8?Q?Tin_Tvrtkovi=C4=87?=) Date: Mon, 21 Mar 2016 10:42:01 +0100 Subject: [pypy-dev] PyPy Ubuntu PPA + a cpyext question In-Reply-To: References: Message-ID: Thanks for the quick reply (as always). We'll stick with the PPA. About PyByteArray_CheckExact, any chance of it getting implemented in this next round of C-API extensions? Looking in the CPython source, it seems to be a one-line macro: #define PyByteArray_CheckExact(self) (Py_TYPE(self) == &PyByteArray_Type) but I admit to knowing basically nothing about this level of code. :) I figure asking here whether it can be implemented will be better than asking Cython to stop using it ;) Cheers! On Mon, Mar 21, 2016 at 8:53 AM, Maciej Fijalkowski wrote: > PPA is usually updated, but as you said we can't demand deadlines > > PyByteArray_Check and PyByteArray_CheckExact are not implemented > > On Mon, Mar 21, 2016 at 3:43 AM, Tin Tvrtkovi? > wrote: > > Hello, > > > > first question: is the PyPy Ubuntu PPA still a maintained thing? I'm not > > demanding free labor here, just curious whether I should wait a little > for > > 5.0 to show up there or change my Dockerfiles to direct download. > > > > second question: does PyPy support PyByteArray_CheckExact? I seem to have > > some Cython-generated code using it and PyPy seems to be refusing to > import > > the resulting module. > > > > Cheers! > > > > _______________________________________________ > > pypy-dev mailing list > > pypy-dev at python.org > > https://mail.python.org/mailman/listinfo/pypy-dev > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From matti.picus at gmail.com Mon Mar 21 05:59:26 2016 From: matti.picus at gmail.com (Matti Picus) Date: Mon, 21 Mar 2016 11:59:26 +0200 Subject: [pypy-dev] PyPy Ubuntu PPA + a cpyext question In-Reply-To: References: Message-ID: <56EFC5FE.3030408@gmail.com> On 21/03/16 11:42, Tin Tvrtkovi? wrote: > Thanks for the quick reply (as always). > > We'll stick with the PPA. > > About PyByteArray_CheckExact, any chance of it getting implemented in > this next round of C-API extensions? Looking in the CPython source, it > seems to be a one-line macro: > > #define PyByteArray_CheckExact(self) (Py_TYPE(self) == &PyByteArray_Type) > > but I admit to knowing basically nothing about this level of code. :) > I figure asking here whether it can be implemented will be better than > asking Cython to stop using it ;) > > Cheers! > > mailing list > > > pypy-dev at python.org > > https://mail.python.org/mailman/listinfo/pypy-dev > > > > > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev While true, that would only get you to the next step, which is that much of the functionality of PyByteArray_Type is not implemented. See for instance the functions in cpyext/stubs.py or commit 16f119c9be67 which added a failing test for PyArg_ParseTuple, s*, and ByteArrays. If we were to push the CheckExact forward, what functionality is critical for cython to completely compile your module? 
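[Archive editor's note: for readers not familiar with the C-API naming, a rough Python-level rendering of the two checks under discussion follows. It is only an illustration of their semantics, not PyPy or cpyext code — as noted above, the real work is implementing bytearray support throughout cpyext, not these one-liners.]

    def bytearray_check_exact(obj):
        # C-level PyByteArray_CheckExact: the exact built-in type only
        return type(obj) is bytearray

    def bytearray_check(obj):
        # C-level PyByteArray_Check: subclasses of bytearray are accepted too
        return isinstance(obj, bytearray)
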
Matti From kunalgrover05 at gmail.com Mon Mar 21 10:29:22 2016 From: kunalgrover05 at gmail.com (Kunal Grover) Date: Mon, 21 Mar 2016 19:59:22 +0530 Subject: [pypy-dev] STM improvements GSoC project Message-ID: Hi, I am interested in improvements in PyPy-STM as a GSoC project. I have discussed some ideas with Remi, and put them down here in https://docs.google.com/document/d/1ZXORu2qgX6EixCWTb--HRMIFYauoWtJIGKkTlFi8DuY/edit . It would be great if you could comment here giving your suggestions regarding the same. Also, I am unsure about how to make vmprof work with this STM, and what is the complexity involved in that. Anyone can give suggestions about the same? Thank you. Kunal -------------- next part -------------- An HTML attachment was scrubbed... URL: From njs at pobox.com Tue Mar 22 02:41:02 2016 From: njs at pobox.com (Nathaniel Smith) Date: Mon, 21 Mar 2016 23:41:02 -0700 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year Message-ID: Hi all, I wanted to announce a workshop I'm organizing at SciPy this year, and invite you to attend! What: A two-day workshop bringing together folks working on JIT/AOT compilation in Python. When/where: July 11-12, in Austin, Texas. (This is co-located with SciPy 2016, at the same time as the tutorial sessions, just before the conference proper.) Website: https://python-compilers-workshop.github.io/ Note that I anticipate that we'll be able to get sponsorship funding to cover travel costs for folks who can't get their employers to foot the bill. Cheers, -n -- Nathaniel J. Smith -- https://vorpus.org From bg379 at cornell.edu Tue Mar 22 14:56:23 2016 From: bg379 at cornell.edu (Brian Guo) Date: Tue, 22 Mar 2016 14:56:23 -0400 Subject: [pypy-dev] GSoC: Updates on ByteArray? Message-ID: Hi, My name is Brian Guo and I am currently an undergraduate at Cornell University. I am very interested in working with PyPy as part of Google's Summer of Code. In particular, I am interested in working on the bytearray project. I noticed that the current status of the ByteArray project is unknown, but that there may be updates on the mailing list. I am wondering if there is any information I may be able to read on this project, or possibly an overview of the project itself and the proposed changes that would make byteArray faster (if any have been proposed yet). I am very grateful to anyone who is able to point me in the right direction in regards to this project. Thank you all for your time, -Brian Guo -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Tue Mar 22 15:36:25 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Tue, 22 Mar 2016 21:36:25 +0200 Subject: [pypy-dev] GSoC: Updates on ByteArray? In-Reply-To: References: Message-ID: Hi Brian bytearray should be optimized for cases where you e.g. write() it to file or use read_into() in a way that does not make any copies. Same if you say convert it from ffi.buffer etc. That's probably what's missing from making it fast On Tue, Mar 22, 2016 at 8:56 PM, Brian Guo wrote: > Hi, > > My name is Brian Guo and I am currently an undergraduate at Cornell > University. I am very interested in working with PyPy as part of Google's > Summer of Code. In particular, I am interested in working on the bytearray > project. I noticed that the current status of the ByteArray project is > unknown, but that there may be updates on the mailing list. 
I am wondering > if there is any information I may be able to read on this project, or > possibly an overview of the project itself and the proposed changes that > would make byteArray faster (if any have been proposed yet). I am very > grateful to anyone who is able to point me in the right direction in regards > to this project. > > Thank you all for your time, > > -Brian Guo > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From john.m.camara at gmail.com Wed Mar 23 14:16:37 2016 From: john.m.camara at gmail.com (John Camara) Date: Wed, 23 Mar 2016 14:16:37 -0400 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year Message-ID: Hi Nathaniel, I would like to suggest one more topic for the workshop. I see a big need for a library (jffi) similar to cffi but that provides a bridge to Java instead of C code. The ability to seamlessly work with native Java data/code would offer a huge improvement when python code needs to work with the Spark/Hadoop ecosystem. The current mechanisms which involve serializing data to/from Java can kill performance for some applications and can render Python unsuitable for these cases. John -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Wed Mar 23 14:47:46 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Wed, 23 Mar 2016 20:47:46 +0200 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year In-Reply-To: References: Message-ID: Hi John I understand why you're bringing this up, but it's a huge project on it's own, worth at least a couple months worth of work. Without a dedicated effort from someone I'm worried it would not go anywhere. It's kind of separated from the other goals of the summit On Wed, Mar 23, 2016 at 8:16 PM, John Camara wrote: > Hi Nathaniel, > > I would like to suggest one more topic for the workshop. I see a big need > for a library (jffi) similar to cffi but that provides a bridge to Java > instead of C code. The ability to seamlessly work with native Java data/code > would offer a huge improvement when python code needs to work with the > Spark/Hadoop ecosystem. The current mechanisms which involve serializing > data to/from Java can kill performance for some applications and can render > Python unsuitable for these cases. > > John > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From john.m.camara at gmail.com Wed Mar 23 16:22:30 2016 From: john.m.camara at gmail.com (John Camara) Date: Wed, 23 Mar 2016 16:22:30 -0400 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year In-Reply-To: References: Message-ID: Hi Fijal, I agree that jffi would be both a large project and without someone leading it, it would likely not get any where. But I tend to disagree that it would be a separate goal for the conference. I realize the goal of the summit is to talk about native-code compilation for Python and most would argue that means executing C code, assembly, or at the very least executing code at the speed of "C code". But the reality now is, numerical/scientific programming increasingly needs executing in a clustered environment. So I think we need to be careful to not only solve yesterday's problems but make sure we are covering the current day and future ones. 
Today, big data and analytics, which is driving most numerical/scientific programming, is becoming almost exclusively run in a clustered environment, with the Apache Spark ecosystem as the de facto standard. A few years back, Python's ace up its sleeve for the scientific community was the numpy/scipy ecosystem but we have recently lost that edge by falling behind in clustered computing. At this point in time our best move forward on the numerical/scientific fronts is to become best buddies with the Spark ecosystem and make sure we can bring bridge the numpy/scipy ecosystem to it. That is we merge the best of both worlds and suddenly Python becomes to go to language again for numerical/scientific computing. Of course we still need to address what should have been yesterday's problem and deal with the "native-code compilation" issues. John On Wed, Mar 23, 2016 at 2:47 PM, Maciej Fijalkowski wrote: > Hi John > > I understand why you're bringing this up, but it's a huge project on > it's own, worth at least a couple months worth of work. Without a > dedicated effort from someone I'm worried it would not go anywhere. > It's kind of separated from the other goals of the summit > > On Wed, Mar 23, 2016 at 8:16 PM, John Camara > wrote: > > Hi Nathaniel, > > > > I would like to suggest one more topic for the workshop. I see a big need > > for a library (jffi) similar to cffi but that provides a bridge to Java > > instead of C code. The ability to seamlessly work with native Java > data/code > > would offer a huge improvement when python code needs to work with the > > Spark/Hadoop ecosystem. The current mechanisms which involve serializing > > data to/from Java can kill performance for some applications and can > render > > Python unsuitable for these cases. > > > > John > > > > _______________________________________________ > > pypy-dev mailing list > > pypy-dev at python.org > > https://mail.python.org/mailman/listinfo/pypy-dev > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From arigo at tunes.org Wed Mar 23 16:48:56 2016 From: arigo at tunes.org (Armin Rigo) Date: Wed, 23 Mar 2016 21:48:56 +0100 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year In-Reply-To: References: Message-ID: Hi John, On 23 March 2016 at 19:16, John Camara wrote: > I would like to suggest one more topic for the workshop. I see a big need > for a library (jffi) similar to cffi but that provides a bridge to Java > instead of C code. The ability to seamlessly work with native Java data/code > would offer a huge improvement (...) Isn't it what JPype does? Can you describe how it isn't suitable for your needs? A bient?t, Armin. From arigo at tunes.org Wed Mar 23 17:39:06 2016 From: arigo at tunes.org (Armin Rigo) Date: Wed, 23 Mar 2016 22:39:06 +0100 Subject: [pypy-dev] Copy-on-write list slicing as GSoC project In-Reply-To: References: Message-ID: Hi Nikolay, On 14 March 2016 at 19:15, ??????? ????? wrote: > I found implementing copy-on-write list slicing particularly interesting for > me. Below go my ideas. Note, that at some places I see different possible > choices so I need feedback. Thanks for the early proposal; you should submit it to google's system very soon. I'm sorry it didn't receive more active feedback from the main mentors. One of the reasons is that this is likely more involved than you describe. In order to efficiently implement copy-on-write list slicing, we would need some special GC support. 
Otherwise, as you describe, there is the problem that as soon as there exist a slice anywhere, we cannot any more modify a big list without making a copy of the whole list. Moreover, there is also the issue that if 'mylist[1:5]' is kept alive, then the whole 'mylist' is also kept alive, even if it would not be necessary; this can consume some extra memory but more importantly it can delay calling destructors for arbitrarily long periods of time. So, serious work on this topic should start with designing a usable GC interface which fixes these problems; a bit like weakrefs, which are a general GC interface. The problem is that we don't really know what such an interface could look like. A bient?t, Armin. From lac at openend.se Wed Mar 23 18:05:32 2016 From: lac at openend.se (Laura Creighton) Date: Wed, 23 Mar 2016 23:05:32 +0100 Subject: [pypy-dev] [Jython-dev] [ANN] Python compilers workshop at SciPy this year (fwd) Message-ID: <201603232205.u2NM5W16016319@theraft.openend.se> This from the Jython mailing list. Are we sending somebody? It's the first I heard about it, at any rate. Laura ------- Forwarded Message Return-Path: Received: from lists.sourceforge.net (lists.sourceforge.net [216.34.181.88]) From: Nathaniel Smith To: jython-dev at lists.sourceforge.net Subject: [Jython-dev] [ANN] Python compilers workshop at SciPy this year Hi Jython folks, I wanted to give a heads-up to a workshop I'm organizing at SciPy this year that might be of interest to you: What: A two-day workshop bringing together folks working on JIT/AOT compilation in Python. When/where: July 11-12, in Austin, Texas. (This is co-located with SciPy 2016, at the same time as the tutorial sessions, just before the conference proper.) Website: https://python-compilers-workshop.github.io/ Note that I anticipate that we'll be able to get sponsorship funding to cover travel costs for folks who can't get their employers to foot the bill. Cheers, - -n - -- Nathaniel J. Smith -- https://vorpus.org - ------------------------------------------------------------------------------ Transform Data into Opportunity. Accelerate data analysis in your applications with Intel Data Analytics Acceleration Library. Click to learn more. http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140 _______________________________________________ Jython-dev mailing list Jython-dev at lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/jython-dev ------- End of Forwarded Message From fijall at gmail.com Wed Mar 23 18:16:35 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 24 Mar 2016 00:16:35 +0200 Subject: [pypy-dev] [Jython-dev] [ANN] Python compilers workshop at SciPy this year (fwd) In-Reply-To: <201603232205.u2NM5W16016319@theraft.openend.se> References: <201603232205.u2NM5W16016319@theraft.openend.se> Message-ID: We're probably sending myself and matti On Thu, Mar 24, 2016 at 12:05 AM, Laura Creighton wrote: > This from the Jython mailing list. Are we sending somebody? It's the > first I heard about it, at any rate. 
> > Laura > > ------- Forwarded Message > > Return-Path: > Received: from lists.sourceforge.net (lists.sourceforge.net [216.34.181.88]) > From: Nathaniel Smith > To: jython-dev at lists.sourceforge.net > Subject: [Jython-dev] [ANN] Python compilers workshop at SciPy this year > > Hi Jython folks, > > I wanted to give a heads-up to a workshop I'm organizing at SciPy this > year that might be of interest to you: > > What: A two-day workshop bringing together folks working on JIT/AOT > compilation in Python. > > When/where: July 11-12, in Austin, Texas. > > (This is co-located with SciPy 2016, at the same time as the tutorial > sessions, just before the conference proper.) > > Website: https://python-compilers-workshop.github.io/ > > Note that I anticipate that we'll be able to get sponsorship funding > to cover travel costs for folks who can't get their employers to foot > the bill. > > Cheers, > - -n > > - -- > Nathaniel J. Smith -- https://vorpus.org > > - ------------------------------------------------------------------------------ > Transform Data into Opportunity. > Accelerate data analysis in your applications with > Intel Data Analytics Acceleration Library. > Click to learn more. > http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140 > _______________________________________________ > Jython-dev mailing list > Jython-dev at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/jython-dev > > ------- End of Forwarded Message > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev From nzinov at gmail.com Thu Mar 24 02:03:14 2016 From: nzinov at gmail.com (=?UTF-8?B?0J3QuNC60L7Qu9Cw0Lkg0JfQuNC90L7Qsg==?=) Date: Thu, 24 Mar 2016 06:03:14 +0000 Subject: [pypy-dev] Copy-on-write list slicing as GSoC project In-Reply-To: References: Message-ID: Hi Armin, Thanks for your feedback. As you mention there needed some more research on this problem, so I think I should not apply for GSoC and rather do some work out of its scope. A special-case GC interface is interesting direction and I am going to take a look at weekrefs. Cheers, Nikolay. ??, 24 ???. 2016 ?. ? 0:39, Armin Rigo : > Hi Nikolay, > > On 14 March 2016 at 19:15, ??????? ????? wrote: > > I found implementing copy-on-write list slicing particularly interesting > for > > me. Below go my ideas. Note, that at some places I see different possible > > choices so I need feedback. > > Thanks for the early proposal; you should submit it to google's system > very soon. I'm sorry it didn't receive more active feedback from the > main mentors. One of the reasons is that this is likely more involved > than you describe. > > In order to efficiently implement copy-on-write list slicing, we would > need some special GC support. Otherwise, as you describe, there is > the problem that as soon as there exist a slice anywhere, we cannot > any more modify a big list without making a copy of the whole list. > Moreover, there is also the issue that if 'mylist[1:5]' is kept alive, > then the whole 'mylist' is also kept alive, even if it would not be > necessary; this can consume some extra memory but more importantly it > can delay calling destructors for arbitrarily long periods of time. > > So, serious work on this topic should start with designing a usable GC > interface which fixes these problems; a bit like weakrefs, which are a > general GC interface. The problem is that we don't really know what > such an interface could look like. 
> > > A bient?t, > > Armin. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hakan.ardo at gmail.com Thu Mar 24 02:32:48 2016 From: hakan.ardo at gmail.com (Hakan Ardo) Date: Thu, 24 Mar 2016 07:32:48 +0100 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year In-Reply-To: References: Message-ID: On Mar 23, 2016 21:49, "Armin Rigo" wrote: > > Hi John, > > On 23 March 2016 at 19:16, John Camara wrote: > > I would like to suggest one more topic for the workshop. I see a big need > > for a library (jffi) similar to cffi but that provides a bridge to Java > > instead of C code. The ability to seamlessly work with native Java data/code > > would offer a huge improvement (...) > > Isn't it what JPype does? Can you describe how it isn't suitable for > your needs? There is also PyJNIus: https://pyjnius.readthedocs.org/en/latest/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.m.camara at gmail.com Thu Mar 24 08:22:33 2016 From: john.m.camara at gmail.com (John Camara) Date: Thu, 24 Mar 2016 08:22:33 -0400 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year In-Reply-To: References: Message-ID: Besides JPype and PyJNIus there is also https://www.py4j.org/. I haven't heard of JPype being used in any recent projects so I assuming it is outdated by now. PyJNIus gets used but I tend to only see it used on Android projects. The Py4J project gets used often in numerical/scientific projects mainly due to it use in PySpark. The problem with all these libraries is that they don't have a way to share large amounts of memory between the JVM and Python VMs and so large chunks of data have to be copied/serialized when going between the 2 VMs. Spark is the de facto standard in clustering computing at this point in time. At a high level Spark executes code that is distributed throughout a cluster so that the code being executed is as close as possible to where the data lives so as to minimize transferring of large amounts of data. The code that needs to be executed are packaged up into units called Resilient Distributed Dataset (RDD). RDDs are lazy evaluated and are essential graphs of the operations that need to be performed on the data. They are capable of reading data from many types of sources, outputting to multiple types of sources, containing the code that needs to be executed, and are also responsible to caching or keeping results in memory for future RDDs that maybe executed. If you write all your code in Java or Scala, its execution will be performed in JVMs distributed in the cluster. On the other hand, Spark does not limit its use to only Java based languages so Python can be used. In the case of Python the PySpark library is used. When Python is used, the PySpark library can be used to define the RDDs that will be executed under the JVM. In this scenario, only if required, the final results of the calculations will end up being passed to Python. I say only if necessary as its possible the end results may just be left in memory or to create an output such as an hdfs file in hadoop and does not need to be transferred to Python. Under this scenario the code is written in Python but effectively all the "real" work is performed under the JVM. Often someone writing Python is also going to want to perform some of the operations under Python. 
This can be done as the RDDs that are created can contain both operations that get performed under the JVM as well as Python (and of course other languages are supported). When Python is involved Spark will start up Python VMs on the required nodes so that the Python portions of the work can be performed. The Python VMs can either be CPython, PyPy or even a mix of both CPython and PyPy. The downside to using non Java languages is the overhead of passing data between the JVM and the Python VM as the memory is not shared between the processes but instead copied/serialized between them. Because this data is copied between the 2 VMs, anyone who writes Python code for this environment always has to be conscious of the data being copied between the processes so as to not let the amount of the extra overhead become a large burden. Quite often the goal will be to first perform the bulk of the operations under the JVM and then hopefully only a smaller subset of the data will have to be processed under Python. If this can be done then the overhead can be minimized and then there is essential no down sides to using Python in the pipeline of operations. If your unfortunate and need to perform some of the processing early in the pipline under Python and worse yet if there is a need to go back and forth many times between Python and Java the overhead of coping huge amounts of data can significantly slow things down which essentially puts Python at a disadvantage to Java. If it was possible to change the model of execution such that it was possible to embed the Python VM in the JVM or vice versa and that the memory could be shared between the 2 VMs the downside of using Python in this environment would be eliminated or at the very least minimized to the point where it is no longer an issue. Thus the need for a jffi library. There is a strong desire by many to use dynamic languages in these clustered environments and Python is likely in the best position to become the language of choice due to its ability to work with C based libraries and of course its syntax. The issues that hold Python back at this point is the serialization overhead, not so great state of packaging, and not having both the speed of the JIT and complete access to numpy/scipy ecosystem. Luckily for Python at this point there is no other dynamic language that is a clear winner today. But if too much time passes before these issues are solved I'm sure another language will step up to the plate. At this point my expectations is that Node could likely make a move. It already has the speed due to the Java Script JITs, it already has a great story for packaging and deployment, and its growth is exploding on the server side due to all the money being poured into it. What it strongly lacks today is the connection to C/legacy code, numerical/scientific modules and of course it also does not have a solution to the data copying overhead it also has with the JVM. Any way, this is just my 2 cents on what is currently holding Python back from taking off in this space. On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo wrote: > > On Mar 23, 2016 21:49, "Armin Rigo" wrote: > > > > Hi John, > > > > On 23 March 2016 at 19:16, John Camara wrote: > > > I would like to suggest one more topic for the workshop. I see a big > need > > > for a library (jffi) similar to cffi but that provides a bridge to Java > > > instead of C code. The ability to seamlessly work with native Java > data/code > > > would offer a huge improvement (...) > > > > Isn't it what JPype does? 
Can you describe how it isn't suitable for > > your needs? > > There is also PyJNIus: > > https://pyjnius.readthedocs.org/en/latest/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fijall at gmail.com Thu Mar 24 08:56:53 2016 From: fijall at gmail.com (Maciej Fijalkowski) Date: Thu, 24 Mar 2016 14:56:53 +0200 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year In-Reply-To: References: Message-ID: Hi John Thanks for explaining the current situation of the ecosystem. I'm not quite sure what your intention is. PyPy (and CPython) is very easy to embed through any C-level API, especially with the latest additions to cffi embedding. If someone feels like doing the work to share stuff that way (as I presume a lot of data presented in JVM can be represented as some pointer and shape how to access it), then he's obviously more than free to do so, I'm even willing to help with that. Now this seems like a medium-to-big size project that additionally will require quite a bit of community will to endorse. Are you willing to volunteer to work on such a project and dedicate a lot of time to it? If not, then there is no way you can convince us to volunteer our own time to do it - it's just too big and quite a bit far out of our usual areas of interest. If there is some commercial interest (and I think there might be) in pushing python and especially pypy further in that area, we might want to have a better story for numpy first, but then feel free to send those corporate interest people my way, we can maybe organize something. If you want us to do community service to push Python solutions in the area I have very little clue about however, I would like to politely decline. Cheers, fijal On Thu, Mar 24, 2016 at 2:22 PM, John Camara wrote: > Besides JPype and PyJNIus there is also https://www.py4j.org/. I haven't > heard of JPype being used in any recent projects so I assuming it is > outdated by now. PyJNIus gets used but I tend to only see it used on > Android projects. The Py4J project gets used often in numerical/scientific > projects mainly due to it use in PySpark. The problem with all these > libraries is that they don't have a way to share large amounts of memory > between the JVM and Python VMs and so large chunks of data have to be > copied/serialized when going between the 2 VMs. > > Spark is the de facto standard in clustering computing at this point in > time. At a high level Spark executes code that is distributed throughout a > cluster so that the code being executed is as close as possible to where the > data lives so as to minimize transferring of large amounts of data. The > code that needs to be executed are packaged up into units called Resilient > Distributed Dataset (RDD). RDDs are lazy evaluated and are essential graphs > of the operations that need to be performed on the data. They are capable > of reading data from many types of sources, outputting to multiple types of > sources, containing the code that needs to be executed, and are also > responsible to caching or keeping results in memory for future RDDs that > maybe executed. > > If you write all your code in Java or Scala, its execution will be performed > in JVMs distributed in the cluster. On the other hand, Spark does not limit > its use to only Java based languages so Python can be used. In the case of > Python the PySpark library is used. When Python is used, the PySpark > library can be used to define the RDDs that will be executed under the JVM. 
> In this scenario, only if required, the final results of the calculations > will end up being passed to Python. I say only if necessary as its possible > the end results may just be left in memory or to create an output such as an > hdfs file in hadoop and does not need to be transferred to Python. Under > this scenario the code is written in Python but effectively all the "real" > work is performed under the JVM. > > Often someone writing Python is also going to want to perform some of the > operations under Python. This can be done as the RDDs that are created can > contain both operations that get performed under the JVM as well as Python > (and of course other languages are supported). When Python is involved > Spark will start up Python VMs on the required nodes so that the Python > portions of the work can be performed. The Python VMs can either be > CPython, PyPy or even a mix of both CPython and PyPy. The downside to using > non Java languages is the overhead of passing data between the JVM and the > Python VM as the memory is not shared between the processes but instead > copied/serialized between them. > > Because this data is copied between the 2 VMs, anyone who writes Python code > for this environment always has to be conscious of the data being copied > between the processes so as to not let the amount of the extra overhead > become a large burden. Quite often the goal will be to first perform the > bulk of the operations under the JVM and then hopefully only a smaller > subset of the data will have to be processed under Python. If this can be > done then the overhead can be minimized and then there is essential no down > sides to using Python in the pipeline of operations. > > If your unfortunate and need to perform some of the processing early in the > pipline under Python and worse yet if there is a need to go back and forth > many times between Python and Java the overhead of coping huge amounts of > data can significantly slow things down which essentially puts Python at a > disadvantage to Java. > > If it was possible to change the model of execution such that it was > possible to embed the Python VM in the JVM or vice versa and that the memory > could be shared between the 2 VMs the downside of using Python in this > environment would be eliminated or at the very least minimized to the point > where it is no longer an issue. Thus the need for a jffi library. > > There is a strong desire by many to use dynamic languages in these clustered > environments and Python is likely in the best position to become the > language of choice due to its ability to work with C based libraries and of > course its syntax. The issues that hold Python back at this point is the > serialization overhead, not so great state of packaging, and not having both > the speed of the JIT and complete access to numpy/scipy ecosystem. > > Luckily for Python at this point there is no other dynamic language that is > a clear winner today. But if too much time passes before these issues are > solved I'm sure another language will step up to the plate. At this point > my expectations is that Node could likely make a move. It already has the > speed due to the Java Script JITs, it already has a great story for > packaging and deployment, and its growth is exploding on the server side due > to all the money being poured into it. 
What it strongly lacks today is the > connection to C/legacy code, numerical/scientific modules and of course it > also does not have a solution to the data copying overhead it also has with > the JVM. > > Any way, this is just my 2 cents on what is currently holding Python back > from taking off in this space. > > On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo wrote: >> >> >> On Mar 23, 2016 21:49, "Armin Rigo" wrote: >> > >> > Hi John, >> > >> > On 23 March 2016 at 19:16, John Camara wrote: >> > > I would like to suggest one more topic for the workshop. I see a big >> > > need >> > > for a library (jffi) similar to cffi but that provides a bridge to >> > > Java >> > > instead of C code. The ability to seamlessly work with native Java >> > > data/code >> > > would offer a huge improvement (...) >> > >> > Isn't it what JPype does? Can you describe how it isn't suitable for >> > your needs? >> >> There is also PyJNIus: >> >> https://pyjnius.readthedocs.org/en/latest/ > > > > _______________________________________________ > pypy-dev mailing list > pypy-dev at python.org > https://mail.python.org/mailman/listinfo/pypy-dev > From john.m.camara at gmail.com Thu Mar 24 11:23:54 2016 From: john.m.camara at gmail.com (John Camara) Date: Thu, 24 Mar 2016 11:23:54 -0400 Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year In-Reply-To: References: Message-ID: Hi Fijal, I understand where your coming from and not trying to convince you to work on it. Just mainly trying to point out a need that may not be obvious to this community. I don't spend much time on big data and analytics so I don't have a lot of time to devote to this task. That could change in the future so you never know I may end up getting involved with this. At the end of the day I think it is the PSF, which needs to do an honest assessment of the current state of Python and in programming in general, so that they can help direct the future of Python. I think with an honest assessment it should be clear that it is absolutely necessary that a dynamic language have a JIT. Otherwise, a language like Node would not be growing so quickly on the server side. An honest assessment would conclude that Python needs to play a major role in big data and analytics as we don't want this to be another area where Python misses the boat. As with all languages other than JavaScript we missed playing an important role on web front end. More recently we missed out on mobile. I don't think it is good for us to miss out on big data. It would be a shame since we had such a strong scientific community which initially gave us a huge advantage over other communities. Missing out on big data might also be the driver that moves the scientific community in a different direction which would be a big loss to Python. I personally don't see any particular companies or industries that are willing to fund the tasks needed to solve these issues. It's not to say there are no more funds for Python projects its just likely no one company will be willing to fund these kinds of projects on their own. It really needs the PSF to coordinate these efforts but they seamed to be more focus on trying to make Python 3 a success instead of improving the overall health of the community. I believe that Python is in pretty good shape in being able to solve these issues but it just needs some funding and focus to get there. Hopefully the workshop will be successful and help create some focus. 
John On Thu, Mar 24, 2016 at 8:56 AM, Maciej Fijalkowski wrote: > Hi John > > Thanks for explaining the current situation of the ecosystem. I'm not > quite sure what your intention is. PyPy (and CPython) is very easy to > embed through any C-level API, especially with the latest additions to > cffi embedding. If someone feels like doing the work to share stuff > that way (as I presume a lot of data presented in JVM can be > represented as some pointer and shape how to access it), then he's > obviously more than free to do so, I'm even willing to help with that. > Now this seems like a medium-to-big size project that additionally > will require quite a bit of community will to endorse. Are you willing > to volunteer to work on such a project and dedicate a lot of time to > it? If not, then there is no way you can convince us to volunteer our > own time to do it - it's just too big and quite a bit far out of our > usual areas of interest. If there is some commercial interest (and I > think there might be) in pushing python and especially pypy further in > that area, we might want to have a better story for numpy first, but > then feel free to send those corporate interest people my way, we can > maybe organize something. If you want us to do community service to > push Python solutions in the area I have very little clue about > however, I would like to politely decline. > > Cheers, > fijal > > On Thu, Mar 24, 2016 at 2:22 PM, John Camara > wrote: > > Besides JPype and PyJNIus there is also https://www.py4j.org/. I > haven't > > heard of JPype being used in any recent projects so I assuming it is > > outdated by now. PyJNIus gets used but I tend to only see it used on > > Android projects. The Py4J project gets used often in > numerical/scientific > > projects mainly due to it use in PySpark. The problem with all these > > libraries is that they don't have a way to share large amounts of memory > > between the JVM and Python VMs and so large chunks of data have to be > > copied/serialized when going between the 2 VMs. > > > > Spark is the de facto standard in clustering computing at this point in > > time. At a high level Spark executes code that is distributed > throughout a > > cluster so that the code being executed is as close as possible to where > the > > data lives so as to minimize transferring of large amounts of data. The > > code that needs to be executed are packaged up into units called > Resilient > > Distributed Dataset (RDD). RDDs are lazy evaluated and are essential > graphs > > of the operations that need to be performed on the data. They are > capable > > of reading data from many types of sources, outputting to multiple types > of > > sources, containing the code that needs to be executed, and are also > > responsible to caching or keeping results in memory for future RDDs that > > maybe executed. > > > > If you write all your code in Java or Scala, its execution will be > performed > > in JVMs distributed in the cluster. On the other hand, Spark does not > limit > > its use to only Java based languages so Python can be used. In the case > of > > Python the PySpark library is used. When Python is used, the PySpark > > library can be used to define the RDDs that will be executed under the > JVM. > > In this scenario, only if required, the final results of the calculations > > will end up being passed to Python. 
I say only if necessary as its > possible > > the end results may just be left in memory or to create an output such > as an > > hdfs file in hadoop and does not need to be transferred to Python. Under > > this scenario the code is written in Python but effectively all the > "real" > > work is performed under the JVM. > > > > Often someone writing Python is also going to want to perform some of the > > operations under Python. This can be done as the RDDs that are created > can > > contain both operations that get performed under the JVM as well as > Python > > (and of course other languages are supported). When Python is involved > > Spark will start up Python VMs on the required nodes so that the Python > > portions of the work can be performed. The Python VMs can either be > > CPython, PyPy or even a mix of both CPython and PyPy. The downside to > using > > non Java languages is the overhead of passing data between the JVM and > the > > Python VM as the memory is not shared between the processes but instead > > copied/serialized between them. > > > > Because this data is copied between the 2 VMs, anyone who writes Python > code > > for this environment always has to be conscious of the data being copied > > between the processes so as to not let the amount of the extra overhead > > become a large burden. Quite often the goal will be to first perform the > > bulk of the operations under the JVM and then hopefully only a smaller > > subset of the data will have to be processed under Python. If this can > be > > done then the overhead can be minimized and then there is essential no > down > > sides to using Python in the pipeline of operations. > > > > If your unfortunate and need to perform some of the processing early in > the > > pipline under Python and worse yet if there is a need to go back and > forth > > many times between Python and Java the overhead of coping huge amounts of > > data can significantly slow things down which essentially puts Python at > a > > disadvantage to Java. > > > > If it was possible to change the model of execution such that it was > > possible to embed the Python VM in the JVM or vice versa and that the > memory > > could be shared between the 2 VMs the downside of using Python in this > > environment would be eliminated or at the very least minimized to the > point > > where it is no longer an issue. Thus the need for a jffi library. > > > > There is a strong desire by many to use dynamic languages in these > clustered > > environments and Python is likely in the best position to become the > > language of choice due to its ability to work with C based libraries and > of > > course its syntax. The issues that hold Python back at this point is the > > serialization overhead, not so great state of packaging, and not having > both > > the speed of the JIT and complete access to numpy/scipy ecosystem. > > > > Luckily for Python at this point there is no other dynamic language that > is > > a clear winner today. But if too much time passes before these issues > are > > solved I'm sure another language will step up to the plate. At this > point > > my expectations is that Node could likely make a move. It already has > the > > speed due to the Java Script JITs, it already has a great story for > > packaging and deployment, and its growth is exploding on the server side > due > > to all the money being poured into it. 
> > What it strongly lacks today is the connection to C/legacy code and
> > numerical/scientific modules, and of course it also does not have a
> > solution to the data copying overhead it also has with the JVM.
> >
> > Anyway, this is just my 2 cents on what is currently holding Python back
> > from taking off in this space.
> >
> > On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo wrote:
> >>
> >> On Mar 23, 2016 21:49, "Armin Rigo" wrote:
> >> >
> >> > Hi John,
> >> >
> >> > On 23 March 2016 at 19:16, John Camara wrote:
> >> > > I would like to suggest one more topic for the workshop. I see a big
> >> > > need for a library (jffi) similar to cffi but that provides a bridge
> >> > > to Java instead of C code. The ability to seamlessly work with native
> >> > > Java data/code would offer a huge improvement (...)
> >> >
> >> > Isn't it what JPype does? Can you describe how it isn't suitable for
> >> > your needs?
> >>
> >> There is also PyJNIus:
> >>
> >> https://pyjnius.readthedocs.org/en/latest/
> >
> >
> > _______________________________________________
> > pypy-dev mailing list
> > pypy-dev at python.org
> > https://mail.python.org/mailman/listinfo/pypy-dev
> >
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From fijall at gmail.com  Thu Mar 24 11:32:05 2016
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Thu, 24 Mar 2016 17:32:05 +0200
Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year
In-Reply-To: 
References: 
Message-ID: 

Ok fine, but we're not the recipients of such a message.

Please lobby PSF for having a JIT, we all support that :-)

On Thu, Mar 24, 2016 at 5:23 PM, John Camara wrote:
> Hi Fijal,
>
> I understand where you're coming from and am not trying to convince you to
> work on it. Just mainly trying to point out a need that may not be obvious
> to this community. I don't spend much time on big data and analytics so I
> don't have a lot of time to devote to this task. That could change in the
> future so you never know I may end up getting involved with this.
>
> At the end of the day I think it is the PSF, which needs to do an honest
> assessment of the current state of Python and of programming in general, so
> that they can help direct the future of Python. I think with an honest
> assessment it should be clear that it is absolutely necessary that a
> dynamic language have a JIT. Otherwise, a language like Node would not be
> growing so quickly on the server side. An honest assessment would conclude
> that Python needs to play a major role in big data and analytics as we
> don't want this to be another area where Python misses the boat. As with
> all languages other than JavaScript we missed playing an important role on
> the web front end. More recently we missed out on mobile. I don't think it
> is good for us to miss out on big data. It would be a shame since we had
> such a strong scientific community which initially gave us a huge advantage
> over other communities. Missing out on big data might also be the driver
> that moves the scientific community in a different direction, which would
> be a big loss to Python.
>
> I personally don't see any particular companies or industries that are
> willing to fund the tasks needed to solve these issues. It's not to say
> there are no more funds for Python projects, it's just likely no one
> company will be willing to fund these kinds of projects on their own.
It really > needs the PSF to coordinate these efforts but they seamed to be more focus > on trying to make Python 3 a success instead of improving the overall health > of the community. > > I believe that Python is in pretty good shape in being able to solve these > issues but it just needs some funding and focus to get there. > > Hopefully the workshop will be successful and help create some focus. > > John > > On Thu, Mar 24, 2016 at 8:56 AM, Maciej Fijalkowski > wrote: >> >> Hi John >> >> Thanks for explaining the current situation of the ecosystem. I'm not >> quite sure what your intention is. PyPy (and CPython) is very easy to >> embed through any C-level API, especially with the latest additions to >> cffi embedding. If someone feels like doing the work to share stuff >> that way (as I presume a lot of data presented in JVM can be >> represented as some pointer and shape how to access it), then he's >> obviously more than free to do so, I'm even willing to help with that. >> Now this seems like a medium-to-big size project that additionally >> will require quite a bit of community will to endorse. Are you willing >> to volunteer to work on such a project and dedicate a lot of time to >> it? If not, then there is no way you can convince us to volunteer our >> own time to do it - it's just too big and quite a bit far out of our >> usual areas of interest. If there is some commercial interest (and I >> think there might be) in pushing python and especially pypy further in >> that area, we might want to have a better story for numpy first, but >> then feel free to send those corporate interest people my way, we can >> maybe organize something. If you want us to do community service to >> push Python solutions in the area I have very little clue about >> however, I would like to politely decline. >> >> Cheers, >> fijal >> >> On Thu, Mar 24, 2016 at 2:22 PM, John Camara >> wrote: >> > Besides JPype and PyJNIus there is also https://www.py4j.org/. I >> > haven't >> > heard of JPype being used in any recent projects so I assuming it is >> > outdated by now. PyJNIus gets used but I tend to only see it used on >> > Android projects. The Py4J project gets used often in >> > numerical/scientific >> > projects mainly due to it use in PySpark. The problem with all these >> > libraries is that they don't have a way to share large amounts of memory >> > between the JVM and Python VMs and so large chunks of data have to be >> > copied/serialized when going between the 2 VMs. >> > >> > Spark is the de facto standard in clustering computing at this point in >> > time. At a high level Spark executes code that is distributed >> > throughout a >> > cluster so that the code being executed is as close as possible to where >> > the >> > data lives so as to minimize transferring of large amounts of data. The >> > code that needs to be executed are packaged up into units called >> > Resilient >> > Distributed Dataset (RDD). RDDs are lazy evaluated and are essential >> > graphs >> > of the operations that need to be performed on the data. They are >> > capable >> > of reading data from many types of sources, outputting to multiple types >> > of >> > sources, containing the code that needs to be executed, and are also >> > responsible to caching or keeping results in memory for future RDDs that >> > maybe executed. >> > >> > If you write all your code in Java or Scala, its execution will be >> > performed >> > in JVMs distributed in the cluster. 
On the other hand, Spark does not >> > limit >> > its use to only Java based languages so Python can be used. In the case >> > of >> > Python the PySpark library is used. When Python is used, the PySpark >> > library can be used to define the RDDs that will be executed under the >> > JVM. >> > In this scenario, only if required, the final results of the >> > calculations >> > will end up being passed to Python. I say only if necessary as its >> > possible >> > the end results may just be left in memory or to create an output such >> > as an >> > hdfs file in hadoop and does not need to be transferred to Python. Under >> > this scenario the code is written in Python but effectively all the >> > "real" >> > work is performed under the JVM. >> > >> > Often someone writing Python is also going to want to perform some of >> > the >> > operations under Python. This can be done as the RDDs that are created >> > can >> > contain both operations that get performed under the JVM as well as >> > Python >> > (and of course other languages are supported). When Python is involved >> > Spark will start up Python VMs on the required nodes so that the Python >> > portions of the work can be performed. The Python VMs can either be >> > CPython, PyPy or even a mix of both CPython and PyPy. The downside to >> > using >> > non Java languages is the overhead of passing data between the JVM and >> > the >> > Python VM as the memory is not shared between the processes but instead >> > copied/serialized between them. >> > >> > Because this data is copied between the 2 VMs, anyone who writes Python >> > code >> > for this environment always has to be conscious of the data being copied >> > between the processes so as to not let the amount of the extra overhead >> > become a large burden. Quite often the goal will be to first perform >> > the >> > bulk of the operations under the JVM and then hopefully only a smaller >> > subset of the data will have to be processed under Python. If this can >> > be >> > done then the overhead can be minimized and then there is essential no >> > down >> > sides to using Python in the pipeline of operations. >> > >> > If your unfortunate and need to perform some of the processing early in >> > the >> > pipline under Python and worse yet if there is a need to go back and >> > forth >> > many times between Python and Java the overhead of coping huge amounts >> > of >> > data can significantly slow things down which essentially puts Python at >> > a >> > disadvantage to Java. >> > >> > If it was possible to change the model of execution such that it was >> > possible to embed the Python VM in the JVM or vice versa and that the >> > memory >> > could be shared between the 2 VMs the downside of using Python in this >> > environment would be eliminated or at the very least minimized to the >> > point >> > where it is no longer an issue. Thus the need for a jffi library. >> > >> > There is a strong desire by many to use dynamic languages in these >> > clustered >> > environments and Python is likely in the best position to become the >> > language of choice due to its ability to work with C based libraries and >> > of >> > course its syntax. The issues that hold Python back at this point is >> > the >> > serialization overhead, not so great state of packaging, and not having >> > both >> > the speed of the JIT and complete access to numpy/scipy ecosystem. >> > >> > Luckily for Python at this point there is no other dynamic language that >> > is >> > a clear winner today. 
>> > But if too much time passes before these issues are solved I'm sure
>> > another language will step up to the plate. At this point my expectation
>> > is that Node could likely make a move. It already has the speed due to
>> > the JavaScript JITs, it already has a great story for packaging and
>> > deployment, and its growth is exploding on the server side due to all
>> > the money being poured into it. What it strongly lacks today is the
>> > connection to C/legacy code and numerical/scientific modules, and of
>> > course it also does not have a solution to the data copying overhead it
>> > also has with the JVM.
>> >
>> > Anyway, this is just my 2 cents on what is currently holding Python back
>> > from taking off in this space.
>> >
>> > On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo wrote:
>> >>
>> >> On Mar 23, 2016 21:49, "Armin Rigo" wrote:
>> >> >
>> >> > Hi John,
>> >> >
>> >> > On 23 March 2016 at 19:16, John Camara wrote:
>> >> > > I would like to suggest one more topic for the workshop. I see a
>> >> > > big need for a library (jffi) similar to cffi but that provides a
>> >> > > bridge to Java instead of C code. The ability to seamlessly work
>> >> > > with native Java data/code would offer a huge improvement (...)
>> >> >
>> >> > Isn't it what JPype does? Can you describe how it isn't suitable for
>> >> > your needs?
>> >>
>> >> There is also PyJNIus:
>> >>
>> >> https://pyjnius.readthedocs.org/en/latest/
>> >
>> >
>> > _______________________________________________
>> > pypy-dev mailing list
>> > pypy-dev at python.org
>> > https://mail.python.org/mailman/listinfo/pypy-dev
>> >
>
>

From arigo at tunes.org  Thu Mar 24 12:20:57 2016
From: arigo at tunes.org (Armin Rigo)
Date: Thu, 24 Mar 2016 17:20:57 +0100
Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year
In-Reply-To: 
References: 
Message-ID: 

Hi John,

On 24 March 2016 at 13:22, John Camara wrote:
> (...) Thus the need for a jffi library.

When I hear "a jffi library" I'm thinking about a new library with a
new API. I think what you would really like instead is to keep the
existing libraries, but adapt them internally to allow tighter
execution of the Python and Java VMs.

I may be completely wrong about that, but you're also talking to the
wrong guys in the first place :-)


A bientôt,

Armin.

From dje.gcc at gmail.com  Thu Mar 24 12:31:46 2016
From: dje.gcc at gmail.com (David Edelsohn)
Date: Thu, 24 Mar 2016 12:31:46 -0400
Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year
In-Reply-To: 
References: 
Message-ID: 

Maciej,

How about a little more useful response of "we'll help you find the
right audience for this discussion and collaborate with you to make
the case."?

- David

On Thu, Mar 24, 2016 at 11:32 AM, Maciej Fijalkowski wrote:
> Ok fine, but we're not the recipients of such a message.
>
> Please lobby PSF for having a JIT, we all support that :-)
>
> On Thu, Mar 24, 2016 at 5:23 PM, John Camara wrote:
>> Hi Fijal,
>>
>> I understand where you're coming from and am not trying to convince you
>> to work on it. Just mainly trying to point out a need that may not be
>> obvious to this community. I don't spend much time on big data and
>> analytics so I don't have a lot of time to devote to this task. That
>> could change in the future so you never know I may end up getting
>> involved with this.
>> >> At the end of the day I think it is the PSF, which needs to do an honest >> assessment of the current state of Python and in programming in general, so >> that they can help direct the future of Python. I think with an honest >> assessment it should be clear that it is absolutely necessary that a dynamic >> language have a JIT. Otherwise, a language like Node would not be growing so >> quickly on the server side. An honest assessment would conclude that Python >> needs to play a major role in big data and analytics as we don't want this >> to be another area where Python misses the boat. As with all languages >> other than JavaScript we missed playing an important role on web front end. >> More recently we missed out on mobile. I don't think it is good for us to >> miss out on big data. It would be a shame since we had such a strong >> scientific community which initially gave us a huge advantage over other >> communities. Missing out on big data might also be the driver that moves >> the scientific community in a different direction which would be a big loss >> to Python. >> >> I personally don't see any particular companies or industries that are >> willing to fund the tasks needed to solve these issues. It's not to say >> there are no more funds for Python projects its just likely no one company >> will be willing to fund these kinds of projects on their own. It really >> needs the PSF to coordinate these efforts but they seamed to be more focus >> on trying to make Python 3 a success instead of improving the overall health >> of the community. >> >> I believe that Python is in pretty good shape in being able to solve these >> issues but it just needs some funding and focus to get there. >> >> Hopefully the workshop will be successful and help create some focus. >> >> John >> >> On Thu, Mar 24, 2016 at 8:56 AM, Maciej Fijalkowski >> wrote: >>> >>> Hi John >>> >>> Thanks for explaining the current situation of the ecosystem. I'm not >>> quite sure what your intention is. PyPy (and CPython) is very easy to >>> embed through any C-level API, especially with the latest additions to >>> cffi embedding. If someone feels like doing the work to share stuff >>> that way (as I presume a lot of data presented in JVM can be >>> represented as some pointer and shape how to access it), then he's >>> obviously more than free to do so, I'm even willing to help with that. >>> Now this seems like a medium-to-big size project that additionally >>> will require quite a bit of community will to endorse. Are you willing >>> to volunteer to work on such a project and dedicate a lot of time to >>> it? If not, then there is no way you can convince us to volunteer our >>> own time to do it - it's just too big and quite a bit far out of our >>> usual areas of interest. If there is some commercial interest (and I >>> think there might be) in pushing python and especially pypy further in >>> that area, we might want to have a better story for numpy first, but >>> then feel free to send those corporate interest people my way, we can >>> maybe organize something. If you want us to do community service to >>> push Python solutions in the area I have very little clue about >>> however, I would like to politely decline. >>> >>> Cheers, >>> fijal >>> >>> On Thu, Mar 24, 2016 at 2:22 PM, John Camara >>> wrote: >>> > Besides JPype and PyJNIus there is also https://www.py4j.org/. I >>> > haven't >>> > heard of JPype being used in any recent projects so I assuming it is >>> > outdated by now. 
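For reference, the bridge style all three of those libraries offer looks
roughly like the canonical Py4J example below. It assumes a Py4J
GatewayServer is already running inside the JVM; every call and every return
value is serialized over a local socket, which is the copying problem
described next:

    from py4j.java_gateway import JavaGateway

    gateway = JavaGateway()                  # connects to a JVM-side GatewayServer
    random = gateway.jvm.java.util.Random()  # instantiates java.util.Random in the JVM
    print(random.nextInt(10))                # the result travels back over the socket

    # Larger structures are copied the same way, call by call, so none of
    # these bridges help with sharing big datasets between the two VMs.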
PyJNIus gets used but I tend to only see it used on >>> > Android projects. The Py4J project gets used often in >>> > numerical/scientific >>> > projects mainly due to it use in PySpark. The problem with all these >>> > libraries is that they don't have a way to share large amounts of memory >>> > between the JVM and Python VMs and so large chunks of data have to be >>> > copied/serialized when going between the 2 VMs. >>> > >>> > Spark is the de facto standard in clustering computing at this point in >>> > time. At a high level Spark executes code that is distributed >>> > throughout a >>> > cluster so that the code being executed is as close as possible to where >>> > the >>> > data lives so as to minimize transferring of large amounts of data. The >>> > code that needs to be executed are packaged up into units called >>> > Resilient >>> > Distributed Dataset (RDD). RDDs are lazy evaluated and are essential >>> > graphs >>> > of the operations that need to be performed on the data. They are >>> > capable >>> > of reading data from many types of sources, outputting to multiple types >>> > of >>> > sources, containing the code that needs to be executed, and are also >>> > responsible to caching or keeping results in memory for future RDDs that >>> > maybe executed. >>> > >>> > If you write all your code in Java or Scala, its execution will be >>> > performed >>> > in JVMs distributed in the cluster. On the other hand, Spark does not >>> > limit >>> > its use to only Java based languages so Python can be used. In the case >>> > of >>> > Python the PySpark library is used. When Python is used, the PySpark >>> > library can be used to define the RDDs that will be executed under the >>> > JVM. >>> > In this scenario, only if required, the final results of the >>> > calculations >>> > will end up being passed to Python. I say only if necessary as its >>> > possible >>> > the end results may just be left in memory or to create an output such >>> > as an >>> > hdfs file in hadoop and does not need to be transferred to Python. Under >>> > this scenario the code is written in Python but effectively all the >>> > "real" >>> > work is performed under the JVM. >>> > >>> > Often someone writing Python is also going to want to perform some of >>> > the >>> > operations under Python. This can be done as the RDDs that are created >>> > can >>> > contain both operations that get performed under the JVM as well as >>> > Python >>> > (and of course other languages are supported). When Python is involved >>> > Spark will start up Python VMs on the required nodes so that the Python >>> > portions of the work can be performed. The Python VMs can either be >>> > CPython, PyPy or even a mix of both CPython and PyPy. The downside to >>> > using >>> > non Java languages is the overhead of passing data between the JVM and >>> > the >>> > Python VM as the memory is not shared between the processes but instead >>> > copied/serialized between them. >>> > >>> > Because this data is copied between the 2 VMs, anyone who writes Python >>> > code >>> > for this environment always has to be conscious of the data being copied >>> > between the processes so as to not let the amount of the extra overhead >>> > become a large burden. Quite often the goal will be to first perform >>> > the >>> > bulk of the operations under the JVM and then hopefully only a smaller >>> > subset of the data will have to be processed under Python. 
If this can >>> > be >>> > done then the overhead can be minimized and then there is essential no >>> > down >>> > sides to using Python in the pipeline of operations. >>> > >>> > If your unfortunate and need to perform some of the processing early in >>> > the >>> > pipline under Python and worse yet if there is a need to go back and >>> > forth >>> > many times between Python and Java the overhead of coping huge amounts >>> > of >>> > data can significantly slow things down which essentially puts Python at >>> > a >>> > disadvantage to Java. >>> > >>> > If it was possible to change the model of execution such that it was >>> > possible to embed the Python VM in the JVM or vice versa and that the >>> > memory >>> > could be shared between the 2 VMs the downside of using Python in this >>> > environment would be eliminated or at the very least minimized to the >>> > point >>> > where it is no longer an issue. Thus the need for a jffi library. >>> > >>> > There is a strong desire by many to use dynamic languages in these >>> > clustered >>> > environments and Python is likely in the best position to become the >>> > language of choice due to its ability to work with C based libraries and >>> > of >>> > course its syntax. The issues that hold Python back at this point is >>> > the >>> > serialization overhead, not so great state of packaging, and not having >>> > both >>> > the speed of the JIT and complete access to numpy/scipy ecosystem. >>> > >>> > Luckily for Python at this point there is no other dynamic language that >>> > is >>> > a clear winner today. But if too much time passes before these issues >>> > are >>> > solved I'm sure another language will step up to the plate. At this >>> > point >>> > my expectations is that Node could likely make a move. It already has >>> > the >>> > speed due to the Java Script JITs, it already has a great story for >>> > packaging and deployment, and its growth is exploding on the server side >>> > due >>> > to all the money being poured into it. What it strongly lacks today is >>> > the >>> > connection to C/legacy code, numerical/scientific modules and of course >>> > it >>> > also does not have a solution to the data copying overhead it also has >>> > with >>> > the JVM. >>> > >>> > Any way, this is just my 2 cents on what is currently holding Python >>> > back >>> > from taking off in this space. >>> > >>> > On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo >>> > wrote: >>> >> >>> >> >>> >> On Mar 23, 2016 21:49, "Armin Rigo" wrote: >>> >> > >>> >> > Hi John, >>> >> > >>> >> > On 23 March 2016 at 19:16, John Camara >>> >> > wrote: >>> >> > > I would like to suggest one more topic for the workshop. I see a >>> >> > > big >>> >> > > need >>> >> > > for a library (jffi) similar to cffi but that provides a bridge to >>> >> > > Java >>> >> > > instead of C code. The ability to seamlessly work with native Java >>> >> > > data/code >>> >> > > would offer a huge improvement (...) >>> >> > >>> >> > Isn't it what JPype does? Can you describe how it isn't suitable for >>> >> > your needs? 
>>> >> There is also PyJNIus:
>>> >>
>>> >> https://pyjnius.readthedocs.org/en/latest/
>>> >
>>> >
>>> > _______________________________________________
>>> > pypy-dev mailing list
>>> > pypy-dev at python.org
>>> > https://mail.python.org/mailman/listinfo/pypy-dev
>>> >
>>
>
> _______________________________________________
> pypy-dev mailing list
> pypy-dev at python.org
> https://mail.python.org/mailman/listinfo/pypy-dev

From john.m.camara at gmail.com  Thu Mar 24 13:11:31 2016
From: john.m.camara at gmail.com (John Camara)
Date: Thu, 24 Mar 2016 13:11:31 -0400
Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year
In-Reply-To: 
References: 
Message-ID: 

Hi Armin,

At a minimum tighter execution is required as well as sharing memory. But
on the other hand you have raised the bar so high with cffi, having a clean
and unbloated interface, that it would be nice if a library with a similar
spirit existed for Java. Having support in PyPy's JIT to remove all the
marshalling types would be a big plus on top of the shared memory, and some
integration between the 2 GCs would likely be required.

Maybe the best approach would be a combination of existing libraries and a
new interface that allows for sharing of memory. Maybe similar to numpy
arrays with a better API that avoids the pitfalls of numpy relying on
CPython semantics/implementation details. After all the only thing that
needs to be eliminated is the copying/serialization of large data
arrays/structures.

John

On Thu, Mar 24, 2016 at 12:20 PM, Armin Rigo wrote:
> Hi John,
>
> On 24 March 2016 at 13:22, John Camara wrote:
> > (...) Thus the need for a jffi library.
>
> When I hear "a jffi library" I'm thinking about a new library with a
> new API. I think what you would really like instead is to keep the
> existing libraries, but adapt them internally to allow tighter
> execution of the Python and Java VMs.
>
> I may be completely wrong about that, but you're also talking to the
> wrong guys in the first place :-)
>
>
> A bientôt,
>
> Armin.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From john.m.camara at gmail.com  Thu Mar 24 15:24:14 2016
From: john.m.camara at gmail.com (John Camara)
Date: Thu, 24 Mar 2016 15:24:14 -0400
Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year
In-Reply-To: 
References: 
Message-ID: 

It turns out there is some work in progress in the Spark project to share
its memory with non-JVM programs. See
https://issues.apache.org/jira/browse/SPARK-10399. Once this is completed
it should be fairly trivial to expose it to Python and then maybe JIT
integration could be discussed at that time. This is a huge step forward
over sharing Java objects. From the title of the ticket it appears it would
be a C++ interface, but looking at the pull request it looks like it will
be a C interface.

In the end the blocker may just come down to PyPy having complete support
for Numpy. Without Numpy the success of this would be somewhat limited
based on user expectations, and without PyPy it may be too slow for many
applications.
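If that C interface materializes, wiring it up from Python could look
roughly like the cffi sketch below. The function names, library name and
flat float64 layout are invented here purely for illustration (the actual
SPARK-10399 API was not settled at the time); the point is only that the
off-heap block gets mapped into Python rather than copied:

    import numpy as np
    from cffi import FFI

    ffi = FFI()
    # Hypothetical C interface; the real SPARK-10399 one may differ.
    ffi.cdef("""
        void*  spark_block_ptr(const char* block_id);
        size_t spark_block_len(const char* block_id);
    """)
    lib = ffi.dlopen("libspark_offheap.so")   # made-up library name

    ptr = lib.spark_block_ptr(b"rdd_42_0")    # made-up block id
    n = lib.spark_block_len(b"rdd_42_0")

    # ffi.buffer exposes the off-heap bytes to Python without copying, and
    # numpy views them in place, so the JVM and the Python VM share one copy.
    arr = np.frombuffer(ffi.buffer(ptr, n), dtype=np.float64)
    print(arr.mean())

The cffi half of this works the same way on CPython and PyPy; the numpy half
is exactly the support gap mentioned above.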
On Thu, Mar 24, 2016 at 1:11 PM, John Camara wrote:
> Hi Armin,
>
> At a minimum tighter execution is required as well as sharing memory. But
> on the other hand you have raised the bar so high with cffi, having a clean
> and unbloated interface, that it would be nice if a library with a similar
> spirit existed for Java. Having support in PyPy's JIT to remove all the
> marshalling types would be a big plus on top of the shared memory, and some
> integration between the 2 GCs would likely be required.
>
> Maybe the best approach would be a combination of existing libraries and a
> new interface that allows for sharing of memory. Maybe similar to numpy
> arrays with a better API that avoids the pitfalls of numpy relying on
> CPython semantics/implementation details. After all the only thing that
> needs to be eliminated is the copying/serialization of large data
> arrays/structures.
>
> John
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From fijall at gmail.com  Thu Mar 24 15:48:48 2016
From: fijall at gmail.com (Maciej Fijalkowski)
Date: Thu, 24 Mar 2016 21:48:48 +0200
Subject: [pypy-dev] [ANN] Python compilers workshop at SciPy this year
In-Reply-To: 
References: 
Message-ID: 

Hi David

I'm sorry, it was not supposed to come across as rude. It seems that the
blocker here is full numpy support, which we're working on right now; we
can come back to that discussion once that's ready.

On Thu, Mar 24, 2016 at 6:31 PM, David Edelsohn wrote:
> Maciej,
>
> How about a little more useful response of "we'll help you find the
> right audience for this discussion and collaborate with you to make
> the case."?
>
> - David
>
> On Thu, Mar 24, 2016 at 11:32 AM, Maciej Fijalkowski wrote:
>> Ok fine, but we're not the recipients of such a message.
>>
>> Please lobby PSF for having a JIT, we all support that :-)
>>
>> On Thu, Mar 24, 2016 at 5:23 PM, John Camara wrote:
>>> Hi Fijal,
>>>
>>> I understand where you're coming from and am not trying to convince you
>>> to work on it. Just mainly trying to point out a need that may not be
>>> obvious to this community. I don't spend much time on big data and
>>> analytics so I don't have a lot of time to devote to this task. That
>>> could change in the future so you never know I may end up getting
>>> involved with this.
>>>
>>> At the end of the day I think it is the PSF, which needs to do an honest
>>> assessment of the current state of Python and of programming in general,
>>> so that they can help direct the future of Python. I think with an honest
>>> assessment it should be clear that it is absolutely necessary that a
>>> dynamic language have a JIT. Otherwise, a language like Node would not be
>>> growing so quickly on the server side. An honest assessment would
>>> conclude that Python needs to play a major role in big data and analytics
>>> as we don't want this to be another area where Python misses the boat. As
>>> with all languages other than JavaScript we missed playing an important
>>> role on the web front end. More recently we missed out on mobile. I don't
>>> think it is good for us to miss out on big data. It would be a shame
>>> since we had such a strong scientific community which initially gave us a
>>> huge advantage over other communities. Missing out on big data might also
>>> be the driver that moves the scientific community in a different
>>> direction, which would be a big loss to Python.
>>>
>>> I personally don't see any particular companies or industries that are
>>> willing to fund the tasks needed to solve these issues. It's not to say
>>> there are no more funds for Python projects, it's just likely no one
>>> company will be willing to fund these kinds of projects on their own.
It really >>> needs the PSF to coordinate these efforts but they seamed to be more focus >>> on trying to make Python 3 a success instead of improving the overall health >>> of the community. >>> >>> I believe that Python is in pretty good shape in being able to solve these >>> issues but it just needs some funding and focus to get there. >>> >>> Hopefully the workshop will be successful and help create some focus. >>> >>> John >>> >>> On Thu, Mar 24, 2016 at 8:56 AM, Maciej Fijalkowski >>> wrote: >>>> >>>> Hi John >>>> >>>> Thanks for explaining the current situation of the ecosystem. I'm not >>>> quite sure what your intention is. PyPy (and CPython) is very easy to >>>> embed through any C-level API, especially with the latest additions to >>>> cffi embedding. If someone feels like doing the work to share stuff >>>> that way (as I presume a lot of data presented in JVM can be >>>> represented as some pointer and shape how to access it), then he's >>>> obviously more than free to do so, I'm even willing to help with that. >>>> Now this seems like a medium-to-big size project that additionally >>>> will require quite a bit of community will to endorse. Are you willing >>>> to volunteer to work on such a project and dedicate a lot of time to >>>> it? If not, then there is no way you can convince us to volunteer our >>>> own time to do it - it's just too big and quite a bit far out of our >>>> usual areas of interest. If there is some commercial interest (and I >>>> think there might be) in pushing python and especially pypy further in >>>> that area, we might want to have a better story for numpy first, but >>>> then feel free to send those corporate interest people my way, we can >>>> maybe organize something. If you want us to do community service to >>>> push Python solutions in the area I have very little clue about >>>> however, I would like to politely decline. >>>> >>>> Cheers, >>>> fijal >>>> >>>> On Thu, Mar 24, 2016 at 2:22 PM, John Camara >>>> wrote: >>>> > Besides JPype and PyJNIus there is also https://www.py4j.org/. I >>>> > haven't >>>> > heard of JPype being used in any recent projects so I assuming it is >>>> > outdated by now. PyJNIus gets used but I tend to only see it used on >>>> > Android projects. The Py4J project gets used often in >>>> > numerical/scientific >>>> > projects mainly due to it use in PySpark. The problem with all these >>>> > libraries is that they don't have a way to share large amounts of memory >>>> > between the JVM and Python VMs and so large chunks of data have to be >>>> > copied/serialized when going between the 2 VMs. >>>> > >>>> > Spark is the de facto standard in clustering computing at this point in >>>> > time. At a high level Spark executes code that is distributed >>>> > throughout a >>>> > cluster so that the code being executed is as close as possible to where >>>> > the >>>> > data lives so as to minimize transferring of large amounts of data. The >>>> > code that needs to be executed are packaged up into units called >>>> > Resilient >>>> > Distributed Dataset (RDD). RDDs are lazy evaluated and are essential >>>> > graphs >>>> > of the operations that need to be performed on the data. They are >>>> > capable >>>> > of reading data from many types of sources, outputting to multiple types >>>> > of >>>> > sources, containing the code that needs to be executed, and are also >>>> > responsible to caching or keeping results in memory for future RDDs that >>>> > maybe executed. 
>>>> > >>>> > If you write all your code in Java or Scala, its execution will be >>>> > performed >>>> > in JVMs distributed in the cluster. On the other hand, Spark does not >>>> > limit >>>> > its use to only Java based languages so Python can be used. In the case >>>> > of >>>> > Python the PySpark library is used. When Python is used, the PySpark >>>> > library can be used to define the RDDs that will be executed under the >>>> > JVM. >>>> > In this scenario, only if required, the final results of the >>>> > calculations >>>> > will end up being passed to Python. I say only if necessary as its >>>> > possible >>>> > the end results may just be left in memory or to create an output such >>>> > as an >>>> > hdfs file in hadoop and does not need to be transferred to Python. Under >>>> > this scenario the code is written in Python but effectively all the >>>> > "real" >>>> > work is performed under the JVM. >>>> > >>>> > Often someone writing Python is also going to want to perform some of >>>> > the >>>> > operations under Python. This can be done as the RDDs that are created >>>> > can >>>> > contain both operations that get performed under the JVM as well as >>>> > Python >>>> > (and of course other languages are supported). When Python is involved >>>> > Spark will start up Python VMs on the required nodes so that the Python >>>> > portions of the work can be performed. The Python VMs can either be >>>> > CPython, PyPy or even a mix of both CPython and PyPy. The downside to >>>> > using >>>> > non Java languages is the overhead of passing data between the JVM and >>>> > the >>>> > Python VM as the memory is not shared between the processes but instead >>>> > copied/serialized between them. >>>> > >>>> > Because this data is copied between the 2 VMs, anyone who writes Python >>>> > code >>>> > for this environment always has to be conscious of the data being copied >>>> > between the processes so as to not let the amount of the extra overhead >>>> > become a large burden. Quite often the goal will be to first perform >>>> > the >>>> > bulk of the operations under the JVM and then hopefully only a smaller >>>> > subset of the data will have to be processed under Python. If this can >>>> > be >>>> > done then the overhead can be minimized and then there is essential no >>>> > down >>>> > sides to using Python in the pipeline of operations. >>>> > >>>> > If your unfortunate and need to perform some of the processing early in >>>> > the >>>> > pipline under Python and worse yet if there is a need to go back and >>>> > forth >>>> > many times between Python and Java the overhead of coping huge amounts >>>> > of >>>> > data can significantly slow things down which essentially puts Python at >>>> > a >>>> > disadvantage to Java. >>>> > >>>> > If it was possible to change the model of execution such that it was >>>> > possible to embed the Python VM in the JVM or vice versa and that the >>>> > memory >>>> > could be shared between the 2 VMs the downside of using Python in this >>>> > environment would be eliminated or at the very least minimized to the >>>> > point >>>> > where it is no longer an issue. Thus the need for a jffi library. >>>> > >>>> > There is a strong desire by many to use dynamic languages in these >>>> > clustered >>>> > environments and Python is likely in the best position to become the >>>> > language of choice due to its ability to work with C based libraries and >>>> > of >>>> > course its syntax. 
The issues that hold Python back at this point is >>>> > the >>>> > serialization overhead, not so great state of packaging, and not having >>>> > both >>>> > the speed of the JIT and complete access to numpy/scipy ecosystem. >>>> > >>>> > Luckily for Python at this point there is no other dynamic language that >>>> > is >>>> > a clear winner today. But if too much time passes before these issues >>>> > are >>>> > solved I'm sure another language will step up to the plate. At this >>>> > point >>>> > my expectations is that Node could likely make a move. It already has >>>> > the >>>> > speed due to the Java Script JITs, it already has a great story for >>>> > packaging and deployment, and its growth is exploding on the server side >>>> > due >>>> > to all the money being poured into it. What it strongly lacks today is >>>> > the >>>> > connection to C/legacy code, numerical/scientific modules and of course >>>> > it >>>> > also does not have a solution to the data copying overhead it also has >>>> > with >>>> > the JVM. >>>> > >>>> > Any way, this is just my 2 cents on what is currently holding Python >>>> > back >>>> > from taking off in this space. >>>> > >>>> > On Thu, Mar 24, 2016 at 2:32 AM, Hakan Ardo >>>> > wrote: >>>> >> >>>> >> >>>> >> On Mar 23, 2016 21:49, "Armin Rigo" wrote: >>>> >> > >>>> >> > Hi John, >>>> >> > >>>> >> > On 23 March 2016 at 19:16, John Camara >>>> >> > wrote: >>>> >> > > I would like to suggest one more topic for the workshop. I see a >>>> >> > > big >>>> >> > > need >>>> >> > > for a library (jffi) similar to cffi but that provides a bridge to >>>> >> > > Java >>>> >> > > instead of C code. The ability to seamlessly work with native Java >>>> >> > > data/code >>>> >> > > would offer a huge improvement (...) >>>> >> > >>>> >> > Isn't it what JPype does? Can you describe how it isn't suitable for >>>> >> > your needs? >>>> >> >>>> >> There is also PyJNIus: >>>> >> >>>> >> https://pyjnius.readthedocs.org/en/latest/ >>>> > >>>> > >>>> > >>>> > _______________________________________________ >>>> > pypy-dev mailing list >>>> > pypy-dev at python.org >>>> > https://mail.python.org/mailman/listinfo/pypy-dev >>>> > >>> >>> >> _______________________________________________ >> pypy-dev mailing list >> pypy-dev at python.org >> https://mail.python.org/mailman/listinfo/pypy-dev