[New-bugs-announce] [issue25743] Clarify exactly what \w matches in UNICODE mode

Zack Weinberg report at bugs.python.org
Fri Nov 27 10:50:58 EST 2015


New submission from Zack Weinberg:

The `re` module documentation does not explain precisely what `\w` matches.  Quoting https://docs.python.org/3.5/library/re.html :

> \w
> For Unicode (str) patterns:
> Matches Unicode word characters; this includes most characters
> that can be part of a word in any language, as well as numbers
> and the underscore.

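Empirically, this appears to mean "everything in Unicode general categories L* and N*, plus U+005F (underscore)".  That is a perfectly sensible definition, and the documentation should state it in those terms.  "Unicode word characters" could mean any number of different things; note for instance that UTS#18 gives a very different definition (its compatibility definition of `\w` is built from the Alphabetic property, general categories M*, Nd and Pc, and Join_Control, so it includes combining marks and excludes the non-decimal numbers in Nl and No).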
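
One way to check the empirical claim above is to compare `\w` against the general category of every code point.  The following is only a rough sketch for Python 3, not a statement of how `re` is implemented; the helper names `in_l_n_or_underscore` and `find_mismatches` are made up for illustration, and the result depends on the Unicode database shipped with the interpreter being used.

```python
import re
import sys
import unicodedata

# A str pattern, so Unicode matching is the default in Python 3.
WORD = re.compile(r"\w")


def in_l_n_or_underscore(ch):
    """Hypothesis under test: general category L* or N*, or U+005F."""
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N")


def find_mismatches(limit=sys.maxunicode + 1):
    """Return the code points below `limit` where \\w and the hypothesis disagree."""
    mismatches = []
    for cp in range(limit):
        ch = chr(cp)
        if bool(WORD.fullmatch(ch)) != in_l_n_or_underscore(ch):
            mismatches.append(cp)
    return mismatches
```

Calling `find_mismatches()` with no argument scans the whole code space; any code points it returns are counterexamples to the "L*, N*, plus underscore" description for that interpreter.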

(Further reading: https://gist.github.com/zackw/3077f387591376c7bf67 plus links therefrom).

----------
assignee: docs at python
components: Documentation
messages: 255463
nosy: docs at python, zwol
priority: normal
severity: normal
status: open
title: Clarify exactly what \w matches in UNICODE mode
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue25743>
_______________________________________

