[New-bugs-announce] [issue47243] Duplicate entry in 'Objects/unicodetype_db.h'

LiarPrincess report at bugs.python.org
Wed Apr 6 13:23:31 EDT 2022


New submission from LiarPrincess <mail at liarprincess.me>:

This one is so tiny that I'm not really sure we want to merge it…

=== Problem ===

`Objects/unicodetype_db.h` starts as follows:

```c
/* a list of unique character type descriptors */
const _PyUnicode_TypeRecord _PyUnicode_TypeRecords[] = {
    {0, 0, 0, 0, 0, 0},
    {0, 0, 0, 0, 0, 0},
    {0, 0, 0, 0, 0, 32},
    {0, 0, 0, 0, 0, 48},
    …
```

The 1st record (`{0, 0, 0, 0, 0, 0}`) is duplicated.
This is not a problem, since the 1st occurrence is never used, but if we want to remove it, this is the ticket for it.

=== Detailed description ===

`Objects/unicodetype_db.h` is generated by `Tools/unicode/makeunicodedata.py` (I removed irrelevant lines):

```py
def makeunicodetype(unicode, trace):
    dummy = (0, 0, 0, 0, 0, 0)
    table = [dummy] # (1)
    cache = {0: dummy} # (2)

    for char in unicode.chars:
        # Things…

        item = (upper, lower, title, decimal, digit, flags)

        i = cache.get(item) # (3)
        if i is None:
            cache[item] = i = len(table)
            table.append(item)

        index[char] = i
```

- (1) - list which contains unique character properties (as `(upper, lower, title, decimal, digit, flags)` tuples)
- (2) - mapping from character properties to index in `table` - improperly initialized as a mapping from index to character properties
- (3) - we check if the current tuple is in `cache`

=== Result ===

The first time we get to a character that has `(0, 0, 0, 0, 0, 0)` properties (which is code point 0 - `NULL`) we check if it is in `cache`. It is not (there is an entry that goes from index `0` to `(0, 0, 0, 0, 0, 0)` - the other way around), so we add this entry to `table` and `cache`, which produces the duplicated record.
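
The effect can be reproduced outside the generator with a minimal sketch. `build_table` and the sample tuples are hypothetical stand-ins, not code from `makeunicodedata.py`; only the `table`/`cache` logic mirrors the excerpt above.

```py
def build_table(items):
    """Deduplicate items as in the excerpt, keeping the mis-keyed cache."""
    dummy = (0, 0, 0, 0, 0, 0)
    table = [dummy]
    cache = {0: dummy}  # keyed by index, as in the excerpt

    index = []  # hypothetical stand-in for the per-character index
    for item in items:
        i = cache.get(item)  # looks up a tuple; the only initial key is the int 0
        if i is None:
            cache[item] = i = len(table)
            table.append(item)
        index.append(i)
    return table, index


# Hypothetical sample data: the first item has all-zero properties.
sample = [(0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 32), (0, 0, 0, 0, 0, 0)]
table, index = build_table(sample)
print(table)  # the all-zero tuple appears at positions 0 and 1
print(index)  # [1, 2, 1]: index 0 is never referenced
```

In this sketch the record at index 0 is never referenced by `index`, which matches the observation that the 1st occurrence is never used.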

=== Fix ===

On line `(2)` we should have: `cache = {dummy: 0}`. After doing so we have to re-run `makeunicodedata.py` to regenerate the header. Dropping the duplicate shifts the index of every later record, which is why this simple change modifies a lot of lines.
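
As a rough check of the proposed change, the same kind of sketch with the cache keyed by properties yields a table without duplicates. `build_table_fixed` and the sample tuples are hypothetical and only illustrate the one-line fix; they are not the actual generator code.

```py
def build_table_fixed(items):
    """Deduplicate items with the cache keyed by properties, as proposed."""
    dummy = (0, 0, 0, 0, 0, 0)
    table = [dummy]
    cache = {dummy: 0}  # properties -> index in table

    index = []  # hypothetical stand-in for the per-character index
    for item in items:
        i = cache.get(item)  # the all-zero tuple is now found at index 0
        if i is None:
            cache[item] = i = len(table)
            table.append(item)
        index.append(i)
    return table, index


# Hypothetical sample data: the first item has all-zero properties.
sample = [(0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 32), (0, 0, 0, 0, 0, 0)]
table, index = build_table_fixed(sample)
print(table)  # [(0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 32)]
print(index)  # [0, 1, 0]
```

Every index here is one lower than with the mis-keyed cache, except that the all-zero records now map to 0.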

I will submit a PR on GitHub in just a sec…

----------
components: Unicode
messages: 416889
nosy: LiarPrincess, ezio.melotti, vstinner
priority: normal
severity: normal
status: open
title: Duplicate entry in 'Objects/unicodetype_db.h'
type: enhancement

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue47243>
_______________________________________

