[New-bugs-announce] [issue43668] Segfault with for fresh ubuntu 20.04 install

Tue Mar 30 07:05:31 EDT 2021

New submission from axel <axel.goteman at mindified.com>:

The python interpreter segfaults when running in a miniconda environment on a fresh install of ubuntu 20.04.2. This seems to happen intermittently, both while running "pip" during the conda setup of an environment and during the execution of code like below. The issue is most have mostly been reproduced with conda, but seems to happen regardless, which is why I suspect it is a python bug. It is very odd that I can't seem to find anyone else with the same issue.

The segfault always occurs when running the following code, which reads texts from files and tokenizes the result. The segfault location changes from run to run. Also the exact same code can run on another computer with the same conda environment on a ubuntu 18.04.

The core dumps always points to some function in the unicodeobject.c file in python but the exact function changes from crash to crash. At least one crash has a clear dereferenced pointer 0x0 where the "unicode object" should be.

My guess is that something causes the python interpreter to throw away the pointed to unicode object while it is still being worked on causing a segfault. But any bug in the interpreter or NLTK should have been noticed by more users, and I cannot find anyone with similar issues. 

Things tried that didn't fix the issue:
1. Reformatting and reinstalling ubuntu
2. Switched to ubuntu 18.04 (on this computer, another computer with 18.04 can run the code just fine)
3. Replacing hardware, to ensure that RAM, or SSD disk isn't broken
4. Changing to python versions 3.8.6, 3.8.8, 3.9.2
5. Cloning the conda environment from a working computer to the broken one

Attached is one stacktrace of the fault handler along with it's corresponding core dump stack trace from gdb.

```
(eo) axel at minimind:~/test$ python tokenizer_mini.py 
2021-03-30 11:10:15.588399: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-03-30 11:10:15.588426: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Fatal Python error: Segmentation fault

Current thread 0x00007faa73bbe740 (most recent call first):
  File "tokenizer_mini.py", line 36 in preprocess_string
  File "tokenizer_mini.py", line 51 in <module>
Segmentation fault (core dumped)
```

```
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  <signal handler called>
#2  find_maxchar_surrogates (num_surrogates=<synthetic pointer>, maxchar=<synthetic pointer>, 
    end=0x4 <error: Cannot access memory at address 0x4>, begin=0x0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/unicodeobject.c:1703
#3  _PyUnicode_Ready (unicode=0x7f7e4e04d7f0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/unicodeobject.c:1742
#4  0x000055cd65f6df6a in PyUnicode_RichCompare (left=0x7f7e4cf43fb0, right=<optimized out>, op=2)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/unicodeobject.c:11205
#5  0x000055cd6601712a in do_richcompare (op=2, w=0x7f7e4e04d7f0, v=0x7f7e4cf43fb0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/object.c:726
#6  PyObject_RichCompare (op=2, w=0x7f7e4e04d7f0, v=0x7f7e4cf43fb0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/object.c:774
#7  PyObject_RichCompareBool (op=2, w=0x7f7e4e04d7f0, v=0x7f7e4cf43fb0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/object.c:796
#8  list_contains (a=0x7f7e4e04b4c0, el=0x7f7e4cf43fb0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/listobject.c:455
#9  0x000055cd660be41b in PySequence_Contains (ob=0x7f7e4cf43fb0, seq=0x7f7e4e04b4c0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/abstract.c:2083
#10 cmp_outcome (w=0x7f7e4e04b4c0, v=0x7f7e4cf43fb0, op=<optimized out>, tstate=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:5082
#11 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:2977
#12 0x000055cd6609f706 in PyEval_EvalFrameEx (throwflag=0, f=0x7f7e4f4d3c40)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:738
#13 function_code_fastcall (globals=<optimized out>, nargs=<optimized out>, args=<optimized out>, co=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/call.c:284
#14 _PyFunction_Vectorcall (func=<optimized out>, stack=<optimized out>, nargsf=<optimized out>, kwnames=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Objects/call.c:411
#15 0x000055cd660be54f in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x7f7f391985b8, callable=0x7f7f39084160)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Include/cpython/abstract.h:115
#16 call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>, tstate=0x55cd66c2e880)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:4963
#17 _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:3500
#18 0x000055cd6609e503 in PyEval_EvalFrameEx (throwflag=0, f=0x7f7f39198440)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:4298
#19 _PyEval_EvalCodeWithName (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, args=<optimized out>, 
    argcount=<optimized out>, kwnames=<optimized out>, kwargs=<optimized out>, kwcount=<optimized out>, kwstep=<optimized out>, 
    defs=<optimized out>, defcount=<optimized out>, kwdefs=<optimized out>, closure=<optimized out>, name=<optimized out>, 
    qualname=<optimized out>) at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:4298
#20 0x000055cd6609f559 in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=<optimized out>, 
    args=<optimized out>, argcount=<optimized out>, kws=<optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:4327
#21 0x000055cd661429ab in PyEval_EvalCode (co=<optimized out>, globals=<optimized out>, locals=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ceval.c:718
#22 0x000055cd66142a43 in run_eval_code_obj (co=0x7f7f3910f240, globals=0x7f7f391fad80, locals=0x7f7f391fad80)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/pythonrun.c:1165
#23 0x000055cd6615c6b3 in run_mod (mod=<optimized out>, filename=<optimized out>, globals=0x7f7f391fad80, locals=0x7f7f391fad80, 
    flags=<optimized out>, arena=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/pythonrun.c:1187
--Type <RET> for more, q to quit, c to continue without paging--
#24 0x000055cd661615b2 in pyrun_file (fp=0x55cd66c2cdf0, filename=0x7f7f391bbee0, start=<optimized out>, globals=0x7f7f391fad80, 
    locals=0x7f7f391fad80, closeit=1, flags=0x7ffe3ee6f8e8)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/pythonrun.c:1084
#25 0x000055cd66161792 in pyrun_simple_file (flags=0x7ffe3ee6f8e8, closeit=1, filename=0x7f7f391bbee0, fp=0x55cd66c2cdf0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/pythonrun.c:439
#26 PyRun_SimpleFileExFlags (fp=0x55cd66c2cdf0, filename=<optimized out>, closeit=1, flags=0x7ffe3ee6f8e8)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/pythonrun.c:472
#27 0x000055cd66161d0d in pymain_run_file (cf=0x7ffe3ee6f8e8, config=0x55cd66c2da70)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Modules/main.c:391
#28 pymain_run_python (exitcode=0x7ffe3ee6f8e0)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Modules/main.c:616
#29 Py_RunMain () at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Modules/main.c:695
#30 0x000055cd66161ec9 in Py_BytesMain (argc=<optimized out>, argv=<optimized out>)
    at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Modules/main.c:1127
#31 0x00007f7f3a3620b3 in __libc_start_main (main=0x55cd65fe3490 <main>, argc=2, argv=0x7ffe3ee6fae8, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe3ee6fad8) at ../csu/libc-start.c:308
#32 0x000055cd660d7369 in _start () at /home/conda/feedstock_root/build_artifacts/python-split_1613835706476/work/Python/ast.c:937
```

The conda environment used is below, using Miniconda3-py38_4.9.2-Linux-x86_64.sh (note that the segfault does sometimes occur during the setup of a conda environment so it's probably not related to the env)
```
name: eo
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8.8
  - pip=20.3.1
  - pip:
    - transformers==4.3.2
    - tensorflow_gpu==2.4.0
    - scikit-learn==0.23.2
    - nltk==3.5
    - matplotlib==3.2.1
    - seaborn==0.11.0
    - tensorflow-addons==0.11.2
    - tf-models-official==2.4.0
    - gspread==3.6.0
    - oauth2client==4.1.3
    - ipykernel==5.4.2
    - autopep8==1.5.4
    - torch==1.7.1
```

The code below consistently reproduces the problem, the files read are simple text files containing unicode text:

```python
from nltk.tokenize import wordpunct_tokenize
from tensorflow.keras.preprocessing.text import Tokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
import pickle
from pathlib import Path
import faulthandler
faulthandler.enable()

def load_data(root_path, feature, index):
    feature_root = root_path / feature
    dir1 = str(index // 10_000)
    base_path = feature_root / dir1 / str(index)
    full_path = base_path.with_suffix('.txt')
    data = None
    with open(full_path, 'r', encoding='utf-8') as f:
        data = f.read()
    return data

def preprocess_string(text, stemmer, stop_words):
    word_tokens = wordpunct_tokenize(text.lower())
    alpha_tokens = []
    for w in word_tokens:
        try:
            if (w.isalpha() and w not in stop_words):
                alpha_tokens.append(w)
        except:
            print("Something went wrong when handling the word: ", w)

    clean_tokens = []
    for w in alpha_tokens:
        try:
            word = stemmer.stem(w)
            clean_tokens.append(word)
        except:
            print("Something went wrong when stemming the word: ", w)
            clean_tokens.append(w)
    return clean_tokens

stop_words = stopwords.words('english')
stemmer = SnowballStemmer(language='english')
tokenizer = Tokenizer()

root_path = '/srv/patent/EbbaOtto/E'
for idx in range(0, 57454):
    print(f'Processed {idx}/57454', end='\r')
    desc = str(load_data(Path(root_path), 'clean_description', idx))
    desc = preprocess_string(desc, stemmer, stop_words)
    tokenizer.fit_on_texts([desc])

```

For more readable formatting read the stackoverflow post regarding the same issue:
https://stackoverflow.com/questions/66868753/segfault-with-for-fresh-ubuntu-20-04-install-using-conda

----------
components: Interpreter Core
messages: 389816
nosy: axel_1234
priority: normal
severity: normal
status: open
title: Segfault with for fresh ubuntu 20.04 install
type: crash
versions: Python 3.7, Python 3.8

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue43668>
_______________________________________