[Tutor] \x00T\x00r\x00i\x00a\x00 ie I get \x00 breaking up every character ?

Steve Willoughby steve at alchemy.com
Sun Nov 20 21:26:30 CET 2011


On 20-Nov-11 12:04, Sarma Tangirala wrote:
> Would the html parser library in python be a better idea as opposed to
> using split? That way you have greater control over what is in the html.

Absolutely. It would also handle improper HTML (such as unmatched brackets)
gracefully, where the split would simply do the wrong thing.
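
For illustration, here is a minimal sketch of the parser approach, written
against Python 3's html.parser module. The TextExtractor class, the
extract_text() helper and the sample data are made up for this example. The
decode step is a guess on my part: output like \x00T\x00r\x00i\x00a is what
UTF-16 text tends to look like when it is handled as single-byte characters,
so the sketch assumes the file is UTF-16 and decodes it before parsing.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # The parser calls this for each run of text outside of tags.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def extract_text(page):
    """Return a list of the non-empty text pieces in an HTML string."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return parser.chunks


# Stand-in for the real file contents. The UTF-16 encoding is an
# assumption based on the \x00 bytes shown in the output.
raw = ("<html><body><h1>Trial</h1>"
       "<p>Some <b>bold</b> text</p></body></html>").encode("utf-16")

page = raw.decode("utf-16")   # the byte order mark selects the byte order
text = extract_text(page)

print(text)                   # ['Trial', 'Some', 'bold', 'text']
print(text.index("Trial"))    # 0
```

If the data turns out not to be UTF-16, only the decode line needs to change;
the parser part stays the same.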

>
> On 20 Nov 2011 23:58, "dave selby" <dave6502 at gmail.com> wrote:
>
>     Hi All,
>
>     I have a long string which is an HTML file. I strip the HTML tags away
>     and make a list with
>
>     text = re.split('<.*?>', HTML)
>
>     I then tried to search for a string with text.index(...) but it was
>     not found. Printing HTML to a terminal, I get what I expect: a block of
>     tags and text. When I split the HTML and print text, I get loads of
>
>     \x00T\x00r\x00i\x00a\x00  i.e. I get \x00 breaking up every character.
>
>     Any idea what is happening and how to get back to a list of ASCII
>     strings?
>
>     Cheers
>
>     Dave
>
>     --
>
>     Please avoid sending me Word or PowerPoint attachments.
>     See http://www.gnu.org/philosophy/no-word-attachments.html
>     _______________________________________________
>     Tutor maillist  - Tutor at python.org
>     To unsubscribe or change subscription options:
>     http://mail.python.org/mailman/listinfo/tutor


-- 
Steve Willoughby / steve at alchemy.com
"A ship in harbor is safe, but that is not what ships are built for."
PGP Fingerprint 4615 3CCE 0F29 AE6C 8FF4 CA01 73FE 997A 765D 696C

