[Tutor] Newbie Trouble Processing SRT Strings In Text File
Dave Angel
davea at davea.name
Tue Nov 4 04:43:28 CET 2014
Please evaluate your email program. Some of your newlines are
being lost in the paste into your email.
Matt Varner <matt.l.varner at gmail.com> Wrote in message:
> TL:DR - Skip to "My Script: "subtrans.py"
>
> <beg>
>
> Optional Links to (perhaps) Helpful Images:
> 1. The SRT download button:
> http://i70.photobucket.com/albums/i82/RavingNoah/Python%20Help/tutor1_zps080f20f7.png
>
> 2. A visual comparison of my current problem (see 'Desire Versus
> Reality' below):
> http://i70.photobucket.com/albums/i82/RavingNoah/Python%20Help/newline_problem_zps307f8cab.jpg
>
> ============
> The SRT File
> ============
>
> The SRT file that you can download for every lesson that has a video
> contains the caption transcript data and is organized according to
> text snippets with some connecting time data.
>
> ========================================
> Reading the SRT File and Outputting Something Useful
> ========================================
>
> There may be a hundred different ways to read one of these file types.
> The reliable method I chose was to use a handy editor for the purpose
> called Aegisub. It will open the SRT file and let me immediately
> export a version of it, without the time data (which I don't
> need...yet). The result of the export is a plain-text file containing
> each string snippet and a newline character.
>
> ==========================
> Dealing with the Text File
> ==========================
>
> One of these text files can be anywhere from 130 to 500 lines or
> longer, depending (obviously) on the length of its attendant video.
> For my purposes, as a springboard for extending my own notes for each
> module, I need to concatenate each string with an acceptable format.
> My desire for this is to interject spaces where I need them and kill
> all the newline characters so that I get just one big lump of properly
> spaced paragraph text. From here, I can divide up the paragraphs how
> I see fit and I'm golden...
>
> ==============================
> My first Python script: Issues
> ==============================
>
> I did my due diligence. I have read the tutorial at www.python.org.
But did you actually try out and analyze each concept? There is a
difference between reading and studying.
> I went to my local library and have a copy of "Python Programming for
> the Absolute Beginner, 3rd Edition by Michael Dawson." I started
> collecting what seemed like logical little bits here and there from
> examples found using Uncle Google, but none of the examples anywhere
> were close enough, contextually, to be automatically picked up by my
> dense 'noobiosity.' For instance, when discussing string
> methods...almost all operations taught to beginners are done on
> strings generated "on the fly," directly inputted into IDLE, but not
> on strings that are contained in an external file.
When it's in the file, it's not a str. Reading it in produces a
string or a list of strings. And once they are created, you cannot
tell whether they came from a file, a literal, or some arbitrary
expression.
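A quick way to convince yourself, as a minimal sketch. The scratch file
demo.txt and its contents are made up for illustration; they are not
your tmp.txt.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("common numeric and textual values\n")

# Read the text back from the file.
with open(path) as f:
    from_file = f.read()

# The same text typed in as a literal, the way tutorials usually do it.
from_literal = "common numeric and textual values\n"

print(type(from_file))            # <class 'str'>
print(from_file == from_literal)  # True
print(from_file.strip().upper())  # every str method works on it
```

So whatever string methods you practiced on literals in IDLE apply
unchanged to the strings you get back from read() or readlines().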
> There are other
> examples for file operations, but none of them involved doing string
> operations afterward. After many errors about not being able to
> directly edit strings in a file object, I finally figured out that
> lists are used to read and store strings kept in a file like the one
> I'm sourcing from...so I tried using that. Then I spent hours
> unsuccessfully trying to call strings using index numbers from the
> list object (I guess I'm dense). Anyhow, I put together my little
> snippets and have been banging my head against the wall for a couple
> of days now.
>
> After many frustrating attempts, I have NEARLY produced what I'm
> looking to achieve in my test file.
>
> ================
> Example - Source
> ================
>
> My Test file contains just twelve lines of a much larger (but no more
> complex) file that is typical for the SRT subtitle caption file, of
> which I expect to have to process a hundred...or hundreds, depending
> on how many there are in all of the courses I plan to take
> (coincidentally, there is one on Python)
>
> Line 01: # Exported by Aegisub 3.2.1
> Line 02: [Deep Dive]
> Line 03: [CSS Values & Units Numeric and Textual Data Types with
> Guil Hernandez]
> Line 04: In this video, we'll go over the
> Line 05: common numeric and textual values
> Line 06: that CSS properties can accept.
> Line 07: Let's get started.
> Line 08: So, here we have a simple HTML page
> Line 09: containing a div and a paragraph
> Line 10: element nested inside.
> Line 11: It's linked to a style sheet named style.css
> Line 12: and this is where we'll be creating our new CSS rules.
>
> ========================
> My Script: "subtrans.py"
> ========================
>
> # Open the target file, create file object
> f = open('tmp.txt', 'r')
>
> # Create an output file to write the changed strings to
> o = open('result.txt', 'w')
>
> # Create a list object that holds all the strings in the file object
> lns = f.readlines()
>
> # Close the source file you no longer need now that you have your
> # strings
> f.close()
>
> # Import sys to get at stdout (standard output) - "print" results will
> # be written to file
> import sys
>
> # Associate stdout with the output file
> sys.stdout = o
>
No, just use o.write directly. Going through print is a waste of
your energy.
> # Try to print strings to output file using loopback variable (line)
> # and the list object
> for line in lns:
>     if ".\n" in line:
>         a = line.replace('.\n','. ')
>         print(a.strip('\n'))
>     else:
>         b = line.strip('\n')
>         print(b + " ")
>
Consider joining all the strings in your list with
"".join(lns)
and just do one o.write of the result.
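Something along these lines, as a rough sketch rather than a drop-in
replacement. It assumes you strip each newline and add your spacing
before joining; to_paragraph is a hypothetical helper name, the sample
lines are taken from your test file, and the output goes to a temporary
directory instead of your real result.txt.

```python
import os
import tempfile


def to_paragraph(lines):
    """Strip each line's newline, add spacing, and join into one string."""
    pieces = []
    for line in lines:
        text = line.rstrip("\n")
        # Two spaces after a line that ends in a period, otherwise one.
        pieces.append(text + ("  " if text.endswith(".") else " "))
    return "".join(pieces)


lns = [
    "In this video, we'll go over the\n",
    "common numeric and textual values\n",
    "that CSS properties can accept.\n",
    "Let's get started.\n",
]

paragraph = to_paragraph(lns)

# One write call, no print() and no sys.stdout rebinding.
out_path = os.path.join(tempfile.mkdtemp(), "result.txt")
with open(out_path, "w") as o:
    o.write(paragraph)

print(repr(paragraph))
```

The with statement closes the file for you, so the explicit close()
calls go away as well.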
> # Close your output file
> o.close()
>
> =================
> Desire Versus Reality
> =================
>
> The source file contains a series of strings with newline characters
> directly following whatever the last character in the snippet...with
> absolutely no whitespace. This is a problem for me if I want to
> concatenate it back together into paragraph text to use as the
> jumping off point for my additional notes. I've been literally taking
> four hours to type explicitly the dialogue from the videos I've been
> watching...and I know this is going to save me a lot of time and get
> me interacting with the lessons faster and more efficiently.
> However...
>
> My script succeeds in processing the source file and adding the right
> amount of spaces for each line, the rule being "two spaces added
> following a period, and one space added following a string with no
> period in it" (technically, a period/newline pairing, which was the
> only way I could figure out how not to target the period in
> 'example.css' or 'version 2.3.2').
>
> But, even though it successfully kills these additional newlines that
> seem to form in the list-making process
They aren't extra, they're in the file.
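You can see them by printing the list itself, since that shows the repr
of every string. A small sketch, with io.StringIO standing in for your
opened file (an assumption made only to keep the example
self-contained):

```python
import io

# io.StringIO behaves like an opened text file for this purpose.
f = io.StringIO(
    "element nested inside.\n"
    "It's linked to a style sheet named style.css\n"
)

lns = f.readlines()
print(lns)
# ['element nested inside.\n', "It's linked to a style sheet named style.css\n"]

# Each element keeps the newline that ended its line in the file.
stripped = [line.rstrip("\n") for line in lns]
print(stripped)
```

readlines() does not invent anything; it just leaves the line endings
attached.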
> ...I end up with basically a
> non-concatenated file of strings...with the right spaces I need, but
> not one big chunk of text, like I expect using the s.strip('\n')
> functionality.
That's because you're using print(), which appends a trailing
newline by default. The print function has a keyword parameter,
end, that lets you suppress or replace that newline.
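If you do want to stay with print, a minimal sketch of that looks like
this. The io.StringIO object stands in for your output file (an
assumption for the sake of a self-contained example), and the file=
argument replaces the sys.stdout reassignment.

```python
import io

out = io.StringIO()  # stands in for the opened output file

lines = [
    "So, here we have a simple HTML page\n",
    "containing a div and a paragraph\n",
]

for line in lines:
    # end=" " replaces print's default "\n" with a single space.
    print(line.rstrip("\n"), end=" ", file=out)

print(repr(out.getvalue()))
# 'So, here we have a simple HTML page containing a div and a paragraph '
```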
Note that you haven't explicitly addressed the file encodings for
input or output.
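In Python 3, open() accepts an encoding argument for exactly this. The
sketch below assumes UTF-8 purely for illustration; I don't know what
encoding Aegisub actually wrote your file in, and the scratch files
live in a temporary directory rather than next to your script.

```python
import os
import tempfile

# Assumption: UTF-8 on both sides. Check what your exporter really uses.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "tmp.txt")
dst = os.path.join(workdir, "result.txt")

# Create a small input file containing a non-ASCII character.
with open(src, "w", encoding="utf-8") as f:
    f.write("caf\u00e9 values\n")

# Read with an explicit encoding instead of the platform default.
with open(src, "r", encoding="utf-8") as f:
    text = f.read()

# Write with an explicit encoding too.
with open(dst, "w", encoding="utf-8") as o:
    o.write(text.rstrip("\n"))
```

Without the argument, Python falls back to a platform-dependent
default, which may or may not match the file.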
>
>
>
--
DaveA