#1 2021-05-26 08:17:54

Middle Office
Registered: 2015-09-29
Posts: 2,436

[Solved] Splitting text based on sentences with a char limit?

Still researching various cloudy text-to-speech synths, AWS Polly being one of them. Its basic synth command is limited to 3000 chars (I'd prefer not to dig further into the somewhat complicated AWS cloud structure, with something called S3 buckets and such). 3000 chars is plenty to cover, say, 5 sentences of any text with some punctuation, but that can't really be taken for granted.

So what would be needed is something like:
Take 5 sentences from a long text and check them against the 3000 char limit.
If the text is shorter than 3000 chars, everything is fine > send to synth.
If longer, what then? What if there is no punctuation at all?

Should I pre-split everything until it fits or is there a way to loop as it goes?

edit: So the 1st idea would be
a. Split by sentences (tmp files)
b. Check each against the char limit and split further if needed

2nd idea (seems better)
a. Split by chars and cut back until a punctuation mark is found, store the remainder in rest.txt for the next iteration
b. rest.txt + next a.
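
A minimal bash sketch of the 2nd idea, under a few assumptions: the function name chunk_text is made up for illustration, the "rest" is kept in a variable instead of rest.txt, and when a window contains no punctuation at all it falls back to the last space, then to a hard cut. It is not tuned for speed, so a whole book may take a while.

```shell
#!/usr/bin/env bash
# chunk_text LIMIT: read text on stdin, print one chunk per line.
# (chunk_text is a hypothetical helper name, not an existing tool.)
# Each chunk is at most LIMIT characters (as counted by bash in the
# current locale) and ends at sentence punctuation when possible.
chunk_text() {
    local limit=$1 text piece head
    # Flatten line breaks into spaces and squeeze repeated spaces.
    text=$(tr '\r\n' '  ' | tr -s ' ')
    text=${text# }
    text=${text% }
    while [ "${#text}" -gt "$limit" ]; do
        piece=${text:0:limit}
        # Longest prefix of the window that ends in . ! or ?
        head=${piece%"${piece##*[.!?]}"}
        # No punctuation in the window: cut at the last space instead.
        if [ -z "$head" ]; then head=${piece%"${piece##* }"}; fi
        # No space either: hard cut at LIMIT.
        if [ -z "$head" ]; then head=$piece; fi
        # The remainder plays the role of rest.txt for the next round.
        text=${text:${#head}}
        text=${text# }
        printf '%s\n' "${head% }"
    done
    if [ -n "$text" ]; then printf '%s\n' "$text"; fi
}
```

Something like chunk_text 3000 < some.txt would then print one chunk per line, each ready to be checked or sent on its own.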

edit: This seems to be 'fine' and it's fast (a medium-sized book split in 0.420 s):

# tr: replace line breaks with spaces; sed: add a newline after sentence punctuation;
# sed: squeeze repeated spaces; split: at most 800 bytes per file, keeping lines whole
cd "$tmp" || exit
tr '\r\n' '  ' < "some.txt" | sed 's/[.!?]  */&\n/g' | sed 's/ \{1,\}/ /g' | split --line-bytes=800

# count generated files
all="$(find . -type f -name "x*" | wc -l)"

# print the split parts with a (part/total) header
part=0
for file in x*; do
    part=$((part + 1))
    echo "($part/$all)"
    fmt -w 80 "$file"
done

Last edited by brontosaurusrex (2021-05-27 08:13:34)


#2 2021-05-27 06:13:58

Registered: 2015-09-29
Posts: 5,568

Re: [Solved] Splitting text based on sentences with a char limit?

What is the question?
How to best split a book into 3000 char sections, along sentence boundaries?
AFAICS, a sentence is defined by some sort of character, immediately followed by a dot, immediately followed by a space or Return, in any language?

And afaics, your code snippet - the cat pipeliner - already does that, no? Except it splits every 800 chars, and not 3000?

I would replace a lot of stuff there with bash builtins, but if it works, where is the question?



#3 2021-05-27 08:01:40

Middle Office
Registered: 2015-09-29
Posts: 2,436

Re: [Solved] Splitting text based on sentences with a char limit?

That + it needs to split even if there are no dots, where I seem to have lucked out, but I'm pretty sure there is a better/faster/simpler version of this.  edit: Also, at the time I was thinking this could be done on the fly somehow, which is doable I guess, but since I want the number of parts to be known in advance, plus a resume option, it gets complicated fast.

awsread iRobot.txt
Resume from part 114
(114/456 iRobot)
“All right, you son of a hunk of iron ore, if we didn’t make you, who
did?” Cutie nodded gravely. “Very good, Donovan. That was indeed the
next question.  ....
(voice Kendra)
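
A rough sketch of how the part counter and resume could work on top of the pre-split x* files. This is not the actual awsread script: the names read_parts, speak and the .resume state file are assumptions for illustration, and speak only prints the text where the real synth call would go.

```shell
#!/usr/bin/env bash
# Placeholder for the real text-to-speech call (assumption: prints only).
speak() { cat "$1"; }

# read_parts DIR: go through DIR/x* in order, print (part/total) and
# remember the last finished part in DIR/.resume (hypothetical file name).
read_parts() {
    local dir=$1 state="$1/.resume" files=() all part=0 start=1 file
    # Collect the split files first so the total is known in advance.
    for file in "$dir"/x*; do
        if [ -f "$file" ]; then files+=("$file"); fi
    done
    all=${#files[@]}
    # A state file from an earlier run holds the last finished part.
    if [ -f "$state" ]; then
        start=$(( $(<"$state") + 1 ))
        echo "Resume from part $start"
    fi
    for file in "${files[@]}"; do
        part=$((part + 1))
        if [ "$part" -lt "$start" ]; then continue; fi
        echo "($part/$all)"
        speak "$file"
        # Only record the part once it has been read out completely.
        echo "$part" > "$state"
    done
}
```

Because the files are split up front, the total is just the number of files, and resuming only needs one number on disk. Deleting the state file starts again from part 1.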

3000 chars turns out to be too long a block if one wants easily readable text next to the speech on the screen (800-1200 seems to be the sweet spot for me).
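
Since the right block size is a matter of taste, the same pipeline can take the limit as a parameter. A small sketch, assuming GNU sed and GNU split; the function name split_text and its three arguments are made up here. Note that --line-bytes counts bytes, not chars, so with UTF-8 text a file of N bytes never holds more than N chars, but a sentence longer than the limit may get cut in the middle of a multi-byte character.

```shell
#!/usr/bin/env bash
# split_text SRC LIMIT OUTDIR (hypothetical helper): split SRC into files
# OUTDIR/xaa, OUTDIR/xab, ... of at most LIMIT bytes each, breaking after
# sentence punctuation where the sentences fit.
split_text() {
    local src=$1 limit=$2 outdir=$3
    tr '\r\n' '  ' < "$src" |
        sed 's/[.!?]  */&\n/g' |
        sed 's/ \{1,\}/ /g' |
        split --line-bytes="$limit" - "$outdir/x"
}
```

For example, split_text some.txt 1000 "$tmp" for on-screen reading, or 3000 to stay right at the synth limit.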

So I'm gonna call this the final version for now (until the next bug creeps out).

Last edited by brontosaurusrex (2021-05-27 08:21:43)

