Translating LaTeX documents into different languages using DeepL API











up vote
7
down vote

favorite
2












I'm coming from C and C++ and have often troubles using Python's full potential. This is a script I wrote to translate LaTeX documents into different languages. A parser replaces all latex specific syntax with hashes and then sends it all to deepl. The pydeepl wrapper is from Github.



What do you think of this code?



import sys
import re
import pydeepl
from random import choice
from tqdm import tqdm

def make_xlat(*args, **kwds):
adict = dict(*args, **kwds)
rx = re.compile('|'.join(map(re.escape, adict)))
def one_xlat(match):
return adict[match.group(0)]
def xlat(text):
return rx.sub(one_xlat, text)
return xlat

if __name__ == "__main__":
fileInputName = sys.argv[1]
#fileInputName = "Introduction.tex"
fileOutName = fileInputName.split('.')[0]+"_trans.tex"

fileIn = open(fileInputName, "r")
fileOut = open(fileOutName, "w")

fileStr = fileIn.read()

print("Starting hashing...")

#replace commands like begin{*}, end{*}, tabs etc. with hashes
searchObj1 = re.findall( r"\begin{w+}|t| |r|\end{w+}|\usepackage{w+}|\newcommand{w+}|\include{.*}|\input{w+}|\w+[.*}|%.*", fileStr)
#random number for every found command + a prefix which hopefully doens't appear in text. Used to skip lines later, which don't need translation
list1 = ['X#X' + str(choice(range(1111, 9999, 1))) for x in searchObj1]
#make a dictionary out of hashes
d1 = dict(zip(searchObj1,list1))
translate = make_xlat(d1)
hashedText = translate(fileStr)

#replace all latex commands (starting with a backslash) with hashes
searchObj2 = re.findall( r"\w+",hashedText)
#random number + prefix again
list2 = ['X#X' + str(choice(range(1111, 9999, 1))) for x in searchObj2]
#make a dictionary
d2 = dict(zip(searchObj2,list2))
translate = make_xlat(d2)
hashedText = translate(hashedText)
#print(hashedText)
#fileOut.write(translate(hashedText))

d1.update(d2) # combine dictionaries
#with open('hash_dict.json', 'w') as f:
#json.dump(d1, f)

print("Hashing done. Starting translation...")

translated = ''
for line in tqdm(hashedText.splitlines()): #tqdm is a progressbar
#print(line)
if line.startswith("X#X") and len(line) == 7:
translated = translated + line + 'n'
continue
if line == 'n':
translated = translated + 'n'
elif line == '':
translated = translated + 'n'
else:
translated = translated+pydeepl.translate(line, "DE", "EN")+'n'
#translated = translated+pydeepl.translate(hashedText, "DE", "EN")
#print(translated)

d1Inv = {val:key for (key, val) in d1.items()} #swap dictionary
translate2 = make_xlat(d1Inv)
fileStrOut = translate2(translated)
#print(fileStrOut)

fileOut.write(fileStrOut)

print("success")
fileIn.close()
fileOut.close()


EDIT 1: One flaw I already know is that I'm sending sentences by sentences which makes it really slow. However, sending all lines at once produces errors which I couldn't get rid of. Deepl scrambles stuff around and dehashign doesn't work anymore.










share|improve this question









New contributor




Mr.Sh4nnon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
















  • 1




    Instead of sending the whole text in 1 go or line per line, you could try to find a middleground and send chunks of a few lines at a time
    – Maarten Fabré
    Dec 5 at 13:47










  • Thought about that too. The problem is I cant guarantee it will work correctly. It's a tradeoff.
    – Mr.Sh4nnon
    Dec 5 at 13:52















up vote
7
down vote

favorite
2












I'm coming from C and C++ and have often troubles using Python's full potential. This is a script I wrote to translate LaTeX documents into different languages. A parser replaces all latex specific syntax with hashes and then sends it all to deepl. The pydeepl wrapper is from Github.



What do you think of this code?



import sys
import re
import pydeepl
from random import choice
from tqdm import tqdm

def make_xlat(*args, **kwds):
adict = dict(*args, **kwds)
rx = re.compile('|'.join(map(re.escape, adict)))
def one_xlat(match):
return adict[match.group(0)]
def xlat(text):
return rx.sub(one_xlat, text)
return xlat

if __name__ == "__main__":
fileInputName = sys.argv[1]
#fileInputName = "Introduction.tex"
fileOutName = fileInputName.split('.')[0]+"_trans.tex"

fileIn = open(fileInputName, "r")
fileOut = open(fileOutName, "w")

fileStr = fileIn.read()

print("Starting hashing...")

#replace commands like begin{*}, end{*}, tabs etc. with hashes
searchObj1 = re.findall( r"\begin{w+}|t| |r|\end{w+}|\usepackage{w+}|\newcommand{w+}|\include{.*}|\input{w+}|\w+[.*}|%.*", fileStr)
#random number for every found command + a prefix which hopefully doens't appear in text. Used to skip lines later, which don't need translation
list1 = ['X#X' + str(choice(range(1111, 9999, 1))) for x in searchObj1]
#make a dictionary out of hashes
d1 = dict(zip(searchObj1,list1))
translate = make_xlat(d1)
hashedText = translate(fileStr)

#replace all latex commands (starting with a backslash) with hashes
searchObj2 = re.findall( r"\w+",hashedText)
#random number + prefix again
list2 = ['X#X' + str(choice(range(1111, 9999, 1))) for x in searchObj2]
#make a dictionary
d2 = dict(zip(searchObj2,list2))
translate = make_xlat(d2)
hashedText = translate(hashedText)
#print(hashedText)
#fileOut.write(translate(hashedText))

d1.update(d2) # combine dictionaries
#with open('hash_dict.json', 'w') as f:
#json.dump(d1, f)

print("Hashing done. Starting translation...")

translated = ''
for line in tqdm(hashedText.splitlines()): #tqdm is a progressbar
#print(line)
if line.startswith("X#X") and len(line) == 7:
translated = translated + line + 'n'
continue
if line == 'n':
translated = translated + 'n'
elif line == '':
translated = translated + 'n'
else:
translated = translated+pydeepl.translate(line, "DE", "EN")+'n'
#translated = translated+pydeepl.translate(hashedText, "DE", "EN")
#print(translated)

d1Inv = {val:key for (key, val) in d1.items()} #swap dictionary
translate2 = make_xlat(d1Inv)
fileStrOut = translate2(translated)
#print(fileStrOut)

fileOut.write(fileStrOut)

print("success")
fileIn.close()
fileOut.close()


EDIT 1: One flaw I already know is that I'm sending sentences by sentences which makes it really slow. However, sending all lines at once produces errors which I couldn't get rid of. Deepl scrambles stuff around and dehashign doesn't work anymore.










share|improve this question









New contributor




Mr.Sh4nnon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
















  • 1




    Instead of sending the whole text in 1 go or line per line, you could try to find a middleground and send chunks of a few lines at a time
    – Maarten Fabré
    Dec 5 at 13:47










  • Thought about that too. The problem is I cant guarantee it will work correctly. It's a tradeoff.
    – Mr.Sh4nnon
    Dec 5 at 13:52













up vote
7
down vote

favorite
2









up vote
7
down vote

favorite
2






2





I'm coming from C and C++ and have often troubles using Python's full potential. This is a script I wrote to translate LaTeX documents into different languages. A parser replaces all latex specific syntax with hashes and then sends it all to deepl. The pydeepl wrapper is from Github.



What do you think of this code?



import sys
import re
import pydeepl
from random import choice
from tqdm import tqdm

def make_xlat(*args, **kwds):
adict = dict(*args, **kwds)
rx = re.compile('|'.join(map(re.escape, adict)))
def one_xlat(match):
return adict[match.group(0)]
def xlat(text):
return rx.sub(one_xlat, text)
return xlat

if __name__ == "__main__":
fileInputName = sys.argv[1]
#fileInputName = "Introduction.tex"
fileOutName = fileInputName.split('.')[0]+"_trans.tex"

fileIn = open(fileInputName, "r")
fileOut = open(fileOutName, "w")

fileStr = fileIn.read()

print("Starting hashing...")

#replace commands like begin{*}, end{*}, tabs etc. with hashes
searchObj1 = re.findall( r"\begin{w+}|t| |r|\end{w+}|\usepackage{w+}|\newcommand{w+}|\include{.*}|\input{w+}|\w+[.*}|%.*", fileStr)
#random number for every found command + a prefix which hopefully doens't appear in text. Used to skip lines later, which don't need translation
list1 = ['X#X' + str(choice(range(1111, 9999, 1))) for x in searchObj1]
#make a dictionary out of hashes
d1 = dict(zip(searchObj1,list1))
translate = make_xlat(d1)
hashedText = translate(fileStr)

#replace all latex commands (starting with a backslash) with hashes
searchObj2 = re.findall( r"\w+",hashedText)
#random number + prefix again
list2 = ['X#X' + str(choice(range(1111, 9999, 1))) for x in searchObj2]
#make a dictionary
d2 = dict(zip(searchObj2,list2))
translate = make_xlat(d2)
hashedText = translate(hashedText)
#print(hashedText)
#fileOut.write(translate(hashedText))

d1.update(d2) # combine dictionaries
#with open('hash_dict.json', 'w') as f:
#json.dump(d1, f)

print("Hashing done. Starting translation...")

translated = ''
for line in tqdm(hashedText.splitlines()): #tqdm is a progressbar
#print(line)
if line.startswith("X#X") and len(line) == 7:
translated = translated + line + 'n'
continue
if line == 'n':
translated = translated + 'n'
elif line == '':
translated = translated + 'n'
else:
translated = translated+pydeepl.translate(line, "DE", "EN")+'n'
#translated = translated+pydeepl.translate(hashedText, "DE", "EN")
#print(translated)

d1Inv = {val:key for (key, val) in d1.items()} #swap dictionary
translate2 = make_xlat(d1Inv)
fileStrOut = translate2(translated)
#print(fileStrOut)

fileOut.write(fileStrOut)

print("success")
fileIn.close()
fileOut.close()


EDIT 1: One flaw I already know is that I'm sending sentences by sentences which makes it really slow. However, sending all lines at once produces errors which I couldn't get rid of. Deepl scrambles stuff around and dehashign doesn't work anymore.










share|improve this question









New contributor




Mr.Sh4nnon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











I'm coming from C and C++ and have often troubles using Python's full potential. This is a script I wrote to translate LaTeX documents into different languages. A parser replaces all latex specific syntax with hashes and then sends it all to deepl. The pydeepl wrapper is from Github.



What do you think of this code?



import sys
import re
import pydeepl
from random import choice
from tqdm import tqdm

def make_xlat(*args, **kwds):
adict = dict(*args, **kwds)
rx = re.compile('|'.join(map(re.escape, adict)))
def one_xlat(match):
return adict[match.group(0)]
def xlat(text):
return rx.sub(one_xlat, text)
return xlat

if __name__ == "__main__":
fileInputName = sys.argv[1]
#fileInputName = "Introduction.tex"
fileOutName = fileInputName.split('.')[0]+"_trans.tex"

fileIn = open(fileInputName, "r")
fileOut = open(fileOutName, "w")

fileStr = fileIn.read()

print("Starting hashing...")

#replace commands like begin{*}, end{*}, tabs etc. with hashes
searchObj1 = re.findall( r"\begin{w+}|t| |r|\end{w+}|\usepackage{w+}|\newcommand{w+}|\include{.*}|\input{w+}|\w+[.*}|%.*", fileStr)
#random number for every found command + a prefix which hopefully doens't appear in text. Used to skip lines later, which don't need translation
list1 = ['X#X' + str(choice(range(1111, 9999, 1))) for x in searchObj1]
#make a dictionary out of hashes
d1 = dict(zip(searchObj1,list1))
translate = make_xlat(d1)
hashedText = translate(fileStr)

#replace all latex commands (starting with a backslash) with hashes
searchObj2 = re.findall( r"\w+",hashedText)
#random number + prefix again
list2 = ['X#X' + str(choice(range(1111, 9999, 1))) for x in searchObj2]
#make a dictionary
d2 = dict(zip(searchObj2,list2))
translate = make_xlat(d2)
hashedText = translate(hashedText)
#print(hashedText)
#fileOut.write(translate(hashedText))

d1.update(d2) # combine dictionaries
#with open('hash_dict.json', 'w') as f:
#json.dump(d1, f)

print("Hashing done. Starting translation...")

translated = ''
for line in tqdm(hashedText.splitlines()): #tqdm is a progressbar
#print(line)
if line.startswith("X#X") and len(line) == 7:
translated = translated + line + 'n'
continue
if line == 'n':
translated = translated + 'n'
elif line == '':
translated = translated + 'n'
else:
translated = translated+pydeepl.translate(line, "DE", "EN")+'n'
#translated = translated+pydeepl.translate(hashedText, "DE", "EN")
#print(translated)

d1Inv = {val:key for (key, val) in d1.items()} #swap dictionary
translate2 = make_xlat(d1Inv)
fileStrOut = translate2(translated)
#print(fileStrOut)

fileOut.write(fileStrOut)

print("success")
fileIn.close()
fileOut.close()


EDIT 1: One flaw I already know is that I'm sending sentences by sentences which makes it really slow. However, sending all lines at once produces errors which I couldn't get rid of. Deepl scrambles stuff around and dehashign doesn't work anymore.







python api tex






share|improve this question









New contributor




Mr.Sh4nnon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











share|improve this question









New contributor




Mr.Sh4nnon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









share|improve this question




share|improve this question








edited Dec 5 at 10:15





















New contributor




Mr.Sh4nnon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









asked Dec 5 at 6:46









Mr.Sh4nnon

385




385




New contributor




Mr.Sh4nnon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





New contributor





Mr.Sh4nnon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






Mr.Sh4nnon is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.








  • 1




    Instead of sending the whole text in 1 go or line per line, you could try to find a middleground and send chunks of a few lines at a time
    – Maarten Fabré
    Dec 5 at 13:47










  • Thought about that too. The problem is I cant guarantee it will work correctly. It's a tradeoff.
    – Mr.Sh4nnon
    Dec 5 at 13:52














  • 1




    Instead of sending the whole text in 1 go or line per line, you could try to find a middleground and send chunks of a few lines at a time
    – Maarten Fabré
    Dec 5 at 13:47










  • Thought about that too. The problem is I cant guarantee it will work correctly. It's a tradeoff.
    – Mr.Sh4nnon
    Dec 5 at 13:52








1




1




Instead of sending the whole text in 1 go or line per line, you could try to find a middleground and send chunks of a few lines at a time
– Maarten Fabré
Dec 5 at 13:47




Instead of sending the whole text in 1 go or line per line, you could try to find a middleground and send chunks of a few lines at a time
– Maarten Fabré
Dec 5 at 13:47












Thought about that too. The problem is I cant guarantee it will work correctly. It's a tradeoff.
– Mr.Sh4nnon
Dec 5 at 13:52




Thought about that too. The problem is I cant guarantee it will work correctly. It's a tradeoff.
– Mr.Sh4nnon
Dec 5 at 13:52










2 Answers
2






active

oldest

votes

















up vote
5
down vote



accepted












  • While using sys.argv is fine for a first version of a program, you should give it a proper CLI. It would be nice if multiple latex files could be passed (since it is common to split a large project into multiple files) and also to specify the source and target languages. For this you can use argparse:



    import argparse

    def parse_args(args=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--from", default="DE", help="Language of the source document(s)")
    parser.add_argument("--to", default="EN", help="Language of the target document")
    parser.add_argument("file", nargs="+", help="Path(s) to the latex file(s)")
    return parser.parse_args(args)

    if __name__ == "__main__":
    args = parse_args()
    print(args.to, args.from, args.file)
    ...


    This even gives you a nice usage message when calling the script with the option -h or --help:



    usage: script_name.py [-h] [--from FROM] [--to TO] file [file ...]

    positional arguments:
    file Path(s) to the latex file(s)

    optional arguments:
    -h, --help show this help message and exit
    --from FROM Language of the source document(s)
    --to TO Language of the target document



  • Now, let's get to your "hashing". When randomly selecting from $H = 9999 - 1111 = 8888$ values, it takes on average $Q(H) = sqrt{tfrac{pi}{2}H}approx 118.16$ draws until you get a number twice. In other words, if you have more than about 120 elements, you will overwrite some. And that number is actually less than half of that because you separately hash begin and end.



    One way to avoid collisions is to just keep on counting up. For this you could use itertools.count, which produces an infinite stream of increasing numbers, starting with the argument (or 0 if not given), when being iterated:



    from itertools import count

    counter = count(1111)
    assert len(searchObj1) + len(searchObj2) <= 8888, "Too many objects to hash"
    list1 = ['X#X{}'.format(i) for _, i in zip(searchObj1, counter)]
    list2 = ['X#X{}'.format(i) for _, i in zip(searchObj2, counter)]


    Another way to make it less likely is to extend the range. And at that point you might as well use the built-in hash function:



    list1 = ['X#X{}'.format(hash(x) for x in searchObj1]


    In this case you would need to relax the and len(line) == 7 requirement later.




  • When opening files, you should use the with keyword to ensure they are properly closed, even if an exception occurs somewhere in the block:



    with open(fileInputName) as fileIn, open(fileOutName, "w") as fileOut:
    ...


  • You should also think about better names. Python has an official style-guide, PEP8, which recommend using lower_case instead of camelCase. In addition, list1, searchObj1, d2 are all not very good names.



  • Doing string addition can be very costly, because in Python strings are immutable. This means that when you do str_a + str_b, Python has to allocate a new string object of length len(str_a) + len(str_b) and then copy the content of both strings into that new array. This takes more time the longer the strings are and the more often you do it. Since you are adding strings of the length of a full document and are doing so in a loop, this can be very slow.



    Instead, build a list and str.join it at the end:



    translated = 
    for line in fileIn:
    #print(line)
    if line.startswith("X#X") and len(line) == 7:
    translated.append(line)
    elif not line.strip():
    translated.append('')
    else:
    translated.append(pydeepl.translate(line, args.from, args.to))
    translated = 'n'.join(translated)


    Note that I directly iterate over the file (since files are iterable), meaning that this program is less limited by the amount of memory available. However this means that the progress bar does not work anymore (and I therefore removed it here), because it needs to know how many lines there are. You can use something like this to add it back: https://blog.nelsonliu.me/2016/07/29/progress-bars-for-python-file-reading-with-tqdm/








share|improve this answer





















  • Thank you very much for those suggestions. I'll definitely implement all of them.
    – Mr.Sh4nnon
    Dec 5 at 11:07


















up vote
3
down vote













naming



both the naming convention is wrong. PEP-008 standardizes on snake_case for variables and functions, as the naming itself is unclear. What is rx? It is a repattern. Then call it so. adict is indeed a dict, but it contains the hashes, so also call it like that.



map



map is a useful tool, but not a lot of people use it or know it. Especially since the advent of generator expressions, it is also hardly ever needed to use. I find a generator expression a lot simpler to read:



def make_xlat(hashes: dict) -> typing.Callable[[str], str]:
pattern = re.compile("|".join(re.escape(key) for key in hashes))
def one_xlat(match):
return hashes[match.group(0)]

def xlat(text: str):
return pattern.sub(one_xlat, text)

return xlat


the hashes



You use a convoluted way str(choice(range(1111, 9999, 1))) to generate a random number. This will lead to collisions. The easiest way to tackle this, is to use a generator, which keeps what numbers are given out already.



def get_random_ids(begin=0, end=9999):
yield from random.sample(range(begin, end + 1), end - begin + 1)


The downside is this materializes a list with all the elements. For the 10000 numbers, this is still okay, but if you need a number with 10 digits, this starts to count. An alternative here would be something like this:



def get_random_ids(begin=0, end=9999):
seen = set()
while True:
number = random.randint(begin, end)
if number in seen:
continue
seen.add(number)
yield number


But this only helps if you need a limited number of id's with a long length



To help you with testing, it helps if you can supply a seed to put the pseudo-random generator in the same state every time you test a particular piece



def get_random_ids(begin=0, end=9999, seed=None, ):
"""
generates unique random integers between `begin` and `end`

The random generator can be seeded with `seed`
"""
if seed is not None:
random.seed(seed)
yield from random.sample(range(begin, end + 1), end - begin + 1)


pathlib.Path



If instead of with the bare filenames, you convert them into a Path, reading and writing becomes a lot easier:



input_file = Path("latex_sample.tex")
input_text = input_file.read_text()


then to output to the hashed text, you can use with_suffix or with_name



hashed_file = input_file.with_suffix(".hash.tex")
hashed_file.write_text(hashed_text)


regex pattern



you have a very long regex pattern. In your make_xlat function you assemble one on the fly with '|'.join. You can do that here as well



commands = (
r"\begin{w+}",
r"t",
" ",
"r",
r"\end{w+}",
r"\usepackage{w+}",
r"\newcommand{w+}",
r"\include{.*}",
r"\input{w+}",
r"\w+[.*}",
r"%.*",
r"\w+",
)
search_pattern = re.compile("|".join(commands))


This way you can easily add commands or add comments for the more obscure commands



the translation



You go over the hashed text line per line, depending on some condition translate, and then do a lot of sting concatenation



This can be done a lot simpler with a generator:



def translate(text: str, lang_in="DE", lang_out="EN"):
hash_pattern = re.compile(r"^X#Xd{4}$")
for line in text.splitlines():
if line in {"", "n"} or hash_pattern.match(line):
yield line
else:
yield pydeepl.translate(line, lang_in, lang_out)
translation = "n".join(translate(hashed_text, lang_in="DE", lang_out="EN"))


Instead of checking startswith and len, I used a regular expression



split the work



This code:




  • reads a texts

  • hashes the commands

  • translates the text

  • dehashes the commands

  • writes to an output file

  • save the hashes to a file


You only split of a part of the hashing of the commands. It is simpler, clearer and easier to test if you also split of the rest



def hash_commands(
input_text: str, random_seed=None
) -> typing.Tuple[str, typing.Mapping]:
commands = (
r"\begin{w+}",
r"t",
" ",
"r",
r"\end{w+}",
r"\usepackage{w+}",
r"\newcommand{w+}",
r"\include{.*}",
r"\input{w+}",
r"\w+[.*}",
r"%.*",
r"\w+",
)
search_pattern = re.compile("|".join(commands))
ids = get_random_ids(seed=random_seed)

matches = search_pattern.findall(input_text)
hashes = {
command: f"X#X{id:04}"
for command, id in zip(matches, ids)
}
translate = make_xlat(hashes)
hashed_text = translate(input_text)
return hashed_text, hashes


takes a text, and hashes the commands. A way to generalize this would be to make the commands tuple an argument, or split of the generation of the hashesdict to a separate function, and use hashes as an argument to the hash_commands function.



def dehash_text(hashed_text: str, hashes: typing.Mapping):
hashes_inverse = {hash: command for command, hash in hashes.items()}
translate = make_xlat(hashes_inverse)
return translate(hashed_text)


does the inverse.



def save_hashes(hashes, out_file):
hashes_inverse = {hash: command for command, hash in hashes.items()}
json.dump(
{"hashes": hashes, "inverse": hashes_inverse}, out_file, indent=4
)


and then the main logic become very simple and easy to understand



if __name__ == "__main__":
input_file = Path("latex_sample.tex")
input_text = input_file.read_text()
hashed_text, hashes = hash_commands(input_text=input_text, random_seed=42)

hash_file = Path("hash_dict.json")
with hash_file.open("w") as hash_filehandle:
save_hashes(hashes, hash_filehandle)

hashed_file = input_file.with_suffix(".hash.tex")
hashed_file.write_text(hashed_text)

translation = "n".join(translate(hashed_text, lang_in="DE", lang_out="EN"))
translation_dehashed = dehash_text(translation, hashes)
# print(translation_dehashed)
output_file = input_file.with_suffix(".trans.tex")
output_file.write_text(translation_dehashed)


I included an output of the hashed text, so you can see whether the function correctly recognises the commands. That way you see it doesn't recognise the usepackage{tree-dvips}






share|improve this answer























  • Thanks! learned a lot while reading this. Will add this this evening!
    – Mr.Sh4nnon
    Dec 5 at 13:53











Your Answer





StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["\$", "\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "196"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});






Mr.Sh4nnon is a new contributor. Be nice, and check out our Code of Conduct.










draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f209049%2ftranslating-latex-documents-into-different-languages-using-deepl-api%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
5
down vote



accepted












  • While using sys.argv is fine for a first version of a program, you should give it a proper CLI. It would be nice if multiple latex files could be passed (since it is common to split a large project into multiple files) and also to specify the source and target languages. For this you can use argparse:



    import argparse

    def parse_args(args=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--from", default="DE", help="Language of the source document(s)")
    parser.add_argument("--to", default="EN", help="Language of the target document")
    parser.add_argument("file", nargs="+", help="Path(s) to the latex file(s)")
    return parser.parse_args(args)

    if __name__ == "__main__":
    args = parse_args()
    print(args.to, args.from, args.file)
    ...


    This even gives you a nice usage message when calling the script with the option -h or --help:



    usage: script_name.py [-h] [--from FROM] [--to TO] file [file ...]

    positional arguments:
    file Path(s) to the latex file(s)

    optional arguments:
    -h, --help show this help message and exit
    --from FROM Language of the source document(s)
    --to TO Language of the target document



  • Now, let's get to your "hashing". When randomly selecting from $H = 9999 - 1111 = 8888$ values, it takes on average $Q(H) = sqrt{tfrac{pi}{2}H}approx 118.16$ draws until you get a number twice. In other words, if you have more than about 120 elements, you will overwrite some. And that number is actually less than half of that because you separately hash begin and end.



    One way to avoid collisions is to just keep on counting up. For this you could use itertools.count, which produces an infinite stream of increasing numbers, starting with the argument (or 0 if not given), when being iterated:



    from itertools import count

    counter = count(1111)
    assert len(searchObj1) + len(searchObj2) <= 8888, "Too many objects to hash"
    list1 = ['X#X{}'.format(i) for _, i in zip(searchObj1, counter)]
    list2 = ['X#X{}'.format(i) for _, i in zip(searchObj2, counter)]


    Another way to make it less likely is to extend the range. And at that point you might as well use the built-in hash function:



    list1 = ['X#X{}'.format(hash(x) for x in searchObj1]


    In this case you would need to relax the and len(line) == 7 requirement later.




  • When opening files, you should use the with keyword to ensure they are properly closed, even if an exception occurs somewhere in the block:



    with open(fileInputName) as fileIn, open(fileOutName, "w") as fileOut:
    ...


  • You should also think about better names. Python has an official style-guide, PEP8, which recommend using lower_case instead of camelCase. In addition, list1, searchObj1, d2 are all not very good names.



  • Doing string addition can be very costly, because in Python strings are immutable. This means that when you do str_a + str_b, Python has to allocate a new string object of length len(str_a) + len(str_b) and then copy the content of both strings into that new array. This takes more time the longer the strings are and the more often you do it. Since you are adding strings of the length of a full document and are doing so in a loop, this can be very slow.



    Instead, build a list and str.join it at the end:



    translated = 
    for line in fileIn:
    #print(line)
    if line.startswith("X#X") and len(line) == 7:
    translated.append(line)
    elif not line.strip():
    translated.append('')
    else:
    translated.append(pydeepl.translate(line, args.from, args.to))
    translated = 'n'.join(translated)


    Note that I directly iterate over the file (since files are iterable), meaning that this program is less limited by the amount of memory available. However this means that the progress bar does not work anymore (and I therefore removed it here), because it needs to know how many lines there are. You can use something like this to add it back: https://blog.nelsonliu.me/2016/07/29/progress-bars-for-python-file-reading-with-tqdm/








share|improve this answer





















  • Thank you very much for those suggestions. I'll definitely implement all of them.
    – Mr.Sh4nnon
    Dec 5 at 11:07















up vote
5
down vote



accepted












  • While using sys.argv is fine for a first version of a program, you should give it a proper CLI. It would be nice if multiple latex files could be passed (since it is common to split a large project into multiple files) and also to specify the source and target languages. For this you can use argparse:



    import argparse

    def parse_args(args=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--from", default="DE", help="Language of the source document(s)")
    parser.add_argument("--to", default="EN", help="Language of the target document")
    parser.add_argument("file", nargs="+", help="Path(s) to the latex file(s)")
    return parser.parse_args(args)

    if __name__ == "__main__":
    args = parse_args()
    print(args.to, args.from, args.file)
    ...


    This even gives you a nice usage message when calling the script with the option -h or --help:



    usage: script_name.py [-h] [--from FROM] [--to TO] file [file ...]

    positional arguments:
    file Path(s) to the latex file(s)

    optional arguments:
    -h, --help show this help message and exit
    --from FROM Language of the source document(s)
    --to TO Language of the target document



  • Now, let's get to your "hashing". When randomly selecting from $H = 9999 - 1111 = 8888$ values, it takes on average $Q(H) = sqrt{tfrac{pi}{2}H}approx 118.16$ draws until you get a number twice. In other words, if you have more than about 120 elements, you will overwrite some. And that number is actually less than half of that because you separately hash begin and end.



    One way to avoid collisions is to just keep on counting up. For this you could use itertools.count, which produces an infinite stream of increasing numbers, starting with the argument (or 0 if not given), when being iterated:



    from itertools import count

    counter = count(1111)
    assert len(searchObj1) + len(searchObj2) <= 8888, "Too many objects to hash"
    list1 = ['X#X{}'.format(i) for _, i in zip(searchObj1, counter)]
    list2 = ['X#X{}'.format(i) for _, i in zip(searchObj2, counter)]


    Another way to make it less likely is to extend the range. And at that point you might as well use the built-in hash function:



    list1 = ['X#X{}'.format(hash(x) for x in searchObj1]


    In this case you would need to relax the and len(line) == 7 requirement later.




  • When opening files, you should use the with keyword to ensure they are properly closed, even if an exception occurs somewhere in the block:



    with open(fileInputName) as fileIn, open(fileOutName, "w") as fileOut:
    ...


  • You should also think about better names. Python has an official style-guide, PEP8, which recommend using lower_case instead of camelCase. In addition, list1, searchObj1, d2 are all not very good names.



  • Doing string addition can be very costly, because in Python strings are immutable. This means that when you do str_a + str_b, Python has to allocate a new string object of length len(str_a) + len(str_b) and then copy the content of both strings into that new array. This takes more time the longer the strings are and the more often you do it. Since you are adding strings of the length of a full document and are doing so in a loop, this can be very slow.



    Instead, build a list and str.join it at the end:



    translated = 
    for line in fileIn:
    #print(line)
    if line.startswith("X#X") and len(line) == 7:
    translated.append(line)
    elif not line.strip():
    translated.append('')
    else:
    translated.append(pydeepl.translate(line, args.from, args.to))
    translated = 'n'.join(translated)


    Note that I directly iterate over the file (since files are iterable), meaning that this program is less limited by the amount of memory available. However this means that the progress bar does not work anymore (and I therefore removed it here), because it needs to know how many lines there are. You can use something like this to add it back: https://blog.nelsonliu.me/2016/07/29/progress-bars-for-python-file-reading-with-tqdm/








share|improve this answer





















  • Thank you very much for those suggestions. I'll definitely implement all of them.
    – Mr.Sh4nnon
    Dec 5 at 11:07













up vote
5
down vote



accepted







up vote
5
down vote



accepted








  • While using sys.argv is fine for a first version of a program, you should give it a proper CLI. It would be nice if multiple latex files could be passed (since it is common to split a large project into multiple files) and also to specify the source and target languages. For this you can use argparse:



    import argparse

    def parse_args(args=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--from", default="DE", help="Language of the source document(s)")
    parser.add_argument("--to", default="EN", help="Language of the target document")
    parser.add_argument("file", nargs="+", help="Path(s) to the latex file(s)")
    return parser.parse_args(args)

    if __name__ == "__main__":
    args = parse_args()
    print(args.to, args.from, args.file)
    ...


    This even gives you a nice usage message when calling the script with the option -h or --help:



    usage: script_name.py [-h] [--from FROM] [--to TO] file [file ...]

    positional arguments:
    file Path(s) to the latex file(s)

    optional arguments:
    -h, --help show this help message and exit
    --from FROM Language of the source document(s)
    --to TO Language of the target document



  • Now, let's get to your "hashing". When randomly selecting from $H = 9999 - 1111 = 8888$ values, it takes on average $Q(H) = sqrt{tfrac{pi}{2}H}approx 118.16$ draws until you get a number twice. In other words, if you have more than about 120 elements, you will overwrite some. And that number is actually less than half of that because you separately hash begin and end.



    One way to avoid collisions is to just keep on counting up. For this you could use itertools.count, which produces an infinite stream of increasing numbers, starting with the argument (or 0 if not given), when being iterated:



    from itertools import count

    counter = count(1111)
    assert len(searchObj1) + len(searchObj2) <= 8888, "Too many objects to hash"
    list1 = ['X#X{}'.format(i) for _, i in zip(searchObj1, counter)]
    list2 = ['X#X{}'.format(i) for _, i in zip(searchObj2, counter)]


    Another way to make it less likely is to extend the range. And at that point you might as well use the built-in hash function:



    list1 = ['X#X{}'.format(hash(x) for x in searchObj1]


    In this case you would need to relax the and len(line) == 7 requirement later.




  • When opening files, you should use the with keyword to ensure they are properly closed, even if an exception occurs somewhere in the block:



    with open(fileInputName) as fileIn, open(fileOutName, "w") as fileOut:
    ...


  • You should also think about better names. Python has an official style-guide, PEP8, which recommend using lower_case instead of camelCase. In addition, list1, searchObj1, d2 are all not very good names.



  • Doing string addition can be very costly, because in Python strings are immutable. This means that when you do str_a + str_b, Python has to allocate a new string object of length len(str_a) + len(str_b) and then copy the content of both strings into that new array. This takes more time the longer the strings are and the more often you do it. Since you are adding strings of the length of a full document and are doing so in a loop, this can be very slow.



    Instead, build a list and str.join it at the end:



    translated = 
    for line in fileIn:
    #print(line)
    if line.startswith("X#X") and len(line) == 7:
    translated.append(line)
    elif not line.strip():
    translated.append('')
    else:
    translated.append(pydeepl.translate(line, args.from, args.to))
    translated = 'n'.join(translated)


    Note that I directly iterate over the file (since files are iterable), meaning that this program is less limited by the amount of memory available. However this means that the progress bar does not work anymore (and I therefore removed it here), because it needs to know how many lines there are. You can use something like this to add it back: https://blog.nelsonliu.me/2016/07/29/progress-bars-for-python-file-reading-with-tqdm/








share|improve this answer














  • While using sys.argv is fine for a first version of a program, you should give it a proper CLI. It would be nice if multiple latex files could be passed (since it is common to split a large project into multiple files) and also to specify the source and target languages. For this you can use argparse:



    import argparse

    def parse_args(args=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--from", default="DE", help="Language of the source document(s)")
    parser.add_argument("--to", default="EN", help="Language of the target document")
    parser.add_argument("file", nargs="+", help="Path(s) to the latex file(s)")
    return parser.parse_args(args)

    if __name__ == "__main__":
    args = parse_args()
    print(args.to, args.from, args.file)
    ...


    This even gives you a nice usage message when calling the script with the option -h or --help:



    usage: script_name.py [-h] [--from FROM] [--to TO] file [file ...]

    positional arguments:
    file Path(s) to the latex file(s)

    optional arguments:
    -h, --help show this help message and exit
    --from FROM Language of the source document(s)
    --to TO Language of the target document



  • Now, let's get to your "hashing". When randomly selecting from $H = 9999 - 1111 = 8888$ values, it takes on average $Q(H) = sqrt{tfrac{pi}{2}H}approx 118.16$ draws until you get a number twice. In other words, if you have more than about 120 elements, you will overwrite some. And that number is actually less than half of that because you separately hash begin and end.



    One way to avoid collisions is to just keep on counting up. For this you could use itertools.count, which produces an infinite stream of increasing numbers, starting with the argument (or 0 if not given), when being iterated:



    from itertools import count

    counter = count(1111)
    assert len(searchObj1) + len(searchObj2) <= 8888, "Too many objects to hash"
    list1 = ['X#X{}'.format(i) for _, i in zip(searchObj1, counter)]
    list2 = ['X#X{}'.format(i) for _, i in zip(searchObj2, counter)]


    Another way to make it less likely is to extend the range. And at that point you might as well use the built-in hash function:



    list1 = ['X#X{}'.format(hash(x) for x in searchObj1]


    In this case you would need to relax the and len(line) == 7 requirement later.




  • When opening files, you should use the with keyword to ensure they are properly closed, even if an exception occurs somewhere in the block:



    with open(fileInputName) as fileIn, open(fileOutName, "w") as fileOut:
    ...


  • You should also think about better names. Python has an official style-guide, PEP8, which recommend using lower_case instead of camelCase. In addition, list1, searchObj1, d2 are all not very good names.



  • Doing string addition can be very costly, because in Python strings are immutable. This means that when you do str_a + str_b, Python has to allocate a new string object of length len(str_a) + len(str_b) and then copy the content of both strings into that new array. This takes more time the longer the strings are and the more often you do it. Since you are adding strings of the length of a full document and are doing so in a loop, this can be very slow.



    Instead, build a list and str.join it at the end:



    translated = 
    for line in fileIn:
    #print(line)
    if line.startswith("X#X") and len(line) == 7:
    translated.append(line)
    elif not line.strip():
    translated.append('')
    else:
    translated.append(pydeepl.translate(line, args.from, args.to))
    translated = 'n'.join(translated)


    Note that I directly iterate over the file (since files are iterable), meaning that this program is less limited by the amount of memory available. However this means that the progress bar does not work anymore (and I therefore removed it here), because it needs to know how many lines there are. You can use something like this to add it back: https://blog.nelsonliu.me/2016/07/29/progress-bars-for-python-file-reading-with-tqdm/









share|improve this answer












share|improve this answer



share|improve this answer










answered Dec 5 at 10:52









Graipher

23k53384




23k53384












  • Thank you very much for those suggestions. I'll definitely implement all of them.
    – Mr.Sh4nnon
    Dec 5 at 11:07


















  • Thank you very much for those suggestions. I'll definitely implement all of them.
    – Mr.Sh4nnon
    Dec 5 at 11:07
















Thank you very much for those suggestions. I'll definitely implement all of them.
– Mr.Sh4nnon
Dec 5 at 11:07




Thank you very much for those suggestions. I'll definitely implement all of them.
– Mr.Sh4nnon
Dec 5 at 11:07












up vote
3
down vote













naming



both the naming convention is wrong. PEP-008 standardizes on snake_case for variables and functions, as the naming itself is unclear. What is rx? It is a repattern. Then call it so. adict is indeed a dict, but it contains the hashes, so also call it like that.



map



map is a useful tool, but not a lot of people use it or know it. Especially since the advent of generator expressions, it is also hardly ever needed to use. I find a generator expression a lot simpler to read:



def make_xlat(hashes: dict) -> typing.Callable[[str], str]:
pattern = re.compile("|".join(re.escape(key) for key in hashes))
def one_xlat(match):
return hashes[match.group(0)]

def xlat(text: str):
return pattern.sub(one_xlat, text)

return xlat


the hashes



You use a convoluted way str(choice(range(1111, 9999, 1))) to generate a random number. This will lead to collisions. The easiest way to tackle this, is to use a generator, which keeps what numbers are given out already.



def get_random_ids(begin=0, end=9999):
yield from random.sample(range(begin, end + 1), end - begin + 1)


The downside is this materializes a list with all the elements. For the 10000 numbers, this is still okay, but if you need a number with 10 digits, this starts to count. An alternative here would be something like this:



def get_random_ids(begin=0, end=9999):
seen = set()
while True:
number = random.randint(begin, end)
if number in seen:
continue
seen.add(number)
yield number


But this only helps if you need a limited number of id's with a long length



To help you with testing, it helps if you can supply a seed to put the pseudo-random generator in the same state every time you test a particular piece



def get_random_ids(begin=0, end=9999, seed=None, ):
"""
generates unique random integers between `begin` and `end`

The random generator can be seeded with `seed`
"""
if seed is not None:
random.seed(seed)
yield from random.sample(range(begin, end + 1), end - begin + 1)


pathlib.Path



If instead of with the bare filenames, you convert them into a Path, reading and writing becomes a lot easier:



input_file = Path("latex_sample.tex")
input_text = input_file.read_text()


then to output to the hashed text, you can use with_suffix or with_name



hashed_file = input_file.with_suffix(".hash.tex")
hashed_file.write_text(hashed_text)


regex pattern



you have a very long regex pattern. In your make_xlat function you assemble one on the fly with '|'.join. You can do that here as well



commands = (
r"\begin{w+}",
r"t",
" ",
"r",
r"\end{w+}",
r"\usepackage{w+}",
r"\newcommand{w+}",
r"\include{.*}",
r"\input{w+}",
r"\w+[.*}",
r"%.*",
r"\w+",
)
search_pattern = re.compile("|".join(commands))


This way you can easily add commands or add comments for the more obscure commands



the translation



You go over the hashed text line per line, depending on some condition translate, and then do a lot of sting concatenation



This can be done a lot simpler with a generator:



def translate(text: str, lang_in="DE", lang_out="EN"):
hash_pattern = re.compile(r"^X#Xd{4}$")
for line in text.splitlines():
if line in {"", "n"} or hash_pattern.match(line):
yield line
else:
yield pydeepl.translate(line, lang_in, lang_out)
translation = "n".join(translate(hashed_text, lang_in="DE", lang_out="EN"))


Instead of checking startswith and len, I used a regular expression



split the work



This code:




  • reads a texts

  • hashes the commands

  • translates the text

  • dehashes the commands

  • writes to an output file

  • save the hashes to a file


You only split of a part of the hashing of the commands. It is simpler, clearer and easier to test if you also split of the rest



def hash_commands(
input_text: str, random_seed=None
) -> typing.Tuple[str, typing.Mapping]:
commands = (
r"\begin{w+}",
r"t",
" ",
"r",
r"\end{w+}",
r"\usepackage{w+}",
r"\newcommand{w+}",
r"\include{.*}",
r"\input{w+}",
r"\w+[.*}",
r"%.*",
r"\w+",
)
search_pattern = re.compile("|".join(commands))
ids = get_random_ids(seed=random_seed)

matches = search_pattern.findall(input_text)
hashes = {
command: f"X#X{id:04}"
for command, id in zip(matches, ids)
}
translate = make_xlat(hashes)
hashed_text = translate(input_text)
return hashed_text, hashes


takes a text, and hashes the commands. A way to generalize this would be to make the commands tuple an argument, or split of the generation of the hashesdict to a separate function, and use hashes as an argument to the hash_commands function.



def dehash_text(hashed_text: str, hashes: typing.Mapping):
hashes_inverse = {hash: command for command, hash in hashes.items()}
translate = make_xlat(hashes_inverse)
return translate(hashed_text)


does the inverse.



def save_hashes(hashes, out_file):
hashes_inverse = {hash: command for command, hash in hashes.items()}
json.dump(
{"hashes": hashes, "inverse": hashes_inverse}, out_file, indent=4
)


and then the main logic become very simple and easy to understand



if __name__ == "__main__":
input_file = Path("latex_sample.tex")
input_text = input_file.read_text()
hashed_text, hashes = hash_commands(input_text=input_text, random_seed=42)

hash_file = Path("hash_dict.json")
with hash_file.open("w") as hash_filehandle:
save_hashes(hashes, hash_filehandle)

hashed_file = input_file.with_suffix(".hash.tex")
hashed_file.write_text(hashed_text)

translation = "n".join(translate(hashed_text, lang_in="DE", lang_out="EN"))
translation_dehashed = dehash_text(translation, hashes)
# print(translation_dehashed)
output_file = input_file.with_suffix(".trans.tex")
output_file.write_text(translation_dehashed)


I included an output of the hashed text, so you can see whether the function correctly recognises the commands. That way you see it doesn't recognise the usepackage{tree-dvips}






share|improve this answer























  • Thanks! learned a lot while reading this. Will add this this evening!
    – Mr.Sh4nnon
    Dec 5 at 13:53















up vote
3
down vote













naming



both the naming convention is wrong. PEP-008 standardizes on snake_case for variables and functions, as the naming itself is unclear. What is rx? It is a repattern. Then call it so. adict is indeed a dict, but it contains the hashes, so also call it like that.



map



map is a useful tool, but not a lot of people use it or know it. Especially since the advent of generator expressions, it is also hardly ever needed to use. I find a generator expression a lot simpler to read:



def make_xlat(hashes: dict) -> typing.Callable[[str], str]:
pattern = re.compile("|".join(re.escape(key) for key in hashes))
def one_xlat(match):
return hashes[match.group(0)]

def xlat(text: str):
return pattern.sub(one_xlat, text)

return xlat


the hashes



You use a convoluted way str(choice(range(1111, 9999, 1))) to generate a random number. This will lead to collisions. The easiest way to tackle this, is to use a generator, which keeps what numbers are given out already.



def get_random_ids(begin=0, end=9999):
yield from random.sample(range(begin, end + 1), end - begin + 1)


The downside is this materializes a list with all the elements. For the 10000 numbers, this is still okay, but if you need a number with 10 digits, this starts to count. An alternative here would be something like this:



def get_random_ids(begin=0, end=9999):
seen = set()
while True:
number = random.randint(begin, end)
if number in seen:
continue
seen.add(number)
yield number


But this only helps if you need a limited number of id's with a long length



To help you with testing, it helps if you can supply a seed to put the pseudo-random generator in the same state every time you test a particular piece



def get_random_ids(begin=0, end=9999, seed=None, ):
"""
generates unique random integers between `begin` and `end`

The random generator can be seeded with `seed`
"""
if seed is not None:
random.seed(seed)
yield from random.sample(range(begin, end + 1), end - begin + 1)


pathlib.Path



If instead of with the bare filenames, you convert them into a Path, reading and writing becomes a lot easier:



input_file = Path("latex_sample.tex")
input_text = input_file.read_text()


then to output to the hashed text, you can use with_suffix or with_name



hashed_file = input_file.with_suffix(".hash.tex")
hashed_file.write_text(hashed_text)


regex pattern



you have a very long regex pattern. In your make_xlat function you assemble one on the fly with '|'.join. You can do that here as well



commands = (
r"\begin{w+}",
r"t",
" ",
"r",
r"\end{w+}",
r"\usepackage{w+}",
r"\newcommand{w+}",
r"\include{.*}",
r"\input{w+}",
r"\w+[.*}",
r"%.*",
r"\w+",
)
search_pattern = re.compile("|".join(commands))


This way you can easily add commands or add comments for the more obscure commands



the translation



You go over the hashed text line per line, depending on some condition translate, and then do a lot of sting concatenation



This can be done a lot simpler with a generator:



def translate(text: str, lang_in="DE", lang_out="EN"):
hash_pattern = re.compile(r"^X#Xd{4}$")
for line in text.splitlines():
if line in {"", "n"} or hash_pattern.match(line):
yield line
else:
yield pydeepl.translate(line, lang_in, lang_out)
translation = "n".join(translate(hashed_text, lang_in="DE", lang_out="EN"))


Instead of checking startswith and len, I used a regular expression



split the work



This code:




  • reads a texts

  • hashes the commands

  • translates the text

  • dehashes the commands

  • writes to an output file

  • save the hashes to a file


You only split of a part of the hashing of the commands. It is simpler, clearer and easier to test if you also split of the rest



def hash_commands(
input_text: str, random_seed=None
) -> typing.Tuple[str, typing.Mapping]:
commands = (
r"\begin{w+}",
r"t",
" ",
"r",
r"\end{w+}",
r"\usepackage{w+}",
r"\newcommand{w+}",
r"\include{.*}",
r"\input{w+}",
r"\w+[.*}",
r"%.*",
r"\w+",
)
search_pattern = re.compile("|".join(commands))
ids = get_random_ids(seed=random_seed)

matches = search_pattern.findall(input_text)
hashes = {
command: f"X#X{id:04}"
for command, id in zip(matches, ids)
}
translate = make_xlat(hashes)
hashed_text = translate(input_text)
return hashed_text, hashes


takes a text, and hashes the commands. A way to generalize this would be to make the commands tuple an argument, or split of the generation of the hashesdict to a separate function, and use hashes as an argument to the hash_commands function.



def dehash_text(hashed_text: str, hashes: typing.Mapping):
hashes_inverse = {hash: command for command, hash in hashes.items()}
translate = make_xlat(hashes_inverse)
return translate(hashed_text)


does the inverse.



def save_hashes(hashes, out_file):
hashes_inverse = {hash: command for command, hash in hashes.items()}
json.dump(
{"hashes": hashes, "inverse": hashes_inverse}, out_file, indent=4
)


and then the main logic become very simple and easy to understand



if __name__ == "__main__":
input_file = Path("latex_sample.tex")
input_text = input_file.read_text()
hashed_text, hashes = hash_commands(input_text=input_text, random_seed=42)

hash_file = Path("hash_dict.json")
with hash_file.open("w") as hash_filehandle:
save_hashes(hashes, hash_filehandle)

hashed_file = input_file.with_suffix(".hash.tex")
hashed_file.write_text(hashed_text)

translation = "n".join(translate(hashed_text, lang_in="DE", lang_out="EN"))
translation_dehashed = dehash_text(translation, hashes)
# print(translation_dehashed)
output_file = input_file.with_suffix(".trans.tex")
output_file.write_text(translation_dehashed)


I included an output of the hashed text, so you can see whether the function correctly recognises the commands. That way you see it doesn't recognise the usepackage{tree-dvips}






share|improve this answer























  • Thanks! learned a lot while reading this. Will add this this evening!
    – Mr.Sh4nnon
    Dec 5 at 13:53













up vote
3
down vote










up vote
3
down vote









naming



both the naming convention is wrong. PEP-008 standardizes on snake_case for variables and functions, as the naming itself is unclear. What is rx? It is a repattern. Then call it so. adict is indeed a dict, but it contains the hashes, so also call it like that.



map



map is a useful tool, but not a lot of people use it or know it. Especially since the advent of generator expressions, it is also hardly ever needed to use. I find a generator expression a lot simpler to read:



def make_xlat(hashes: dict) -> typing.Callable[[str], str]:
pattern = re.compile("|".join(re.escape(key) for key in hashes))
def one_xlat(match):
return hashes[match.group(0)]

def xlat(text: str):
return pattern.sub(one_xlat, text)

return xlat


the hashes



You use a convoluted way str(choice(range(1111, 9999, 1))) to generate a random number. This will lead to collisions. The easiest way to tackle this, is to use a generator, which keeps what numbers are given out already.



def get_random_ids(begin=0, end=9999):
yield from random.sample(range(begin, end + 1), end - begin + 1)


The downside is this materializes a list with all the elements. For the 10000 numbers, this is still okay, but if you need a number with 10 digits, this starts to count. An alternative here would be something like this:



def get_random_ids(begin=0, end=9999):
seen = set()
while True:
number = random.randint(begin, end)
if number in seen:
continue
seen.add(number)
yield number


But this only helps if you need a limited number of id's with a long length



To help you with testing, it helps if you can supply a seed to put the pseudo-random generator in the same state every time you test a particular piece



def get_random_ids(begin=0, end=9999, seed=None, ):
"""
generates unique random integers between `begin` and `end`

The random generator can be seeded with `seed`
"""
if seed is not None:
random.seed(seed)
yield from random.sample(range(begin, end + 1), end - begin + 1)


pathlib.Path



If instead of with the bare filenames, you convert them into a Path, reading and writing becomes a lot easier:



input_file = Path("latex_sample.tex")
input_text = input_file.read_text()


then to output to the hashed text, you can use with_suffix or with_name



hashed_file = input_file.with_suffix(".hash.tex")
hashed_file.write_text(hashed_text)


regex pattern



you have a very long regex pattern. In your make_xlat function you assemble one on the fly with '|'.join. You can do that here as well



commands = (
r"\begin{w+}",
r"t",
" ",
"r",
r"\end{w+}",
r"\usepackage{w+}",
r"\newcommand{w+}",
r"\include{.*}",
r"\input{w+}",
r"\w+[.*}",
r"%.*",
r"\w+",
)
search_pattern = re.compile("|".join(commands))


This way you can easily add commands or add comments for the more obscure commands



the translation



You go over the hashed text line per line, depending on some condition translate, and then do a lot of sting concatenation



This can be done a lot simpler with a generator:



def translate(text: str, lang_in="DE", lang_out="EN"):
hash_pattern = re.compile(r"^X#Xd{4}$")
for line in text.splitlines():
if line in {"", "n"} or hash_pattern.match(line):
yield line
else:
yield pydeepl.translate(line, lang_in, lang_out)
translation = "n".join(translate(hashed_text, lang_in="DE", lang_out="EN"))


Instead of checking startswith and len, I used a regular expression



split the work



This code:




  • reads a texts

  • hashes the commands

  • translates the text

  • dehashes the commands

  • writes to an output file

  • save the hashes to a file


You only split of a part of the hashing of the commands. It is simpler, clearer and easier to test if you also split of the rest



def hash_commands(
input_text: str, random_seed=None
) -> typing.Tuple[str, typing.Mapping]:
commands = (
r"\begin{w+}",
r"t",
" ",
"r",
r"\end{w+}",
r"\usepackage{w+}",
r"\newcommand{w+}",
r"\include{.*}",
r"\input{w+}",
r"\w+[.*}",
r"%.*",
r"\w+",
)
search_pattern = re.compile("|".join(commands))
ids = get_random_ids(seed=random_seed)

matches = search_pattern.findall(input_text)
hashes = {
command: f"X#X{id:04}"
for command, id in zip(matches, ids)
}
translate = make_xlat(hashes)
hashed_text = translate(input_text)
return hashed_text, hashes


takes a text, and hashes the commands. A way to generalize this would be to make the commands tuple an argument, or split of the generation of the hashesdict to a separate function, and use hashes as an argument to the hash_commands function.



def dehash_text(hashed_text: str, hashes: typing.Mapping):
hashes_inverse = {hash: command for command, hash in hashes.items()}
translate = make_xlat(hashes_inverse)
return translate(hashed_text)


does the inverse.



def save_hashes(hashes, out_file):
hashes_inverse = {hash: command for command, hash in hashes.items()}
json.dump(
{"hashes": hashes, "inverse": hashes_inverse}, out_file, indent=4
)


and then the main logic become very simple and easy to understand



if __name__ == "__main__":
input_file = Path("latex_sample.tex")
input_text = input_file.read_text()
hashed_text, hashes = hash_commands(input_text=input_text, random_seed=42)

hash_file = Path("hash_dict.json")
with hash_file.open("w") as hash_filehandle:
save_hashes(hashes, hash_filehandle)

hashed_file = input_file.with_suffix(".hash.tex")
hashed_file.write_text(hashed_text)

translation = "n".join(translate(hashed_text, lang_in="DE", lang_out="EN"))
translation_dehashed = dehash_text(translation, hashes)
# print(translation_dehashed)
output_file = input_file.with_suffix(".trans.tex")
output_file.write_text(translation_dehashed)


I included an output of the hashed text, so you can see whether the function correctly recognises the commands. That way you see it doesn't recognise the usepackage{tree-dvips}






share|improve this answer














naming



both the naming convention is wrong. PEP-008 standardizes on snake_case for variables and functions, as the naming itself is unclear. What is rx? It is a repattern. Then call it so. adict is indeed a dict, but it contains the hashes, so also call it like that.



map



map is a useful tool, but not a lot of people use it or know it. Especially since the advent of generator expressions, it is also hardly ever needed to use. I find a generator expression a lot simpler to read:



def make_xlat(hashes: dict) -> typing.Callable[[str], str]:
pattern = re.compile("|".join(re.escape(key) for key in hashes))
def one_xlat(match):
return hashes[match.group(0)]

def xlat(text: str):
return pattern.sub(one_xlat, text)

return xlat


the hashes



You use a convoluted way str(choice(range(1111, 9999, 1))) to generate a random number. This will lead to collisions. The easiest way to tackle this, is to use a generator, which keeps what numbers are given out already.



def get_random_ids(begin=0, end=9999):
yield from random.sample(range(begin, end + 1), end - begin + 1)


The downside is this materializes a list with all the elements. For the 10000 numbers, this is still okay, but if you need a number with 10 digits, this starts to count. An alternative here would be something like this:



def get_random_ids(begin=0, end=9999):
seen = set()
while True:
number = random.randint(begin, end)
if number in seen:
continue
seen.add(number)
yield number


But this only helps if you need a limited number of id's with a long length



To help you with testing, it helps if you can supply a seed to put the pseudo-random generator in the same state every time you test a particular piece



def get_random_ids(begin=0, end=9999, seed=None, ):
"""
generates unique random integers between `begin` and `end`

The random generator can be seeded with `seed`
"""
if seed is not None:
random.seed(seed)
yield from random.sample(range(begin, end + 1), end - begin + 1)


pathlib.Path



If instead of with the bare filenames, you convert them into a Path, reading and writing becomes a lot easier:



input_file = Path("latex_sample.tex")
input_text = input_file.read_text()


then to output to the hashed text, you can use with_suffix or with_name



hashed_file = input_file.with_suffix(".hash.tex")
hashed_file.write_text(hashed_text)


regex pattern



you have a very long regex pattern. In your make_xlat function you assemble one on the fly with '|'.join. You can do that here as well



commands = (
r"\begin{w+}",
r"t",
" ",
"r",
r"\end{w+}",
r"\usepackage{w+}",
r"\newcommand{w+}",
r"\include{.*}",
r"\input{w+}",
r"\w+[.*}",
r"%.*",
r"\w+",
)
search_pattern = re.compile("|".join(commands))


This way you can easily add commands or add comments for the more obscure commands



the translation



You go over the hashed text line per line, depending on some condition translate, and then do a lot of sting concatenation



This can be done a lot simpler with a generator:



def translate(text: str, lang_in="DE", lang_out="EN"):
hash_pattern = re.compile(r"^X#Xd{4}$")
for line in text.splitlines():
if line in {"", "n"} or hash_pattern.match(line):
yield line
else:
yield pydeepl.translate(line, lang_in, lang_out)
translation = "n".join(translate(hashed_text, lang_in="DE", lang_out="EN"))


Instead of checking startswith and len, I used a regular expression



split the work



This code:




  • reads a texts

  • hashes the commands

  • translates the text

  • dehashes the commands

  • writes to an output file

  • save the hashes to a file


You only split of a part of the hashing of the commands. It is simpler, clearer and easier to test if you also split of the rest



def hash_commands(
input_text: str, random_seed=None
) -> typing.Tuple[str, typing.Mapping]:
commands = (
r"\begin{w+}",
r"t",
" ",
"r",
r"\end{w+}",
r"\usepackage{w+}",
r"\newcommand{w+}",
r"\include{.*}",
r"\input{w+}",
r"\w+[.*}",
r"%.*",
r"\w+",
)
search_pattern = re.compile("|".join(commands))
ids = get_random_ids(seed=random_seed)

matches = search_pattern.findall(input_text)
hashes = {
command: f"X#X{id:04}"
for command, id in zip(matches, ids)
}
translate = make_xlat(hashes)
hashed_text = translate(input_text)
return hashed_text, hashes


takes a text, and hashes the commands. A way to generalize this would be to make the commands tuple an argument, or split of the generation of the hashesdict to a separate function, and use hashes as an argument to the hash_commands function.



def dehash_text(hashed_text: str, hashes: typing.Mapping):
hashes_inverse = {hash: command for command, hash in hashes.items()}
translate = make_xlat(hashes_inverse)
return translate(hashed_text)


does the inverse.



def save_hashes(hashes, out_file):
hashes_inverse = {hash: command for command, hash in hashes.items()}
json.dump(
{"hashes": hashes, "inverse": hashes_inverse}, out_file, indent=4
)


and then the main logic become very simple and easy to understand



if __name__ == "__main__":
input_file = Path("latex_sample.tex")
input_text = input_file.read_text()
hashed_text, hashes = hash_commands(input_text=input_text, random_seed=42)

hash_file = Path("hash_dict.json")
with hash_file.open("w") as hash_filehandle:
save_hashes(hashes, hash_filehandle)

hashed_file = input_file.with_suffix(".hash.tex")
hashed_file.write_text(hashed_text)

translation = "n".join(translate(hashed_text, lang_in="DE", lang_out="EN"))
translation_dehashed = dehash_text(translation, hashes)
# print(translation_dehashed)
output_file = input_file.with_suffix(".trans.tex")
output_file.write_text(translation_dehashed)


I included an output of the hashed text, so you can see whether the function correctly recognises the commands. That way you see it doesn't recognise the usepackage{tree-dvips}







share|improve this answer














share|improve this answer



share|improve this answer








edited Dec 5 at 13:58

























answered Dec 5 at 13:40









Maarten Fabré

4,384417




4,384417












  • Thanks! learned a lot while reading this. Will add this this evening!
    – Mr.Sh4nnon
    Dec 5 at 13:53


















  • Thanks! learned a lot while reading this. Will add this this evening!
    – Mr.Sh4nnon
    Dec 5 at 13:53
















Thanks! learned a lot while reading this. Will add this this evening!
– Mr.Sh4nnon
Dec 5 at 13:53




Thanks! learned a lot while reading this. Will add this this evening!
– Mr.Sh4nnon
Dec 5 at 13:53










Mr.Sh4nnon is a new contributor. Be nice, and check out our Code of Conduct.










draft saved

draft discarded


















Mr.Sh4nnon is a new contributor. Be nice, and check out our Code of Conduct.













Mr.Sh4nnon is a new contributor. Be nice, and check out our Code of Conduct.












Mr.Sh4nnon is a new contributor. Be nice, and check out our Code of Conduct.
















Thanks for contributing an answer to Code Review Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f209049%2ftranslating-latex-documents-into-different-languages-using-deepl-api%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Plaza Victoria

In PowerPoint, is there a keyboard shortcut for bulleted / numbered list?

How to put 3 figures in Latex with 2 figures side by side and 1 below these side by side images but in...