Creating file lists in the terminal and joining two files in a Python 3 script

kutlus

I have a directory called 'dir' that contains nested subdirectories. I write the names of the files in all subdirectories to a CSV file with the following command in the Linux terminal.

dir$ find . -type f -printf '%f\n' > old_names.csv

I then use detox to change the filenames, and I make a new list using

dir$ find . -type f -printf '%f\n' > new_names.csv

I would like to join these two lists into a new list with two columns, so that each old filename sits next to its new filename.

To do that, I read both CSV files into pandas data frames and join them on the index, as follows, in a Python 3 script:

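 import os
 import pandas as pd

 # somepath is the directory that holds both CSV files
 # header=None: the files have no header row, so the column is named explicitly
 df_old = pd.read_csv(os.path.join(somepath, 'old_names.csv'), header=None, names=['old_name'])
 df_new = pd.read_csv(os.path.join(somepath, 'new_names.csv'), header=None, names=['new_name'])
 # join() matches rows on the index, i.e. by their position in each file
 df_names = df_new.join(df_old)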

The problem is that the result pairs the wrong files together.

When I open new_names.csv, I see that the file list is written in a different order than in old_names.csv, so joining on the index produces wrong pairs. How can I solve this problem?

Michael Homer

The find command just outputs in the order the filesystem gives its directory entries in, without any sorting or processing. Depending on the filesystem you're using and other factors, renaming even a single file could change the iteration order, but changing all of them is quite likely to do so. Without a tightly-controlled environment there's no particular reason that two finds should give the same order like that.

For example, many modern filesystems store names in a hash table, and iterate in the order entries appear there. A filename that changes only slightly may land much earlier or later in the table than the original, or even cause total re-hashing of the entire directory so that everything moves. There's no realistic way to put the pieces back together in that case.
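As a rough illustration of that effect, here is a toy model in Python. It is only a sketch: sorting by zlib.crc32 stands in for whatever hash a real filesystem uses, and the filenames and renames are made up for the example, not taken from detox.

```python
import zlib

def toy_listing(names):
    """Return names in a toy 'hash table' order: sorted by CRC32 of the name.

    This is a stand-in for a filesystem's directory order, not a real one.
    """
    return sorted(names, key=lambda n: zlib.crc32(n.encode("utf-8")))

# Hypothetical renames, made up for this illustration.
renamed = {
    "my file.txt": "my_file.txt",
    "photo 01.jpg": "photo_01.jpg",
    "notes (old).txt": "notes_old.txt",
    "a&b.csv": "a_and_b.csv",
}

old_listing = toy_listing(renamed.keys())
new_listing = toy_listing(renamed.values())

# Pair the two listings by position, as an index-based join would.
wrong = sum(1 for old, new in zip(old_listing, new_listing) if renamed[old] != new)
print(f"{wrong} of {len(renamed)} positional pairs are wrong")
```

Each listing contains the right names, but nothing ties position i in one listing to position i in the other, which is why joining on the index cannot be relied on.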

It's possible that sorting the filenames might help, if they each have a unique unchanged prefix, but that's the only realistic sort of post-processing you could do and carry on with two separate files from two find runs. I don't recommend even trying that.


However, detox does have a -v option that prints out the changes it is making (and -n to print out what it would do). You could use that to produce your CSV file, either in the shell or by running detox directly from Python using subprocess.run.

detox -v ... | sed -e 's/ -> /,/' > names.csv

would produce a CSV file at least as well as one of your finds, with the old and new names automatically matched up. For the basenames (like %f did) you'll need to postprocess, which you can do in Python if necessary, or in the shell.
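If you would rather do this from Python, a minimal sketch might look like the following. It assumes that detox -v reports each rename on its own line as "old -> new" (the same format the sed command above relies on) and skips every other line. The function names and the detox_args parameter are hypothetical; pass whatever arguments you already give detox.

```python
import csv
import os
import subprocess

def parse_detox_output(lines):
    """Return (old_basename, new_basename) pairs from 'old -> new' lines."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: rename reports contain " -> "; skip all other output.
        if " -> " not in line:
            continue
        # Split on the first " -> " only; a name that itself contains
        # " -> " would be ambiguous and is not handled here.
        old, new = line.split(" -> ", 1)
        pairs.append((os.path.basename(old), os.path.basename(new)))
    return pairs

def write_names_csv(pairs, csv_path):
    """Write the pairs as a two-column CSV with a header row."""
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["old_name", "new_name"])
        writer.writerows(pairs)

def detox_to_csv(detox_args, csv_path):
    """Run detox -v with the caller's arguments and save the name pairs."""
    result = subprocess.run(
        ["detox", "-v", *detox_args],
        capture_output=True, text=True, check=True,
    )
    write_names_csv(parse_detox_output(result.stdout.splitlines()), csv_path)
```

Because the csv module quotes fields where needed, a filename that contains a comma stays in a single column, which the plain sed substitution does not guarantee. Taking os.path.basename of each side gives the same bare names that %f produced.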
