Everybody (almost) is talking about artificial intelligence (AI). I wanted to see if ChatGPT, a large language model-based chatbot developed by OpenAI, is able to help in coding. Can ChatGPT evaluate and improve program code? Or can it even write its own program code, based on the task you set forth?
NOTE: I am not a professional programmer – far from it! I do write scripts under Linux to perform various tasks on my computers. I also have only limited experience with ChatGPT. For this post I was using ChatGPT version 3.5 (i.e. the free version released in November 2022). There is a more powerful and up-to-date version 4, available for a monthly subscription fee.
NOTE 2: ChatGPT uses, as the name suggests, a chat dialog to interact with the AI engine. As far as I can tell, ChatGPT evaluates the entire chat dialog and allows back references to earlier user input.
NOTE 3: In the examples below, my user input at the chat prompt is identified by a ‘>’ character at the beginning of the line.
Using ChatGPT to Evaluate a Bash Script
I had written a set of scripts to create and compare hashes for each file in a directory tree. The purpose of these scripts is to identify corrupt files so that they can be replaced with an intact copy from the backup, or from the source if it is the backup file that is corrupt.
Each script performs a different task. For example, one script checks a directory tree for new, modified, or deleted files and updates the list of files and hashes. Another script compares a source directory with a backup directory (on an external or remote drive) to see if they match. Yet another script recalculates the file hashes and compares them to the hashes stored in a file to find corrupted files. All functions are kept in a separate file and called by the scripts as needed.
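To give an idea of the approach, creating and later re-checking such a hash list can look like the following. This is a simplified sketch, not my actual scripts, and all file and directory names are made up:

```shell
# Build a small demo tree (illustrative names only)
dir="demo-tree"
mkdir -p "$dir/sub"
printf 'hello\n' > "$dir/a.txt"
printf 'world\n' > "$dir/sub/b.txt"

# Create the hash list: one "HASH  PATH" line per file
( cd "$dir" && find . -type f -print0 | xargs -0 sha256sum | sort -k2 ) > hashes.txt

# Later: recompute the hashes and compare; sha256sum -c flags corrupt files
( cd "$dir" && sha256sum -c --quiet ../hashes.txt ) && echo "all files intact"
```

If a file's content has changed since the list was created, sha256sum -c exits with a non-zero status and names the offending file.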
Below are some examples of the questions and tasks I fed to ChatGPT:
> Improve Bash script functions
Underneath the above task I copied/pasted the entire functions file mentioned before. It’s too long to copy here. Note that ChatGPT had no clue how these functions are used within the actual scripts, as I did not share the scripts themselves.
ChatGPT 3.5 responded with “Your bash script looks well-structured and functional. However, there are some areas where improvements can be made”. This was followed by a list of 13 suggestions, each rather general but valid. Only point 13 was specific:
fs_mount() {
    check_path
    if [ -z "$lv" ]; then
        return
    fi
    if ! command -v kpartx &> /dev/null; then
        echo "'kpartx' from package 'multipath-tools' not found - abort"
        exit 9
    fi
    # Rest of your code...
}
Compare that to my code and you can easily see the improvement:
fs_mount ()
{
    if [ -z "$lv" ]; then
        check_path; return
    fi
    ! type kpartx && ( echo "'kpartx' from package 'multipath-tools' not found - abort"; exit 9 )
In my opinion ChatGPT did a surprisingly good job. The specific suggestion it made is spot-on and I immediately implemented it.
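Part of why the suggestion is a real improvement, beyond the use of command -v: in my version the exit 9 runs inside ( … ), i.e. in a subshell, so it aborts only the subshell while the script keeps running. ChatGPT’s if-block runs exit in the current shell and really stops the script. A small demonstration (the tool name is deliberately fictitious, and the demo uses return so it can keep running):

```shell
# An 'exit' inside ( ... ) terminates only the subshell:
( exit 9 )
echo "after subshell: status=$?"   # the script is still running

# The suggested pattern runs its check in the current shell, so the
# abort really takes effect; shown here as a function with 'return':
check_tool() {
    if ! command -v "$1" > /dev/null 2>&1; then
        echo "'$1' not found - abort"
        return 9    # in a real script: exit 9
    fi
}
check_tool "hopefully-nonexistent-tool-xyz"
echo "guard returned status $?"
```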
> Explain Bash function
Below that request I copied/pasted a single function. Here is my own code:
compare_hashes ()
{
    # Compare two hash files (e.g. original and backup) and find hash mismatches
    # Get matching file names, then filter out entries with hash mismatches
    awk -F$'  ' 'NR == FNR {a[$2]=$1; next}
        ($2 in a) && (a[$2] != $1)' "$hashfile1" "$hashfile2" > "$hashfile2-mismatch"
    # Check if files are missing in the backup (hashfile2)
    awk -F$'  ' 'NR == FNR {a[$2]=$1; next}
        ! ($2 in a)' "$hashfile2" "$hashfile1" > "$hashfile2-missing"
    if [ -s "$hashfile2-missing" ]; then
        echo "ALERT: Backup (hash file 2) is missing files - see $hashfile2-missing"
    else
        rm "$hashfile2-missing"
    fi
    # If the file names match but the hashes don't, alert the user
    if [ -s "$hashfile2-mismatch" ]; then
        echo "ALERT: Some hashes don't match - see $hashfile2-mismatch"
    else
        echo "SUCCESS: The hashes in above hash files match."
        rm "$hashfile2-mismatch"
    fi
    chown $logu:$logu "$hashfile2"*
}
ChatGPT 3.5 did an outstanding job in explaining exactly how this code works. Below is the ChatGPT response:
This Bash function, named compare_hashes, is designed to compare two hash files and identify any discrepancies between them. Below is an explanation of what each section of the function does:
- Comment Block:
# Compare two hash files (e.g. original and backup) and find hash mismatches
This comment provides an overview of the purpose of the function.
- First awk Command:
awk -F$'  ' 'NR == FNR {a[$2]=$1; next} ($2 in a) && (a[$2] != $1)' "$hashfile1" "$hashfile2" > "$hashfile2-mismatch"
This awk command is used to compare two hash files. Let’s break down the command:
- -F$'  ': This sets the field separator to two spaces. This means that it will use two consecutive spaces as the delimiter to separate fields.
- NR == FNR {a[$2]=$1; next}: This block of code is executed for the first file, "$hashfile1". NR is the current line number, and FNR is the current line number of the current file. When NR and FNR are equal, it means that we are processing the first file. This block of code creates an associative array a where the keys are the second field ($2, assumed to be file names) and the values are the first field ($1, assumed to be hashes).
- ($2 in a) && (a[$2] != $1): This condition checks if the file name from the second file ($2) exists as a key in the associative array a, and if the hash ($1) is different from the hash in the array. If both conditions are met, it means there is a hash mismatch.
- "$hashfile1" "$hashfile2" > "$hashfile2-mismatch": This command reads both hash files ($hashfile1 and $hashfile2) and applies the awk command. The output, which contains the mismatched entries, is then redirected to a file called "$hashfile2-mismatch".
- Second awk Command:
awk -F$'  ' 'NR == FNR {a[$2]=$1; next} ! ($2 in a)' "$hashfile2" "$hashfile1" > "$hashfile2-missing"
This awk command is used to identify files that are in the second hash file but missing in the first. It follows a similar structure to the first awk command.
- Error Handling and Feedback:
if [ -s "$hashfile2-missing" ]; then
    echo "ALERT: Backup (hash file 2) is missing files - see $hashfile2-missing"
else
    rm "$hashfile2-missing"
fi
If the file "$hashfile2-missing" is not empty (meaning there are missing files in the backup), it displays an alert message. Otherwise, it removes the file.
- Second Set of Feedback:
if [ -s "$hashfile2-mismatch" ]; then
    echo "ALERT: Some hashes don't match - see $hashfile2-mismatch"
else
    echo "SUCCESS: The hashes in above hash files match."
    rm "$hashfile2-mismatch"
fi
If the file "$hashfile2-mismatch" is not empty (meaning there are mismatched hashes), it displays an alert message. Otherwise, it prints a success message and removes the file.
- Ownership Modification:
chown $logu:$logu "$hashfile2"*
This command is used to change the ownership of files matching the pattern "$hashfile2"* to the user and group specified by the variable $logu.
In summary, this function compares two hash files and provides feedback about any discrepancies found, including missing files and mismatched hashes. It also manages the creation and deletion of temporary files ("$hashfile2-missing" and "$hashfile2-mismatch") and adjusts ownership for certain files.
Here are my comments:
- ChatGPT 3.5 provides a good description of what the function does.
- It breaks down the code into logical blocks with descriptive headings.
- It describes the somewhat complex first awk command in a simple, easily understandable way.
- Even without having access to the content of hashfile1 or hashfile2, ChatGPT correctly guesses the data fields (e.g. $2/column 2 is the file name column).
- ChatGPT correctly interprets the if.. clauses and their meaning, for example in “If the file "$hashfile2-mismatch" is not empty (meaning there are mismatched hashes)”.
In most if not all parts of this AI-generated analysis, ChatGPT provides better and clearer explanations than what can typically be found in human-written tutorials or on the Internet.
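The NR == FNR idiom that ChatGPT explains so well is easy to experiment with in isolation. Here is a minimal, self-contained example; the hashes and file names are made up, and a two-space field separator is used as in my scripts:

```shell
# Two toy hash lists: field 1 = hash, field 2 = file name (two-space separated)
printf '%s\n' 'aaa  file1' 'bbb  file2' 'ccc  file3' > hashes-src
printf '%s\n' 'aaa  file1' 'XXX  file2'              > hashes-bak

# Same pattern as compare_hashes: remember hashes from the first file,
# then print lines of the second file whose hash differs
awk -F'  ' 'NR == FNR {a[$2]=$1; next}
            ($2 in a) && (a[$2] != $1)' hashes-src hashes-bak
# prints: XXX  file2

# Files present in the source list but missing from the backup list
awk -F'  ' 'NR == FNR {a[$2]; next}
            ! ($2 in a)' hashes-bak hashes-src
# prints: ccc  file3
```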
> Explain code that handles moved and renamed files
Within my update_hash () function is some code that identifies moved or renamed files to avoid recalculating the hash for those files. Here is the ChatGPT response:
The code that handles moved and renamed files in the update_hash function involves a multi-step process. Here’s an overview of how it works:
- Identifying Moved/Renamed Files:
awk 'NR == FNR {a[$1]; next}
($1 in a)' "$lof-del-old" "$lof-add-new" > "$lof-new-time"
awk 'NR == FNR {a[$1]; next}
($1 in a)' "$lof-add-new" "$lof-del-old" > "$lof-old-time"
- The first awk command compares the list of deleted files ($lof-del-old) with the list of added files ($lof-add-new) based on their modification timestamps. It creates a file called $lof-new-time containing lines that exist in both lists.
- The second awk command does the opposite: it compares the list of added files with the list of deleted files and creates a file called $lof-old-time containing lines that exist in both lists.
Essentially, these commands identify files that have the same modification timestamps in both the old and new lists. This suggests that the file may have been moved or renamed.
- Further Filtering Based on Size and Inode:
awk 'NR == FNR {a[$2]; next}
($2 in a)' "$lof-old-time" "$lof-new-time" > "$lof-new-size"
awk 'NR == FNR {a[$2]; next}
($2 in a)' "$lof-new-time" "$lof-old-time" > "$lof-old-size"
- These commands perform a similar operation as the previous ones, but this time they compare files based on their sizes (field 2 in the file list) and inodes (field 3 in the file list).
- The result is that $lof-new-size will contain lines that have both the same modification time and file size, and $lof-old-size will contain lines that have both the same modification time and file size.
- Checking Inodes:
awk 'NR == FNR {a[$3]; next}
($3 in a)' "$lof-old-size" "$lof-new-size" > "$lof-new-inode"
awk 'NR == FNR {a[$3]; next}
($3 in a)' "$lof-new-size" "$lof-old-size" > "$lof-old-inode"
- These commands continue the comparison, this time using the inode numbers. The resulting files, $lof-new-inode and $lof-old-inode, will contain lines that have matching modification times, file sizes, and inode numbers.
- Creating a List of Moved/Renamed Files:
while IFS=$'\v' read -r old new
do
    old_dir="$(echo "$old" | cut -f 4)"
    new_dir="$(echo "$new" | cut -f 4)"
    echo -e "$old_dir\t$new_dir" >> "$lof-update"
    echo "$new" >> "$lof-moved"
done < <(paste -d '\v' "$lof-old-inode" "$lof-new-inode")
- This loop reads lines from a file created by pasting together the contents of $lof-old-inode and $lof-new-inode. Each line in these files represents a pair of files (one from the old list, one from the new list) that have the same modification time, file size, and inode.
- For each pair, it extracts the directory paths (old_dir and new_dir) and appends them to the $lof-update file. It also appends the new file to the $lof-moved file.
- This effectively creates a list of files that are likely to have been moved or renamed.
- Updating Hashes for Moved/Renamed Files:
- The subsequent code then ensures that the hashes for moved or renamed files are not recomputed. Instead, it retains the original hash for these files.
In summary, this code identifies files that have the same modification time, size, and inode in both the old and new lists. It assumes these files have been moved or renamed and handles them accordingly by preserving their original hash values.
Here are my comments:
- Here ChatGPT does a decent job interpreting the Bash code.
- There is a mistake in section 1 above: “It creates a file called $lof-new-time containing lines that exist in both lists.” This is incorrect. It creates a file containing lines where only the time stamp is identical. Here is a practical example:
Sample content of the lof-new-time file:
2023-09-16+00:55:09.0000000000 248437 296889 ./2023/LR23 Catalog.lrcat-data/LOG.old.1696516273305313
Sample content of the lof-old-time file:
2023-09-16+00:55:09.0000000000 248437 296889 ./2023/LR23 Catalog.lrcat-data/LOG
As you can see, the last columns (path/file name) are different. At this point the script compares the first column only. (The next steps deal with the second and third columns.)
- There is another mistake under section 2: this part of the code does not deal with inodes, only file sizes.
- ChatGPT seems to correctly use variable names to identify the data held within a variable.
ChatGPT is pretty good but not perfect. It comes as a bit of a surprise to discover mistakes in this rather straightforward example, especially when looking at how well it did with the more complex compare_hashes () function.
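To make the first comparison stage concrete, here is a toy run on data shaped like the samples above. The file names and values are invented:

```shell
# Deleted-files list and added-files list: timestamp, size, inode, path
printf '%s\n' '2023-09-16+00:55:09 248437 296889 ./old/LOG' > lof-del-old
printf '%s\n' '2023-09-16+00:55:09 248437 296889 ./new/LOG.renamed' \
              '2023-09-17+10:00:00 100 200 ./new/other' > lof-add-new

# Stage 1: keep "added" lines whose timestamp (column 1) also occurs
# among the deleted lines - candidates for a move or rename
awk 'NR == FNR {a[$1]; next} ($1 in a)' lof-del-old lof-add-new
# prints: 2023-09-16+00:55:09 248437 296889 ./new/LOG.renamed
```

Only the timestamp is matched here; the later stages narrow the candidates down by size and inode.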
Using ChatGPT to Write a Bash Script
We have seen that ChatGPT 3.5 is able to analyze and explain how a Bash script works. In this chapter I will use the AI platform to write a Bash script for me. Here is what I typed at the ChatGPT prompt:
> Write bash script to identify moved or renamed files (NTFS, ext4)
Here is the ChatGPT 3.5 response:
Identifying moved or renamed files across different file systems (NTFS and ext4, for example) can be a bit tricky due to differences in how they handle file metadata. One approach is to compare files based on characteristics like modification time, size, and inode.
Here’s a basic Bash script that attempts to identify moved or renamed files between two directories on different file systems. Please note that this script assumes you have appropriate permissions to access both directories.
#!/bin/bash

# Define source and target directories
source_dir="/path/to/source/directory"
target_dir="/path/to/target/directory"

# Create temporary files for storing file information
source_info=$(mktemp)
target_info=$(mktemp)

# Function to collect file information
get_file_info() {
    find "$1" -type f -printf "%p\t%T+\t%s\t%i\n"
}

# Get file information for source and target directories
get_file_info "$source_dir" > "$source_info"
get_file_info "$target_dir" > "$target_info"

# Compare source and target file lists
while IFS=$'\t' read -r source_path source_time source_size source_inode
do
    while IFS=$'\t' read -r target_path target_time target_size target_inode
    do
        if [[ "$source_time" == "$target_time" && "$source_size" == "$target_size" && "$source_inode" == "$target_inode" ]]; then
            echo "File '$source_path' has likely been moved or renamed to '$target_path'"
        fi
    done < "$target_info"
done < "$source_info"

# Clean up temporary files
rm "$source_info"
rm "$target_info"
Here’s how this script works:
- It defines two directories: source_dir and target_dir.
- It creates temporary files to store file information (source_info and target_info).
- The get_file_info function uses find to retrieve information about each file in a specified directory.
- It runs get_file_info for both the source and target directories and saves the results in their respective temporary files.
for both the source and target directories and saves the results in their respective temporary files. - It then reads each line of file information from the source directory and compares it with the file information from the target directory based on modification time, size, and inode.
- If it finds matching files, it prints a message indicating that the file has likely been moved or renamed.
Please note that this script is a basic starting point and may not cover all edge cases or handle large file sets efficiently. Depending on your specific use case, you may need to further refine or optimize the approach. Additionally, be cautious when running scripts like this in production environments and ensure you have backups of your data.
My comments:
- The ChatGPT Bash code is simple and straightforward.
- The script, however, doesn’t work. The source_info must contain a snapshot of the files’ time, size, and inode info before files were moved or renamed. The target_info must contain the current file metadata. ChatGPT fails to take that into account.
- As is, the script produces false positives for each and every file that hasn’t been moved or renamed. It should only take deleted files from source_info and added files from target_info into account.
- The nested while…do…done loops are not very efficient for large numbers of files. For example, if both source_info and target_info have 30 entries, the comparison takes 900 iterations.
- The script can be fixed, but it’s not trivial.
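For what it’s worth, the quadratic nested loop could be replaced by a single-pass awk join over the two temporary files, much like the idiom used in my own scripts: O(n+m) instead of O(n*m). A hedged sketch, assuming the tab-separated path/time/size/inode layout that the ChatGPT script’s find -printf produces (the sample data below is invented):

```shell
# Fake "before" and "after" snapshots: path, time, size, inode (tab-separated)
printf 'old/a.txt\tT1\t10\t100\n' >  source_info
printf 'old/b.txt\tT2\t20\t200\n' >> source_info
printf 'new/a-renamed.txt\tT1\t10\t100\n' > target_info
printf 'new/c.txt\tT3\t30\t300\n'        >> target_info

# Build a composite time/size/inode key from the old snapshot, then
# look it up while scanning the new snapshot - a single pass per file
awk -F'\t' 'NR == FNR {a[$2 FS $3 FS $4] = $1; next}
            ($2 FS $3 FS $4) in a {
                printf "File %s has likely been moved or renamed to %s\n",
                       a[$2 FS $3 FS $4], $1
            }' source_info target_info
# prints: File old/a.txt has likely been moved or renamed to new/a-renamed.txt
```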
In summary, ChatGPT version 3.5 failed to produce a working script. It didn’t understand the complexity of the task at hand. At the same time, the description of the script suggests that the script would work.
Questions to ChatGPT
Writing the part that identifies renamed or moved files required me to do some research on file systems. Much of my data (files) is on NTFS partitions while I run these scripts on a Linux platform. NTFS doesn’t have inodes but uses a Master File Table (MFT). I haven’t found much information on the Internet so I was curious to see what ChatGPT had to say. Below are some of the questions I asked:
> If I mount a NTFS file system on Linux, can I use the inode to identify a file?
ChatGPT 3.5 replies:
No, you cannot use inodes to identify files on an NTFS file system when it’s mounted on Linux. Inodes are a concept specific to Unix-like file systems (such as ext4), and they are used to uniquely identify files within the file system.
NTFS (New Technology File System) is a file system developed by Microsoft for use in Windows operating systems. It uses a different mechanism to uniquely identify files, and this is typically based on a combination of the Master File Table (MFT) entry number and the file’s name.
When you mount an NTFS file system on Linux, the Linux NTFS driver (either ntfs-3g or the kernel NTFS driver) translates NTFS-specific information into a format compatible with the Linux file system interface. This allows Linux to interact with the NTFS file system, but it doesn’t expose NTFS-specific identifiers like MFT entry numbers or inodes directly to the user.
If you need to identify files on an NTFS file system, you’ll typically use their names, paths, or other attributes provided by the NTFS file system itself.
> Explain the difference between the ntfs-3g driver and the new kernel ntfs driver
ChatGPT 3.5 answers:
The ntfs-3g driver and the new kernel NTFS driver are both software components designed to provide read and write access to NTFS file systems on Linux. However, they have several key differences:
- User Space vs. Kernel Space:
- ntfs-3g: ntfs-3g operates in user space, which means it runs outside of the Linux kernel. It’s a FUSE (Filesystem in Userspace) driver. FUSE allows non-privileged users to create their own file systems without requiring any modifications to the kernel. This approach allows for faster development and easier integration, as it doesn’t require kernel modifications.
- New Kernel NTFS Driver: The new kernel NTFS driver is part of the Linux kernel itself. It operates in kernel space, meaning it’s integrated directly into the kernel. Kernel drivers have lower overhead and generally provide better performance compared to FUSE-based drivers.
- Performance:
- ntfs-3g: While ntfs-3g is user-friendly and feature-rich, it may have slightly higher CPU and memory overhead due to its FUSE architecture.
- New Kernel NTFS Driver: The new kernel NTFS driver is expected to offer better performance because it operates directly in kernel space.
- Maturity:
- ntfs-3g: ntfs-3g has been in widespread use for a long time and is considered very stable and reliable.
- New Kernel NTFS Driver: The new kernel NTFS driver may be relatively newer and might still be undergoing development and optimization.
- Compatibility:
- ntfs-3g: ntfs-3g is compatible with a wider range of Linux distributions and kernel versions.
- New Kernel NTFS Driver: The new kernel NTFS driver’s availability might depend on the specific Linux distribution and kernel version you’re using.
- Security:
- ntfs-3g: Being a user space driver, ntfs-3g has the advantage of being isolated from the kernel. This can add an extra layer of security.
- New Kernel NTFS Driver: Being a kernel module, it has the potential to interact more closely with the system, which can be both an advantage and a potential security concern.
In summary, the choice between ntfs-3g and the new kernel NTFS driver depends on your specific requirements. If you need the best performance and are willing to accept the potential risks associated with kernel modules, the new kernel NTFS driver might be the preferred option. Otherwise, ntfs-3g is a reliable and widely-used choice for NTFS support on Linux.
> Does the new NTFS kernel driver use inodes?
Here ChatGPT provides an interesting answer:
Yes, the new NTFS kernel driver in Linux uses inodes, just like any other file system supported by the Linux kernel. Inodes are data structures used to represent files and directories in a file system.
Each file or directory on an NTFS file system is associated with an inode, which contains metadata about the file, such as:
- File type and permissions: Whether it’s a regular file, directory, symbolic link, etc., and what permissions are set for the file.
- File size: The size of the file in bytes.
- Owner and group: The user and group associated with the file.
- Timestamps: Information about when the file was created, last accessed, and last modified.
- File attributes: Additional information about the file, such as whether it’s compressed, encrypted, etc.
- Data block pointers: Pointers to the actual data blocks on the disk where the file’s content is stored.
When you perform operations on files or directories, the kernel NTFS driver uses inodes to keep track of and manage these entities. This includes activities like opening, reading, writing, and deleting files.
Using inodes allows the kernel to efficiently manage the file system, track file metadata, and locate the actual data associated with each file. It’s a fundamental concept in how file systems work in Unix-like operating systems, including Linux.
> When I remount an NTFS file system on Linux, will the inodes be consistent with the previous mount?
The ChatGPT 3.5 answer is:
When you remount an NTFS file system on Linux, the inodes should remain consistent with the previous mount. The inode numbers themselves are specific to the file system and are managed by the file system driver.
However, it’s important to note that if you perform operations on the file system while it’s unmounted or if there are any issues with the file system, the inode mappings may become inconsistent. In such cases, running a file system check (using a tool like ntfsfix for NTFS) may be necessary to repair any inconsistencies.
Additionally, if you’re working with a shared file system (like a networked file system), the behavior may depend on how the file system is accessed and managed across different systems. In such cases, it’s important to ensure that all systems accessing the file system are using compatible file system drivers and configurations.
In normal usage, where the file system is mounted and unmounted properly, the inodes should remain consistent across mounts on the same system.
> On an NTFS drive, which characteristics can I use to identify moved or renamed files?
Ask and you’ll get an answer from ChatGPT:
On an NTFS drive, you can use the following characteristics to potentially identify moved or renamed files:
- File Size: If a file has been moved or renamed but not modified, its size will remain the same.
- Last Modification Time (Timestamp): If a file has been moved or renamed without modification, its last modification timestamp will remain the same.
- Inode (MFT Entry Index): In NTFS, each file is represented by a Master File Table (MFT) entry, which can be thought of as an equivalent to an inode in traditional file systems. The MFT entry index is a unique identifier for each file on the file system. If a file is moved or renamed, it retains the same MFT entry index.
- File Content (Checksum or Hash): If you have recorded hashes or checksums of files, you can use them to verify if the content of a file has changed.
- Path and File Name: If a file is renamed, the path and/or name will change.
- File Signature or Magic Number: Some file formats have specific signatures or “magic numbers” at the beginning of the file. These can be used to identify the file type, and if they change, it may indicate a modification or a new file.
It’s important to note that while these characteristics can be used to identify potentially moved or renamed files, they do not guarantee it. There could be other scenarios (like a new file with the same content) where these characteristics might also match. Therefore, it’s often recommended to use a combination of these factors for more robust identification.
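On Linux, most of these characteristics can be collected in a single pass with GNU find, regardless of the underlying file system. A minimal sketch; the directory name below is illustrative:

```shell
# Build a tiny demo tree (illustrative names only)
mkdir -p chars-demo
printf 'data\n' > chars-demo/sample.txt

# Record inode, size, modification time, and path for every file
find chars-demo -type f -printf '%i\t%s\t%T+\t%p\n'

# The same attributes for a single file via GNU stat
stat -c '%i %s %Y %n' chars-demo/sample.txt
```

Snapshotting these fields before and after a change, then joining on time, size, and inode, is essentially the strategy my update_hash () function uses.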
> How does the new NTFS kernel driver use the MFT?
I wanted to know exactly how the NTFS kernel driver uses the MFT, so here is ChatGPT 3.5’s reply:
As of my last knowledge update in September 2021, the Linux kernel’s NTFS driver (often referred to as ntfs3 or ntfs-3g) does not directly use the Master File Table (MFT) for file operations. Instead, it relies on its own algorithms and data structures to interact with NTFS volumes.
The MFT is a crucial data structure in NTFS that maintains information about files and directories, including metadata like file names, attributes, and file extents. It essentially acts as a central database for the file system.
The ntfs-3g driver provides NTFS read and write support for Linux. It has been designed from scratch with the goal of providing a reliable and stable solution for accessing NTFS volumes on Linux systems. It does this by implementing its own parser for the NTFS structures, allowing it to interact with the file system without direct reliance on the MFT.
The new NTFS kernel driver you mentioned might refer to any advancements or developments that have occurred after September 2021. If there have been significant changes or the introduction of a new kernel driver specifically designed to handle NTFS, I wouldn’t have information on that since my training data only includes information up until September 2021. To get the most accurate and up-to-date information, I recommend consulting official Linux documentation or forums related to kernel development.
My comment: This response partly contradicts the earlier ChatGPT responses. It also confuses the original NTFS kernel driver, the Paragon ntfs3 kernel driver, and the Tuxera ntfs-3g user space driver with one another. For a comparison see the table here.
Summary
This short evaluation refers explicitly to ChatGPT version 3.5, which is free of charge (as of this writing). ChatGPT is quite capable of analyzing and explaining Bash scripts, though not perfect. It is surprisingly good at understanding and presenting awk code; its explanation of my compare_hashes () function is one of the best descriptions of how such a script works that I have come across.
AI is also helpful in cleaning up code and making it more readable, and it provides useful guidelines for improving your program code. ChatGPT version 3.5 is less good at writing code, at least when the task is complex. In any case, the user must carefully check the code before using it.
Using ChatGPT’s knowledge base is potentially dangerous or misleading! Sometimes it seems that ChatGPT is hallucinating. Be very careful when using AI generated information. At first the information presented by ChatGPT looks correct. But when you scrutinize it using reliable information sources, you may discover mistakes.
What is puzzling is that it presents information in a very clear and structured manner, like an accomplished master and tutor in a subject. Yet, at a closer look you discover the flaws.
Bear in mind that ChatGPT includes information up until September 2021!!! Linux has seen a lot of development and changes since then. So have other areas of knowledge.
I have not tried the new and much more up-to-date ChatGPT version 4, as that version is not free. There are other AI powered services out there, as well as dedicated automatic programming tools.
WARNING: Not only does AI have flaws and shortcomings that call for careful use. As AI develops and becomes more powerful, there is a real danger of things getting out of hand, for example by feeding biased information into the AI engine. There are many more dangers associated with AI, see for example here.