I am working on a PySpark script that performs a simple word count. The script runs fine until I try to save the results with saveAsTextFile, at which point it fails (I am on Ubuntu). Here is the error I get:
```
py4j.protocol.Py4JJavaError: An error occurred while calling o48.saveAsTextFile.
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/home/pyspark_python/wordcount/output_new already exists
```

Here are the steps I have taken so far:
- Verified that the output directory does not contain any data (`ls` shows it is empty).
- Deleted and recreated the directory using `rm -r` and `mkdir -p`.
- Ensured no other Spark jobs are running (`ps aux | grep spark`).
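Consolidated, the cleanup I ran looks roughly like this (sketched with a stand-in path under `/tmp` so it is reproducible; my real directory is the `output_new` path shown in the error):

```shell
# Stand-in for /home/othniel/pyspark_python/wordcount/output_new
OUT="${TMPDIR:-/tmp}/wordcount_output_new"

rm -rf "$OUT"                  # delete the old output directory
mkdir -p "$OUT"                # recreate it, empty
ls -A "$OUT"                   # prints nothing: the directory is empty
ps aux | grep spark || true    # check for running Spark jobs (none in my case)
```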
Despite this, the error persists when I re-run the script.
Here is the code I am using:
```python
from pyspark import SparkConf, SparkContext
import os

def main(input_file, output_dir):
    # Spark configuration
    conf = SparkConf().setAppName("WordCountTask").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    # Read the input file
    text_file = sc.textFile(input_file)

    # Count the words
    counts = (
        text_file.flatMap(lambda line: line.split(" "))
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )

    # Save the results
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    counts.saveAsTextFile(output_dir)
    print(f"Results saved to directory: {output_dir}")

if __name__ == "__main__":
    # Define the input and output paths
    input_file = r"/home/othniel/pyspark_python/wordcount/input/loremipsum.txt"
    output_dir = "/home/othniel/pyspark_python/wordcount/output_new"

    # Run the WordCount task
    main(input_file, output_dir)
```

How can I resolve this error and ensure PySpark successfully writes to the output directory? Is there something specific I need to configure in my script or environment?
Thank you for your help!