The first step is to specify the AWS Hadoop library when launching PySpark:
./bin/pyspark --packages org.apache.hadoop:hadoop-aws:2.7.1
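Note that the hadoop-aws version should match the Hadoop version your Spark distribution was built against. The same --packages option also works for batch jobs run with spark-submit; the script name below is just a placeholder:

./bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 my_s3_job.py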
Then before you can access objects on Amazon S3, you have to specify your access keys:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "<key>")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "<key>")
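Hardcoding credentials is fine for a quick interactive test, but for scripts it is safer to read them from the environment. A minimal sketch, assuming the keys are exported under the conventional AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variable names:

import os

# Read the keys from environment variables instead of embedding them in code
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3n.awsAccessKeyId", os.environ["AWS_ACCESS_KEY_ID"])
hconf.set("fs.s3n.awsSecretAccessKey", os.environ["AWS_SECRET_ACCESS_KEY"])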
Now let’s open a file and calculate the number of rows in it:
f = sc.textFile("s3n://epic/dmtolpeko/fs_sum.txt")
f.count()
For my sample file the result is as follows:
199
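With the RDD loaded, any other transformation can follow. A couple of quick sanity checks on the same f from above (the empty-line filter is just an illustration):

f.take(3)                                          # show the first three lines of the file
f.filter(lambda line: line.strip() != "").count()  # count only non-empty lines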