
I want to filter an RDD, created from a dataset, based on the length of each line. I am using the PySpark shell.

My data file looks like this:

> fzDTn342L3Q   djjohnnykey 599 Music   185 1005    3.67    3   1   KDrJSNIGNDQ MacjQFNVLlQ oZ6f2vaH858 fYSjMDNa4S8 JUatPzf_eSc QfBFl7kU35c rG-rQ-YGdSA kOq6sFmoUr0 IRj1IABVBis AVsZ0VH3eN4 r1pS_4qouUc YgaNW1KRgK4 ZlGdVR7mBy4 nKFLE3DX4OQ EtQjN6CQeCc afe-0VY4YiI ekV5NseEdy8 IQs6CrER5fY jTLcoIxMI-E yfvW1ITcMpM
> 
> kOq6sFmoUr0   djjohnnykey 599 Music   113 992 0   0   1   MacjQFNVLlQ fYSjMDNa4S8 4vso1y_-cvk 8BwAX6YBx3E QeUQyf8H7vM jmc21-Nhewg hZUU2-UBaGk SaLaotssH0w PUlcrBaYpwI tjIK2xop4L0 BNlL15OYnFY _pzP7OLInjk 4daGJ6TMcp4 _8jM9R-1yRk KDrJSNIGNDQ oZ6f2vaH858 JUatPzf_eSc QfBFl7kU35c rG-rQ-YGdSA fzDTn342L3Q

Here the 4th column is the category. Some lines in the data file do not contain this field and are therefore shorter. This motivates me to filter the dataset on that criterion and then build further RDDs only from the lines that do have a category.
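For reference, splitting one of these lines on tabs puts the category at index 3. Here is a quick sanity check in plain Python (the line below is abbreviated from my data above):

>>> line = "fzDTn342L3Q\tdjjohnnykey\t599\tMusic\t185\t1005\t3.67\t3\t1"
>>> fields = line.split('\t')
>>> len(fields)   # 9 fields in this abbreviated line
9
>>> fields[3]     # the category column
'Music'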

I have created an initial RDD from the dataset:

>>> data="/Users/sk/Documents/BigData/0222/0.txt"
>>> input = sc.textFile(data)

Now I am splitting each line on tabs and saving the result in a lines RDD:

>>> lines = input.map(lambda x: (str(x.split('\t'))))

After this I want to filter out the lines whose length is less than 3 (here I am working with a second file, 1.txt):

>>> data="/Users/sk/Documents/BigData/0222/1.txt"
>>> input = sc.textFile(data)
>>> lines = input.map(lambda x: (str(x.split('\t'))))
>>> lines.count()
3169

>>> newinput=input.filter(lambda x: len(x)>3)
>>> newinput.count()
3169

However, this does not change anything in my RDD. Can anyone please help?
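For what it's worth, here is what I think is happening, checked in plain Python: each x in the filter is the entire line as one string, so len(x) is its character count, which is practically never 3 or less. What I presumably need is to compare the number of tab-separated fields instead; the last line below is a sketch of that intent, which I have not verified:

>>> sample = "fzDTn342L3Q\tdjjohnnykey\t599"   # abbreviated line
>>> len(sample)               # character count of the whole line
27
>>> len(sample.split('\t'))   # number of tab-separated fields
3
>>> newinput = input.filter(lambda x: len(x.split('\t')) > 3)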

  • You might need to collect() and then count(). Something like newinput = input.filter(lambda x: len(x)>3).collect() and then newinput.count() – St1id3r Apr 15 at 23:41
  • Could you upload the text file? – Sai Apr 16 at 0:09
  • data set can be found here drive.google.com/file/d/0ByJLBTmJojjzR2x0MzVpc2Z6enM/view – Sowmya Kudva Apr 16 at 2:13
  • @St1id3r I tried your solution and got an error: newinput = input.filter(lambda x: len(x)>3).collect() followed by newinput.count() raises TypeError: count() takes exactly one argument (0 given) – Sowmya Kudva Apr 16 at 16:39
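For context on that TypeError: collect() returns a plain Python list, and list.count(value) counts occurrences of a given value, so it requires an argument. Use len() on the collected list, or call .count() on the RDD before collecting. A quick illustration in plain Python:

>>> items = ['a', 'b', 'a']
>>> items.count('a')   # list.count needs the value whose occurrences to count
2
>>> len(items)         # total size of the list
3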

A couple of things about your solution. I'm not sure using RDDs is advisable here, given that this is Python (you might want to rethink that). Using DataFrames would be easier and more performant.
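If you do want to stay with RDDs, note that str(x.split('\t')) in your code turns the field list straight back into a single string. A sketch of the fix, keeping the list and filtering on the number of fields (untested against your data):

>>> # Keep the split fields as a list, then filter on field count.
>>> lines = input.map(lambda x: x.split('\t'))
>>> withCategory = lines.filter(lambda fields: len(fields) > 3)

That said, with DataFrames the same check is much simpler: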

>>> x = spark.read.option("sep","\t").csv("/data/youtubedata.txt")
>>> x.count()
4100
>>> from pyspark.sql.functions import length
>>> from pyspark.sql.functions import col, size
>>> x.filter(length(col("_c3")) > 3).count()
4066
>>> x.filter(x._c3.isNull()).count()
34
>>> x.filter(x._c3.isNotNull()).count()
4066

Update: Updated with counts.
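If it helps, here is a sketch of carrying on with only the rows that have a category; categorized and the category column name are my own illustrative choices, not anything from your data:

>>> # Keep only rows whose category column is non-null, then rename it.
>>> categorized = x.filter(x._c3.isNotNull()).withColumnRenamed("_c3", "category")
>>> categorized.count()   # the same 4066 rows as the isNotNull() check above
4066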

  • I tried your solution, but there is still no change in the number of lines: input = sc.textFile(data) followed by input.count() still gives 3169 – Sowmya Kudva Apr 16 at 16:59
  • Updated my answer with respect to your test data. Also, if you look at the non-null counts, they add up to the total. Let me know if this works for you. – Achilleus Apr 16 at 19:28
