I want to filter an RDD, created from a dataset, based on the length of each line, using the PySpark shell.
My data file looks like this:
> fzDTn342L3Q djjohnnykey 599 Music 185 1005 3.67 3 1 KDrJSNIGNDQ MacjQFNVLlQ oZ6f2vaH858 fYSjMDNa4S8 JUatPzf_eSc QfBFl7kU35c rG-rQ-YGdSA kOq6sFmoUr0 IRj1IABVBis AVsZ0VH3eN4 r1pS_4qouUc YgaNW1KRgK4 ZlGdVR7mBy4 nKFLE3DX4OQ EtQjN6CQeCc afe-0VY4YiI ekV5NseEdy8 IQs6CrER5fY jTLcoIxMI-E yfvW1ITcMpM
> kOq6sFmoUr0 djjohnnykey 599 Music 113 992 0 0 1 MacjQFNVLlQ fYSjMDNa4S8 4vso1y_-cvk 8BwAX6YBx3E QeUQyf8H7vM jmc21-Nhewg hZUU2-UBaGk SaLaotssH0w PUlcrBaYpwI tjIK2xop4L0 BNlL15OYnFY _pzP7OLInjk 4daGJ6TMcp4 _8jM9R-1yRk KDrJSNIGNDQ oZ6f2vaH858 JUatPzf_eSc QfBFl7kU35c rG-rQ-YGdSA fzDTn342L3Q
Here the 4th column is the category. Some lines in the data file do not contain this field and are therefore shorter. I want to filter the dataset on this criterion and then build further RDDs only from the records that have a category.
I created the initial RDD from the dataset:
>>> data="/Users/sk/Documents/BigData/0222/0.txt"
>>> input = sc.textFile(data)
Now I am splitting each line by tab and saving the result in a lines RDD:
>>> lines = input.map(lambda x: (str(x.split('\t'))))
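As a side note (a plain-Python check, outside Spark, using a made-up sample line shaped like the data above): wrapping the split result in `str()` turns the list of fields back into a single string, so a later `len()` counts characters rather than fields.

```python
# Why str(x.split('\t')) is problematic: str() turns the list of fields
# back into one string, so len() counts characters, not fields.
line = "fzDTn342L3Q\tdjjohnnykey\t599\tMusic"  # made-up sample record

as_list = line.split('\t')    # ['fzDTn342L3Q', 'djjohnnykey', '599', 'Music']
as_string = str(as_list)      # "['fzDTn342L3Q', 'djjohnnykey', '599', 'Music']"

print(len(as_list))    # 4 -> number of tab-separated fields
print(len(as_string))  # character count of the stringified list, much larger
```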
After this I want to filter out the lines whose length is less than 3:
>>> data="/Users/sk/Documents/BigData/0222/1.txt"
>>> input = sc.textFile(data)
>>> lines = input.map(lambda x: (str(x.split('\t'))))
>>> lines.count()
3169
>>> newinput = input.filter(lambda x: len(x)>3)
>>> newinput.count()
3169
The filter does not remove anything from my RDD. Can anyone please help?
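For clarity, here is a minimal sketch of what I am trying to achieve. It uses a plain Python list in place of an RDD so it runs without a Spark installation, and assumes the goal is to keep only records with at least 4 tab-separated fields (the sample lines are made up); inside Spark the same predicate would go into `rdd.filter(...)`:

```python
# Sketch of the intended filtering logic, on a plain list instead of an RDD.
# Sample lines are made up; the second one is missing the category field.
raw_lines = [
    "fzDTn342L3Q\tdjjohnnykey\t599\tMusic\t185",  # has the 4th (category) field
    "abc123\tsomeuser\t42",                       # too short: no category
]

# Split each line on tab, then keep only records with at least 4 fields.
records = [line.split('\t') for line in raw_lines]
kept = [r for r in records if len(r) >= 4]

print(len(kept))      # 1
print(kept[0][3])     # Music
```

In Spark this would correspond to something like `input.map(lambda x: x.split('\t')).filter(lambda r: len(r) >= 4)`, filtering on the number of fields rather than on the character length of the raw line.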