Questions tagged [apache-spark]

Apache Spark is an open-source distributed data-processing engine written in Scala that provides a unified API and distributed datasets to users. Typical use cases for Apache Spark include machine/deep learning and graph processing.

-1
votes
0 answers
18 views

Apache-Spark: action after groupBy (with filter) causes TimeoutExc/OOM

I use Scala 2.11.11 & Spark 2.3 and want to process a large RDD (~25 GB) but hit this with // [example code] val conf = new SparkConf().setMaster("local[4]").setAppName("BatchExample") val spark = ...
0
votes
1 answer
19 views

Pyspark create dictionary within groupby

Is it possible in pyspark to create a dictionary within groupBy.agg()? Here is a toy example: import pyspark from pyspark.sql import Row import pyspark.sql.functions as F sc = pyspark.SparkContext() ...
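One way to get a per-group dictionary, sketched under the assumption of Spark 2.4+ (for map_from_entries); the grp/k/v column names are illustrative:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", "x", 1), ("a", "y", 2), ("b", "x", 3)], ["grp", "k", "v"]
)
# Collect (k, v) structs per group, then fold them into one map column.
# map_from_entries requires Spark 2.4+.
result = df.groupBy("grp").agg(
    F.map_from_entries(F.collect_list(F.struct("k", "v"))).alias("dict")
)
result.show(truncate=False)
```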
-2
votes
0 answers
9 views

Writing an RDD that has JSON in it to a Hive table in a remote cluster

I have an RDD which I am trying to write into a Hive table located in a remote cluster. myrdd.foreach(println) gives 2cbeb5cb-219c-4a84-b0b1-fa13de0cbbd4,abc,2019-03-22 17:24:17.484,xyz,some,N,{"a":"...
-1
votes
0 answers
11 views

Spark: copy column value to subsequent rows until the condition is met

I have an input dataframe like this:
column1|column2
-------|-------
false  |null
true   |123
false  |null
false  |null
false  |null
true   |234
false  |null
true   |456
false  |null
false  |null
...
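A minimal sketch of the usual approach: carry the last non-null value forward with last(..., ignorenulls=True) over a window. The ordering column idx is an assumption, since a dataframe has no inherent row order:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, False, None), (2, True, 123), (3, False, None),
     (4, True, 234), (5, False, None)],
    "idx INT, column1 BOOLEAN, column2 INT",
)
# Window from the first row up to the current one, ordered by idx;
# last(..., ignorenulls=True) picks the most recent non-null column2.
w = Window.orderBy("idx").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("filled", F.last("column2", ignorenulls=True).over(w)).show()
```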
0
votes
1 answer
18 views

Spark partitioning breaks the lazy evaluation chain and triggers an error which I cannot catch

When repartitioning, Spark breaks the lazy-evaluation chain and triggers an error which I cannot control/catch. //simulation of reading a stream from s3 def readFromS3(partition: Int)...
1
vote
0 answers
13 views

Can the parallel or snow packages in R interface with a spark cluster?

I am dealing with a computationally intensive package in R. This package has no alternative implementation that interfaces with a Spark cluster; however, it does have an optional argument to take in a ...
-1
votes
0 answers
15 views

Null pointer exception when performing an action on an RDD [duplicate]

I am trying to populate a table structure using dataframes. Here, a column of the table is initially represented by a dataframe, which in turn would be filled by a Scala stream backed by a regular ...
-1
votes
1 answer
10 views

Get instance of Azure Databricks Spark in Python code

I am developing a Python package which will be deployed onto a Databricks cluster. We often need a reference to the "spark" and "dbutils" objects within the Python code. We can access these objects ...
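A sketch of the common pattern: getOrCreate() hands back the session the cluster already started, and pyspark.dbutils wraps dbutils. Note that pyspark.dbutils is a Databricks-only module, so the import is an assumption about the runtime:

```python
from pyspark.sql import SparkSession

def get_spark():
    # On a Databricks cluster this returns the already-running session
    # rather than constructing a new one.
    return SparkSession.builder.getOrCreate()

def get_dbutils(spark):
    # pyspark.dbutils exists only on Databricks runtimes; this import
    # will fail on a plain local pyspark installation.
    from pyspark.dbutils import DBUtils
    return DBUtils(spark)
```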
0
votes
1 answer
13 views

Cannot start spark-shell

Hello, I've unzipped Spark and exported the Spark path. When I launch it I get this error. export PATH=$PATH:/usr/local/spark/spark24/bin $ spark-shell ERROR Traceback (most recent call last): File "/usr/local/...
0
votes
0 answers
33 views

Random order of transformation execution issue

We have an issue with the order of execution of Spark transformations, which seems to be arbitrary. We have 2 RDDs with related events that we classify. Multiple classifiers are applied to rdd1, but for ...
0
votes
1 answer
28 views

Are .withColumn and .agg calculated in parallel in pyspark?

Consider for example df.withColumn("customr_num", col("customr_num").cast("integer")).\ withColumn("customr_type", col("customr_type").cast("integer")).\ agg(myMax(sCollect_list("customr_num"))....
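For what it's worth, chained narrow transformations like these are collapsed by the Catalyst optimizer into a single projection before execution, so they are not separate sequential passes over the data; explain() makes that visible. A small sketch with illustrative data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1", "2")], ["customr_num", "customr_type"])
out = (
    df.withColumn("customr_num", F.col("customr_num").cast("integer"))
      .withColumn("customr_type", F.col("customr_type").cast("integer"))
)
# The optimized plan shows a single Project node containing both casts.
out.explain(True)
```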
0
votes
0 answers
9 views

Passing SparkConf settings from the command line to Spark

I'm trying to get Spark to work with Azure Blob Storage data. The way to pass credentials to it is as follows: spark.conf.set( "fs.azure.account.key.STORAGE_ACCOUNT.blob.core.windows.net", "KEY") ...
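One way to move such a setting to the command line, assuming the key is a Hadoop filesystem property: prefix it with spark.hadoop. so that --conf forwards it into the Hadoop configuration (STORAGE_ACCOUNT and KEY are placeholders):

```python
from pyspark.sql import SparkSession

# Equivalent to:
#   spark-submit --conf "spark.hadoop.fs.azure.account.key.STORAGE_ACCOUNT.blob.core.windows.net=KEY" app.py
# Spark strips the spark.hadoop. prefix and passes the rest to Hadoop.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.azure.account.key."
            "STORAGE_ACCOUNT.blob.core.windows.net", "KEY")
    .getOrCreate()
)
```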
0
votes
0 answers
14 views

Spark: not able to read file from HDFS

I have a file stored in HDFS, and I can see it using hdfs dfs -ls /user. I can also read it using: text_RDD = sc.textFile("hdfs://localhost/user/testfile1.txt") text_RDD.take(1) But strangely when ...
-4
votes
0 answers
24 views

Spark Scala: sparkContext.addFile cannot retrieve file with SparkFiles.get

I'm trying to write a simple program to submit to a Spark cluster; I need to download a JSON file stored on an HTTP server and process it. The cluster is composed of the master node and 1 worker node, ...
-4
votes
0 answers
29 views

How to schedule jobs in Apache Hadoop for Hive & Spark scripts [on hold]

I'm new to the Big Data environment and am trying to understand it. I have a few ETL scripts that I have created in Hive & Spark, and now I want to schedule them to run on a daily basis. Can somebody tell me how to ...
0
votes
0 answers
21 views

dropDuplicates is not giving the expected result

I am working on a use-case of removing duplicate records from incoming structured data (in the form of CSV files within a folder on HDFS). In order to try this use-case, I wrote some sample code using ...
-1
votes
0 answers
4 views

Spark: Hardware Requirements for Cluster Manager [on hold]

I'm planning on setting up a Spark cluster for some ETL jobs. How many computing resources should I allocate to the cluster manager (I was thinking of using YARN as the cluster manager)?
0
votes
0 answers
10 views

Spark 1.2 on AWS EMR with EMR release > 5.0

I use AWS EMR to run a Java Spark application. Is it possible to install Spark version 1.2 as a bootstrap action for EMR release > 5.0? When I try to do this, I get an error that such a bootstrap ...
0
votes
0 answers
14 views

Spark dataframe output still shows CRLF in Windows Notepad++

I am creating a temp view in Spark using the df.createOrReplaceTempView function. After creating the view, I apply SQL on the last column to remove the carriage return. Given below is a sample of the code. ...
0
votes
1 answer
16 views

Checking whether elements of a tweets array contain one of the elements of a positive-words array, and counting

We are building a sentiment analysis application and we converted our tweets dataframe to an array. We created another array consisting of positive words. But we cannot count the number of tweets ...
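A minimal sketch of one way to do the count, assuming Spark 2.4+ for arrays_overlap; the words column and the word list are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
tweets = spark.createDataFrame(
    [(["what", "a", "good", "day"],), (["bad", "luck"],)], ["words"]
)
positive = ["good", "great", "happy"]
# arrays_overlap (Spark 2.4+) is true when the tweet's word array shares
# at least one element with the positive-word array.
count = tweets.filter(
    F.arrays_overlap("words", F.array(*[F.lit(w) for w in positive]))
).count()
print(count)  # 1
```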
0
votes
0 answers
15 views

Reading an Avro file with a schema into a Spark Dataset using a case class

I'm struggling to find a suitable example of reading an Avro file into a Spark Dataset using a case class. I would like to use a Spark Dataset rather than a DataFrame, and I would like to maintain my Avro schema in ...
0
votes
0 answers
8 views

Specify hbase-site.xml for a mapreduce job from a shell script

My mapreduce job is provided with hbase-site.xml when invoked from a shell script. hadoop jar Map-reduce.jar \ org.test.hadoop.driver.MytestDriver -libjars /test/hbase-site.xml \ -...
0
votes
0 answers
6 views

Trigger.Once() metadata needed

Hi guys, a simple question for the experienced: I have a Spark job reading files under a path. I wanted to use Structured Streaming even though the source is not really a stream but just a folder with a ...
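A sketch of the folder-as-stream pattern the asker describes, with hypothetical paths and schema; the checkpoint directory holds the metadata that lets a later run skip files already processed:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])
# trigger(once=True) processes whatever is in the folder now, then stops.
query = (
    spark.readStream.schema(schema).json("/data/incoming")
    .writeStream
    .format("parquet")
    .option("path", "/data/out")
    .option("checkpointLocation", "/data/checkpoint")
    .trigger(once=True)
    .start()
)
query.awaitTermination()
```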
-1
votes
2 answers
41 views

Remove zeros in the middle of a string (but keep the ones at the end) in pyspark

I need to remove zeros that are in the middle of a string, while keeping the ones at the end (in pyspark). So far I have only found regexes that remove leading or trailing zeros. Example: df1 = spark....
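One regex that fits, as a sketch: a zero that still has a non-zero digit somewhere to its right is leading or in the middle, while trailing zeros fail the lookahead and survive. Note this variant strips leading zeros as well:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([("12030400",), ("10200",)], ["value"])
# Remove any '0' followed somewhere by a non-zero digit;
# trailing zeros have no such digit after them, so they are kept.
df1.withColumn(
    "cleaned", F.regexp_replace("value", r"0(?=[0-9]*[1-9])", "")
).show()
# 12030400 -> 123400, 10200 -> 1200
```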
0
votes
0 answers
13 views

Databricks fails accessing a Data Lake Gen1 while trying to enumerate a directory

I am using (well... trying to use) Azure Databricks and I have created a notebook. I would like the notebook to connect to my Azure Data Lake (Gen1) and transform the data. I followed the documentation ...
0
votes
0 answers
20 views

Bitwise ORing select columns in Spark Dataframe

I have a spark dataframe dist with the following schema:
+-----+-----+-----+-----+-----+
| id1 | id2 | _c1 | _c2 | _c3 |
+-----+-----+-----+-----+-----+
| int | int | bin | bin | bin |
+-----+-----+-----+...
-3
votes
0 answers
32 views

How to find the correlation between two columns of a dataframe in Spark Scala?

I am working with a big DataFrame and am trying to get the correlation between two columns. I used this code in Scala: val corr = df.stat.corr("BEN", "O_3")
1
vote
0 answers
26 views

How to convert an array of vectors to a dataframe in Spark?

I have clustercenters = model.clusterCenters from a k-means model in org.apache.spark.ml.clustering.KMeans. The result is Array[org.apache.spark.ml.linalg.Vector], and I want to convert this into a dataframe ...
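The question uses the Scala API; keeping the examples here in Python, a pyspark analogue is to wrap each center in a one-element tuple so createDataFrame infers a single vector column (the two dense vectors below stand in for model.clusterCenters):

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
# Stand-ins for the cluster centers returned by a fitted KMeansModel.
centers = [Vectors.dense([0.1, 0.2]), Vectors.dense([0.9, 0.8])]
df = spark.createDataFrame([(c,) for c in centers], ["center"])
df.show(truncate=False)
```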
0
votes
2 answers
55 views

Picking the same column from multiple rows into one row with multiple columns

I have the two DFs below: MasterDF and NumberDF (created using a Hive load). Desired output and the logic to populate it: for Field1, I need to pick sch_id where CAT='PAY' and SUB_CAT='client'; for Field2, I need to pick sch_id ...
1
vote
0 answers
28 views

Use Spark SQL JDBC Server/Beeline or spark-sql

In Spark SQL, there are two options to submit SQL: spark-sql, which kicks off a new Spark application for each SQL statement, and the Spark JDBC Server with Beeline; the JDBC Server is actually a long-running ...
0
votes
1 answer
38 views

Writing the output of Spark SQL to a dataframe

I'm currently running the ANALYZE command for a particular table and can see the statistics printed in the Spark console. However, when I try to write the output to a DF, I cannot see the ...
-2
votes
0 answers
26 views

When using a when condition with withColumn(), getting the error "Can't convert 'int' object to str implicitly"

Having a dataframe with this schema: monthsdiff:double, date:date, end_mon:double, creation_mon:double
creation_date | invoice_date | creation_month | end_month | monthsdiff
2017-07-18|2018-02-...
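The reported TypeError usually comes from mixing a raw Python value into a string context inside the expression; building both branches as Column expressions avoids it. A minimal sketch with an illustrative threshold:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.5,), (14.0,)], ["monthsdiff"])
# Give when/otherwise Column expressions (F.col, F.lit) rather than
# concatenating Python ints into strings.
df.withColumn(
    "bucket",
    F.when(F.col("monthsdiff") < 12, F.lit("lt_12")).otherwise(F.lit("ge_12")),
).show()
```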
0
votes
0 answers
26 views

Unable to import the Spark Avro package in Spark 2.3 ("import org.apache.spark.sql.avro._") in IntelliJ

Import error in IntelliJ: import org.apache.spark.sql.avro._ (image: https://i.stack.imgur.com/MvT19.png). Spark 2.3 doesn't support Databricks Avro, so the steps to support Avro below Spark version 2.4 are ...
-2
votes
0 answers
19 views

Pyspark help required to add an extra column [on hold]

I have a dataframe created from an RDD (from CSV) by adding a schema using StructFields. Now I need to add an extra column in which the data should be the name of the existing columns in which we have the value '...
-3
votes
0 answers
19 views

Spark-Scala application using the Template Design Pattern

I am developing a Spark-Scala application in which I am planning to use the Template Design Pattern. Here is the proposed design: ProjectTemplate.scala => a trait containing functions such as ...
-1
votes
0 answers
16 views

ShuffledRDD vs CoGroupedRDD

I created two RDDs, rdd1 and rdd2. rdd1 had 10 partitions and rdd2 had 5. Neither RDD has a partitioner specified. Then I executed the code below. val joined = rdd1.join(rdd2) ...
-1
votes
1 answer
22 views

Spark scheduler pool jobs are not running in parallel as I expected

I am trying to run two Spark actions as below, and I expect them to run in parallel as they both use different pools. Does scheduling using pools mean that different independent actions will run ...
0
votes
0 answers
20 views

spark-submit works for yarn-cluster mode but SparkLauncher doesn't, with same params

I'm able to submit a Spark job through spark-submit; however, when I try to do the same programmatically using SparkLauncher, it gives me nothing (I don't even see a Spark job in the UI). Below is the ...
-3
votes
1 answer
48 views

Compare each row in a dataframe with the other rows and merge them if the content matches

I have a dataframe like this:
+----+---------------------+---------------------+-------+
|acct|conds                |benes                |amount |
+----+---------------------+---------------------+--...
0
votes
1 answer
49 views

New to Pyspark - importing a CSV and creating a parquet file with array columns

I am new to Pyspark and I've been pulling my hair out trying to accomplish something I believe is fairly simple. I am trying to do an ETL process where a CSV file is converted to a parquet file. The ...
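A minimal sketch of the usual route, with hypothetical paths and a hypothetical pipe-delimited tags column: read the CSV, split the delimited string into an array column, and write parquet, which stores arrays natively:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("/data/input.csv", header=True)
# CSV cannot hold real arrays, so the array arrives as one delimited
# string; split() turns "a|b|c" into ["a", "b", "c"].
df = df.withColumn("tags", F.split(F.col("tags"), r"\|"))
df.write.mode("overwrite").parquet("/data/output")
```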
0
votes
1 answer
25 views

Why does the --packages command leave the Python package unavailable or not loadable from the Spark client/driver?

I want to add the graphframes library. Normally this library is added by (for example): pyspark --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11 and then you should get something like: ...
-2
votes
0 answers
26 views

How to manage writing into a CSV file with a header dynamically

val rdd = df.rdd.map(line => Row.fromSeq( "BNK" :: format.format(Calendar.getInstance().getTime()) :: line(0) :: scala.xml.XML.loadString("<?xml version='1.0' ...
0
votes
1 answer
24 views

MapReduce-style Spark data processing with RDDs (Scala)

I have big data and I want to use MapReduce on it, but I can't find anything for this task. (Language: Scala.) The data for this process is: Y,20,01 G,18,40 J,19,10 D,50,10 R,20,01 Z,18,40 T,...
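The classic RDD pattern for data shaped like this is map followed by reduceByKey. The question is Scala, but keeping all examples here in Python, a sketch that assumes the first field is the key and the second the value to aggregate:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
lines = sc.parallelize(["Y,20,01", "G,18,40", "J,19,10", "R,20,01"])
# Map each line to a (key, value) pair, then reduce pairs sharing a key.
totals = (
    lines.map(lambda s: s.split(","))
         .map(lambda f: (f[0], int(f[1])))
         .reduceByKey(lambda a, b: a + b)
)
print(totals.collect())
```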
0
votes
1 answer
19 views

Can I mark a POJO as a Hibernate entity while this POJO is used by Apache Spark?

The project I currently work on has some POJO files that are used by Spark in the way below: JavaRDD<MyPojo> = ... sqlContext.createDataFrame(rdd, MyPojo.class); However, I also ...
0
votes
1 answer
16 views

Enabling Spark Web UI in AWS EMR

I am submitting a Spark job on an EMR cluster and I want to see the Spark Web UI, which gives information about the configuration and status of the master node and the worker nodes. Configuration ...
0
votes
0 answers
31 views

How does the RDD take() method work internally?

I understand that take(n) will return n elements of an RDD, but how does Spark decide which partitions to take those elements from, and which elements should be chosen? Does it maintain indexes ...
0
votes
0 answers
8 views

What does Apache Spark ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 6237761604204073263 mean?

We are occasionally receiving this error in Amazon Elastic Map Reduce using Apache Spark: 19/03/21 16:25:39 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id ...
0
votes
0 answers
7 views

How to get the timed-out state using mapGroupsWithState without calling the function in Apache Spark

We ran into an issue with mapGroupsWithState stateful processing. Requirement: ingest all the data, group by key, and apply a timeout for that state. Then, after the timeout expires, we want to ...
1
vote
1 answer
29 views

Does "WARN Client: Same path resource file:///tmp/programs95923.zip added multiple times to distributed cache" matter?

We have a large Apache Spark application running in Amazon EMR. I am trying to get rid of all the WARN messages in the logfile. When our application starts, we make a ZIP file of the program's Python ...
0
votes
0 answers
13 views

Does "Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME" matter?

We are running a large Spark application at Amazon Elastic Map Reduce. I've been working hard to remove all WARN messages in the logfile. This is one of two remaining: 19/03/21 14:08:09 WARN Client: ...
