Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine written in Scala that provides a unified API and distributed datasets to users. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

0
votes
0 answers
7 views

java.io.FileNotFoundException: ./src/main/resources/config.properties (No such file or directory) in spark-submit

I load the config via in = instance.getClass().getResourceAsStream("/config.properties"); props.load(in); This works fine in IDEA on my Windows 7 machine, but when run via spark-submit it suffers a ...
0
votes
1 answer
14 views

Building derived column using Spark transformations

I have a table of records as shown below. Id Indicator Date 1 R 2018-01-20 1 R 2018-10-21 1 P 2019-01-22 2 R 2018-02-28 2 P 2018-05-22 2 ...
0
votes
0 answers
16 views

Combining results of multiple data frames: merging file in Scala

I have a requirement to merge files written by 3 different data frames in HDFS into a single file and finally put it on the local filesystem. This task is child's play with normal Hadoop commands ...
0
votes
1 answer
18 views

spark scala transform a dataframe/rdd

I have a CSV file like below. PK,key,Value 100,col1,val11 100,col2,val12 100,idx,1 100,icol1,ival11 100,icol3,ival13 100,idx,2 100,icol1,ival21 100,icol2,ival22 101,col1,val21 101,col2,val22 101,idx,...
1
vote
0 answers
18 views

How to join three DStreams in Spark streaming using Python

I have three Kafka producers which send data streams to the same topic at random intervals between 5 and 10 seconds. There is a Spark consumer (Python-based) which is consuming the data. The ...
1
vote
1 answer
22 views

How to return distinct sets in a PySpark RDD?

I have an RDD whose elements are sets, and I want to return all the distinct sets from the original RDD. Is there any key term such as distinct? example = sc.parallelize([{1}, {2}, {3}, {1}]) ...
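A minimal sketch of one workaround: RDD.distinct() relies on hashing, and Python sets are unhashable, so mapping each set to a frozenset first lets distinct() work, after which the elements can be mapped back to plain sets.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    example = sc.parallelize([{1}, {2}, {3}, {1}])
    # sets are unhashable, so convert to frozenset before distinct(), then back to set
    distinct_sets = example.map(frozenset).distinct().map(set)
    print(distinct_sets.collect())  # [{1}, {2}, {3}] in some order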
0
votes
0 answers
19 views

Pyspark - ranking column replacing ranking of ties by Mean of rank

Consider a data set with ranking +--------+----------+-----+----+---------+---------+ | | colA|colB|colA_rank|colA_rank_mean| +-------------------+-----+----+---------+---------+ | 21| 50| ...
0
votes
0 answers
6 views

Errors Creating Simple Azure HDInsight Spark Cluster with Pulumi

I am attempting to use the Pulumi Javascript SDK to create a HDInsight Spark Cluster on Azure. I have followed the tutorials Pulumi provides on creating a "hello world" GCP Kubernetes cluster and gone ...
1
vote
0 answers
13 views

How to fix '[errno 2] no such file or directory' error in apache-spark

I am trying to set up apache-spark with python and visual studio code. I followed a tutorial up to this point and am not getting any errors with my code but when I try to run the code spark-submit ...
0
votes
0 answers
10 views

error while writing Dataframe into HDFS path [duplicate]

Writing a DataFrame to an HDFS path failed: df_export_data.write.format("csv").option("header", "true").option("delimiter","|").partitionBy("sales_org_cd").mode("append").save("/tmp/test/") ...
0
votes
0 answers
15 views

How to zip hundreds of Spark RDDs into one

I need to pipe to many different scripts and denormalize all the results into a single table (matching this use case perfectly), but I've reached a logical conundrum in doing so: Do I only pipe the columns ...
-2
votes
0 answers
13 views

Pyspark converting date string (format - YYYYMMDDHHMMSS) to epoch timestamp

I'm using PySpark (Apache Spark 2.2, Python 2) and trying to convert a string in the format YYYYMMDDHHMMSS into a timestamp column. However, I'm getting null values. Code -> from pyspark.sql.functions ...
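A hedged sketch of the usual fix: null results typically mean the pattern passed to the parser does not match the data. For a 14-digit string such as 20190131235959, the matching Java-style pattern is yyyyMMddHHmmss, and unix_timestamp returns the epoch seconds.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("20190131235959",)], ["ts_str"])
    # parse the 14-digit string and keep the result as epoch seconds
    df = df.withColumn("epoch", F.unix_timestamp(F.col("ts_str"), "yyyyMMddHHmmss"))
    df.show()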
0
votes
0 answers
15 views

Setting up py4j correctly with jupyter-notebook

I am trying to set up spark with jupyter-notebook. I followed an online tutorial and it seems to be working except I keep getting errors related to py4j. I have tried a couple different code ...
1
vote
0 answers
15 views

Spark Structured Streaming unable to write parquet data to HDFS

I'm trying to write data to HDFS from a Spark Structured Streaming job in Scala, but I'm unable to do so due to an error that I fail to understand. In my use case, I'm reading data from a Kafka ...
0
votes
0 answers
13 views

How to completely clean Spark Session?

My use-case revolves around either restarting/cleaning the SparkSession per job execution. When I have to clean the SparkSession without restarting, I'm currently only doing sparkSession.catalog()....
0
votes
1 answer
16 views

How to retain completed applications after yarn server restart in spark web-ui

I am using the YARN resource manager for Spark. After a restart of the YARN server, all completed jobs in the Spark web UI disappeared. The two properties below were added in yarn-site.xml. Can someone explain to me what could ...
-5
votes
0 answers
29 views

Checkpointing RDBMS table in Spark (Scala)

I have a Spark Scala SBT project where I am writing data to different RDBMS tables. How can I validate that the data was written to the tables? Is there any checkpointing table that we can use?
0
votes
0 answers
11 views

Can't connect to Hive server with spark JDBC in kerberised cluster

I am trying to read data from one Hive (Hive no. 1) and write the result into another Hive (Hive no. 2) (they belong to 2 different clusters). I can't use a single Spark session to connect to both Hives, so I will ...
1
vote
1 answer
27 views

Why is AWS rejecting my connections when I use wholeTextFiles() with PySpark?

I use sc.wholeTextFiles(",".join(fs), minPartitions=200) to download 6k XML files from S3 (each file 50 MB) on a single Dataproc node with 96 CPUs. When I have minPartitions=200, AWS rejects my ...
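A hedged sketch of one mitigation (the bucket paths are made up): S3 tends to throttle a burst of concurrent requests from a single node, so requesting fewer partitions for the download and repartitioning afterwards reduces the simultaneous connections while keeping downstream parallelism.

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    fs = ["s3a://my-bucket/xml/part-0/", "s3a://my-bucket/xml/part-1/"]  # hypothetical paths
    # fewer partitions while reading = fewer simultaneous S3 connections
    rdd = sc.wholeTextFiles(",".join(fs), minPartitions=32)
    rdd = rdd.repartition(200)  # restore parallelism for the processing stages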
0
votes
0 answers
8 views

Active master's IP and mtime not updated in master_status znode created by spark

I have configured Spark for HA using SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zkServer1:2181,zkServer2:2181,zkServer3:2181" So, it creates two ...
-2
votes
0 answers
14 views

PySpark: tried several ways to print a pipelined RDD but none worked: u"\nmismatched input '<EOF>'

I have a pipelined RDD, and I tried several ways to print it. I use the map function to apply the function 'nodes' to every element in the data RDD. The 'data.map(nodes)' is of type 'pipelined DataFrame'. The '...
1
vote
0 answers
14 views

Count operation not working on aggregated IgniteDataFrame

I am working with Apache Spark and Apache Ignite. I have a Spark dataset which I wrote to Ignite using the following code: dataset.write() .mode(SaveMode.Overwrite) ....
-1
votes
1 answer
28 views

Create a DataFrame from printSchema output

I have created a DataFrame on top of a Parquet file and am now able to see the DataFrame schema. Now I want to create a DataFrame on top of the printSchema output. df = spark.read.parquet("s3/location") df....
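A hedged sketch of an alternative: rather than parsing the text that printSchema() prints, the schema is available programmatically as df.schema, and its fields can be turned into a small DataFrame of column names and types.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("s3/location")
    # build a (column_name, data_type) DataFrame from the StructType instead of the printed text
    schema_df = spark.createDataFrame(
        [(f.name, f.dataType.simpleString()) for f in df.schema.fields],
        ["column_name", "data_type"],
    )
    schema_df.show()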
0
votes
2 answers
34 views

Is there a way to get the value from a column at a specific row and put it in the next row?

I have data that looks like the following: ID Sensor No 1 specificSensor 1 2 1234 null 3 1234 null 4 specificSensor 2 5 2345 null 6 ...
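A hedged sketch of one reading of the question (the sample rows are reconstructed from the truncated excerpt): forward-fill the No value onto the following rows with last(..., ignorenulls=True) over a window ordered by ID.

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "specificSensor", 1), (2, "1234", None), (3, "1234", None),
         (4, "specificSensor", 2), (5, "2345", None)],
        ["ID", "Sensor", "No"],
    )
    # carry the last non-null "No" forward onto the rows that follow it
    w = Window.orderBy("ID").rowsBetween(Window.unboundedPreceding, 0)
    df.withColumn("No_filled", F.last("No", ignorenulls=True).over(w)).show()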
3
votes
1 answer
36 views

HBase doesn't work well with spark-submit

I have an app that does some work and at the end needs to read a file from HDFS and store it in HBase. The app runs with no issue using Apache Spark with a local master, but when I run it ...
0
votes
0 answers
16 views

Apache Spark temp file size on disk

I have a setup where the incoming data from a Kafka cluster is processed by an Apache Spark streaming job. Version info: Kafka = 0.8.x, Spark version = 2.3.1. Recently, when the capacity of the Kafka cluster ...
0
votes
1 answer
11 views

How to read stream of structured data and write to Hive table

There is a need to read a stream of structured data from Kafka and write it to an already existing Hive table. Upon analysis, it appears that one of the options is to do a readStream of the Kafka ...
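A hedged sketch of the foreachBatch route available from Spark 2.4 (the broker, topic, schema and table name are all assumptions): each micro-batch is appended to the existing Hive table with a plain batch write.

    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    schema = T.StructType([T.StructField("id", T.StringType()),
                           T.StructField("value", T.DoubleType())])

    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "host:9092")
              .option("subscribe", "events")
              .load()
              .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
              .select("r.*"))

    def write_batch(batch_df, batch_id):
        # an ordinary batch write into the pre-existing Hive table
        batch_df.write.mode("append").insertInto("db.existing_table")

    (stream.writeStream
           .foreachBatch(write_batch)
           .option("checkpointLocation", "/tmp/checkpoints/kafka_to_hive")
           .start())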
0
votes
2 answers
37 views

Pyspark - Aggregate all columns of a dataframe at once

I want to group a dataframe on a single column and then apply an aggregate function on all columns. For example, I have a df with 10 columns. I wish to group on the first column "1" and then apply an ...
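A minimal sketch (the grouping column "1" and sum() as the aggregate are taken from the question; the toy data is made up): build the list of aggregation expressions from df.columns instead of typing them out.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1, 2), ("a", 3, 4), ("b", 5, 6)], ["1", "2", "3"])
    # one aggregate expression per non-grouping column
    agg_exprs = [F.sum(c).alias("sum_" + c) for c in df.columns if c != "1"]
    df.groupBy("1").agg(*agg_exprs).show()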
0
votes
0 answers
11 views

Class org.apache.oozie.action.hadoop.SparkMain not found

The following are all the Oozie files I have been using to run the job. I have created the folder /test/jar on HDFS and put the workflow.xml and coordinator.xml files there. Properties file: nameNode=hdfs://host:8020 ...
1
vote
0 answers
28 views

How to guarantee process-once-only REST API calls for all records of a Spark DataFrame

I wanted to use foreachPartition on a DataFrame to send the data of each row ONLY once to a REST API. val aDF= ... ///sc.parallelize(0 to 1000000,4) i.e. a dataframe of ~1M rows aDF.foreachPartition(rows => ...
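A hedged sketch in PySpark (the question uses Scala; the endpoint is made up): one HTTP session per partition, one POST per row. Note that foreachPartition by itself gives at-least-once behaviour, since Spark may re-run a failed task, so true process-once additionally needs idempotent calls or external de-duplication.

    import requests
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    aDF = spark.range(0, 1000000).repartition(4)  # stand-in for the ~1M-row DataFrame

    def post_partition(rows):
        session = requests.Session()  # reuse one connection per partition
        for row in rows:
            session.post("https://example.com/api", json=row.asDict())  # hypothetical endpoint

    aDF.foreachPartition(post_partition)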
-3
votes
0 answers
17 views

How to build a master-slave architecture using virtual machines to run Spark [on hold]

Hi, I am a beginner. I created a Spark (Scala) project and ran it in local mode (my master=local[*]), but now I want to run my Scala code based on the architecture ...
3
votes
0 answers
25 views

Hive partitioned table reads all the partitions despite having a Spark filter

I'm using Spark with Scala to read a specific Hive partition. The partitions are year, month, day, a and b. scala> spark.sql("select * from db.table where year=2019 and month=2 and day=28 and a='y' ...
0
votes
0 answers
23 views

How to convert a string date in the first column according to the format string from the second column [duplicate]

I have an Apache Spark DataFrame with two columns. The first column contains string representations of dates, the second a string with the date format. +------------+------------+ | date | format | +------...
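A hedged sketch of one approach (the sample rows are made up): to_date() only accepts a literal pattern, so one option is to collect the distinct format strings and build a when/coalesce chain, one branch per format.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("2019-02-28", "yyyy-MM-dd"), ("28/02/2019", "dd/MM/yyyy")],
        ["date", "format"],
    )
    # one to_date branch per distinct format; coalesce keeps the branch that matched
    formats = [r[0] for r in df.select("format").distinct().collect()]
    parsed = F.coalesce(*[F.when(F.col("format") == f, F.to_date("date", f)) for f in formats])
    df.withColumn("parsed_date", parsed).show()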
-1
votes
1 answer
15 views

Add column to a Dataset based on the value from Another Dataset

I have a dataset dsCustomer that has the customer details with columns |customerID|totalAmount| |customer1 | 250 | |customer2 | 175 | |customer3 | 4000 | I have another dataset ...
0
votes
0 answers
15 views

How does Spark decide the number of partitions/tasks to create when it reads from Hive?

Let's say: We have a table stored in Hive partitioned on the date. For example: we have a table called Person and a partition inside it called datestr=2019-01-01 and it is stored in Parquet format(...
0
votes
2 answers
35 views

How to reduceByKey in PySpark with custom grouping of rows?

I have a dataframe that looks as below: items_df ====================================================== | customer item_type brand price quantity | |======================================...
-1
votes
0 answers
28 views

Is there any way I can create a Spark DataFrame from the result set of Spark SQL returning results from a for loop?

I am trying to find the differences between source and target columns by iterating over the columns in a for loop and passing them into Spark SQL. My tables: source and target. My input = dictionary = {source_col1 : ...
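A hedged sketch (the join key id, the mismatch count and the dictionary values are assumptions about the comparison being run): execute one query per column pair, collect the resulting DataFrames in a list, and union them into a single DataFrame at the end.

    from functools import reduce
    from pyspark.sql import SparkSession, DataFrame

    spark = SparkSession.builder.getOrCreate()
    dictionary = {"source_col1": "target_col1", "source_col2": "target_col2"}  # illustrative

    results = []
    for src_col, tgt_col in dictionary.items():
        results.append(spark.sql(
            f"SELECT '{src_col}' AS column_name, COUNT(*) AS mismatches "
            f"FROM source s JOIN target t ON s.id = t.id "
            f"WHERE s.{src_col} <> t.{tgt_col}"
        ))
    # fold the per-column results into one DataFrame
    df_diffs = reduce(DataFrame.unionByName, results)
    df_diffs.show()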
0
votes
0 answers
14 views

SnappyData REST API to Submit Job

I am trying to submit a Snappy job using the REST API. We have been able to submit a SnappyData job using the snappy-job submit command-line tool. I could not find any documentation on how to do the same thing through ...
0
votes
1 answer
26 views

java.lang.ClassNotFoundException with spark-submit [duplicate]

I installed Spark 2.4.3 and sbt 1.2.8. I'm on Windows 10 Pro. java -version gives: java version "1.8.0_211" Java(TM) SE Runtime Environment (build 1.8.0_211-b12) Java HotSpot(TM) 64-Bit Server VM ...
0
votes
0 answers
21 views

Where is the sort ordering algorithm in spark source code?

I am doing a simple query: spark.sql("SELECT * FROM mytable ORDER BY age").collect() My question is: In Spark Source Code, where can I find the ordering comparator between rows, which executes this ...
-1
votes
1 answer
15 views

Does anybody know how to use the Approximate Nearest Neighbor Search provided by Spark MLlib?

I want to use the Approximate Nearest Neighbor Search provided by Spark MLlib (ref.), but I'm super lost because I didn't find an example or anything to guide me. The only info provided for the ...
0
votes
0 answers
15 views

Apache spark (2.4.2) with Hive MetaStore 3

I am trying to connect Spark 2.4 to Hive Metastore 3 to catalog ORC files on S3. Spark Configuration: sparkConf .set("spark.sql.catalogImplementation", "hive") ....
-2
votes
1 answer
17 views

What are memory, vcores and disks on the YARN scheduler page?

Can someone explain Used Resources, Min Resources and Max Resources in detail, with all the specific details about the memory units?
0
votes
3 answers
35 views

Compare sum of number of words in string columns with value in another column

I have a spark DataFrame consisting of 3 columns: text1, text2 and number. I want to filter this DataFrame based on the following constraint: (len(text1)+len(text2))>number where len returns the ...
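A hedged sketch (the toy rows are made up, and len is read as a word count per the title): size(split(...)) counts words, and the comparison can be done directly against the row's own number column.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("one two", "three", 2), ("a b c", "d e", 10)],
        ["text1", "text2", "number"],
    )
    # words in text1 + words in text2, compared row-wise with "number"
    word_count = F.size(F.split(F.col("text1"), r"\s+")) + F.size(F.split(F.col("text2"), r"\s+"))
    df.filter(word_count > F.col("number")).show()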
2
votes
0 answers
23 views

How to include both “latest” and “JSON with specific Offset” in “startingOffsets” while importing data from Kafka into Spark Structured Streaming

I have a streaming query saving data into a file sink. I am using .option("startingOffsets", "latest") and a checkpoint location. If there is any downtime on Spark, when the streaming query restarts ...
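A hedged sketch (broker, topic and offsets are made up): startingOffsets only applies to a brand-new query; once the checkpoint directory exists, offsets are resumed from the checkpoint and the option is ignored, so "latest" and a per-partition JSON cannot both take effect on a restart. The JSON form looks like this, with -2 meaning earliest and -1 meaning latest for a partition.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "host:9092")
          .option("subscribe", "mytopic")
          .option("startingOffsets", '{"mytopic":{"0":23,"1":-1}}')  # -2 = earliest, -1 = latest
          .load())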
0
votes
0 answers
7 views

Error opening block StreamChunkId : BlockNotFoundException

I am getting some transient exceptions when using Spark Streaming with Amazon Kinesis with storage level "MEMORY_AND_DISK_2". We are using Spark 2.2.0 with emr-5.9.0. 19/05/22 01:56:16 ERROR ...
-1
votes
1 answer
43 views

How to add a list's values into tuples

I tried to add the values of an existing list into a tuple. It shows no compiler error but throws a runtime error. I hardcoded some list values and tried to add the values into a tuple using Java ...
2
votes
2 answers
37 views

NoSuchMethodError? Code is OK in IDEA, but fails on the cluster

My code runs okay in IDEA (64-bit) on my Windows 7 machine, but when I package the code and run it on a YARN cluster, it throws an exception: java.lang.NoSuchMethodError: org.apache.commons.beanutils.PropertyUtilsBean....
-2
votes
0 answers
33 views

Is there any way to destroy a Spark RDD?

My code frequently creates RDDs due to frequent transformations of a huge data set, and it is running out of memory. Is there any way to destroy an existing RDD? I have a graph data set for which I am ...
0
votes
2 answers
34 views

How to increase the “memory total” that is displayed on the YARN UI?

I have a cluster on EMR (emr-5.20.0) with an m5.2xlarge as the master node, two m4.large as core nodes and three m4.large as worker nodes. The total RAM of this cluster is 62 GB, but in the YARN UI the ...
