Questions tagged [apache-spark]
Apache Spark is an open-source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets to users. Use cases for Apache Spark are often related to machine/deep learning and graph processing.
I use Scala 2.11.11 & Spark 2.3 and want to process a large RDD (~25 GB), but I run into trouble with:
// [example code]
val conf = new
val spark =
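For context, a typical setup for Spark 2.3 with Scala 2.11 looks roughly like the sketch below; the original snippet is truncated, so the app name here is a placeholder.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("large-rdd-job") // hypothetical app name
val spark = SparkSession.builder().config(conf).getOrCreate()
val sc = spark.sparkContext // entry point for building the RDD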
Is it possible in pyspark to create a dictionary within groupBy.agg()? Here is a toy example:
import pyspark
from pyspark.sql import Row
import pyspark.sql.functions as F

sc = pyspark.SparkContext()
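One way to get a map (a dictionary, on the Python side) out of an aggregation is to collect keys and values per group and zip them with map_from_arrays, available from Spark 2.4. A sketch in the Scala DataFrame API, with hypothetical column names k, a, and b; pyspark.sql.functions mirrors these calls one-to-one.
import org.apache.spark.sql.functions.{collect_list, map_from_arrays}

// Build one map per group by zipping collected keys with collected values.
val result = df.groupBy("k")
  .agg(map_from_arrays(collect_list("a"), collect_list("b")).as("dict"))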
I have an RDD which I am trying to write into a Hive table located in a remote cluster.
I have an input dataframe like this
When doing the repartitioning, Spark breaks the lazy-evaluation chain and triggers an error which I cannot control/catch.
// simulation of reading a stream from S3
def readFromS3(partition: Int)...
I am dealing with a computationally intensive package in R. This package has no alternative implementations that interface with a Spark cluster; however, it does have an optional argument to take in a ...
I am trying to populate a table structure using dataframes. Here a column of the table is initially represented by a dataframe, which would in turn be filled by a Scala stream backed by a regular ...
I am developing a Python package which will be deployed into a Databricks cluster. We often need references to the "spark" and "dbutils" objects within the Python code.
We can access these objects ...
Hello, I've unzipped Spark and exported the Spark path. When I launch it I get this error.
Traceback (most recent call last):
We have an issue with the order of execution of Spark transformations, which seems to be arbitrary.
We have 2 RDDs with related events that we classify. Multiple classifiers are applied to rdd1, but for ...
Consider for example
I'm trying to have spark work with Azure Blob Storage data. The way to pass credentials to it is as follows:
I have a file stored in HDFS, and I can see it using hdfs dfs -ls /user.
I can also read it using:
text_RDD = sc.textFile("hdfs://localhost/user/testfile1.txt")
But strangely when ...
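A common cause of trouble here is the URI missing the namenode port, or the default filesystem not matching what the hdfs shell used. A sketch of both forms; the localhost:8020 host and port are assumptions taken from a default single-node install.
// Fully qualified URI (namenode host and port are assumptions):
val explicit = sc.textFile("hdfs://localhost:8020/user/testfile1.txt")
// Or let the path resolve against fs.defaultFS from core-site.xml:
val viaDefaultFs = sc.textFile("/user/testfile1.txt")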
I'm trying to write a simple program to submit to a spark cluster, I need to download a json file stored in an http server and process it. The cluster is composed of the master node and 1 worker node, ...
I'm new to the Big Data environment.
I'm trying to understand: I have a few ETL scripts that I have created in Hive & Spark, and now I want to schedule them to run on a daily basis.
Can somebody tell me how to ...
I am working on a use-case of removing duplicate records from incoming structured data (in the form of CSV files within a folder on HDFS). In order to try this use-case, I wrote some sample code using ...
I'm planning on setting up a Spark cluster for some ETL jobs.
How many computing resources should I allocate to the cluster manager (I was thinking of using YARN as the cluster manager)?
I use AWS EMR to run some Java Spark applications.
Is it possible to install Spark version 1.2 as a bootstrap action for EMR releases > 5.0?
When I try to do this, I get an error that such a bootstrap ...
I am creating a temp view in Spark using the df.createOrReplaceTempView function. After creating the view, I am applying SQL to the last column to remove the carriage return.
Given below is some sample code...
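Since the actual sample code is not shown, here is a sketch of the idea using the DataFrame API's regexp_replace instead of a SQL statement; last_col is a hypothetical name standing in for the last column.
import org.apache.spark.sql.functions.{col, regexp_replace}

// Strip carriage returns from the (hypothetical) last column before exposing the view.
val cleaned = df.withColumn("last_col", regexp_replace(col("last_col"), "\r", ""))
cleaned.createOrReplaceTempView("my_view")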
We are building a sentiment analysis application, and we converted our tweets dataframe to an array. We created another array consisting of positive words. But we cannot count the number of tweets ...
I'm struggling to find a suitable example of reading an Avro file into a Spark Dataset using a case class. I would like to use a Spark Dataset rather than a DataFrame.
I would like to maintain my Avro schema in ...
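A minimal sketch of that pattern, assuming the external spark-avro data source and a case class whose fields match the Avro schema (the schema and path here are hypothetical):
case class User(name: String, age: Int) // hypothetical fields matching the Avro schema

import spark.implicits._
val ds = spark.read
  .format("com.databricks.spark.avro") // external package, needed on Spark <= 2.3
  .load("/path/users.avro")            // hypothetical path
  .as[User]                            // Dataset[User] instead of a DataFrame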
My MapReduce job is provided with hbase-site.xml when invoked from a shell script.
hadoop jar Map-reduce.jar \
org.test.hadoop.driver.MytestDriver -libjars /test/hbase-site.xml \
Hi guys, a simple question for the experienced.
I have a Spark job reading files under a path.
I want to use Structured Streaming even though the source is not really a stream, but just a folder with a ...
I need to remove zeros that are in the middle of a string, while keeping the ones at the end (in pyspark). So far I have only found regexes that remove leading or trailing zeros.
df1 = spark....
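One regex that does this uses lookarounds so it only matches runs of zeros sandwiched between non-zero digits, leaving leading and trailing runs intact. A sketch in the Scala API (pyspark's regexp_replace takes the same Java-regex pattern); the column name value is an assumption.
import org.apache.spark.sql.functions.{col, regexp_replace}

// Remove runs of zeros that have a non-zero digit on both sides.
val df2 = df1.withColumn("cleaned",
  regexp_replace(col("value"), "(?<=[1-9])0+(?=[1-9])", ""))
// "102030" -> "1230" (inner zeros dropped, trailing zero kept)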
I am using (well... trying to use) Azure Databricks, and I have created a notebook.
I would like the notebook to connect to my Azure Data Lake (Gen1) and transform the data. I followed the documentation ...
I have a spark dataframe dist with following schema:
| id1 | id2 | _c1 | _c2 | _c3 |
| int | int | bin | bin | bin |
I am working with a big DataFrame and I am trying to get the correlation between two columns. I used this code in Scala:
val corr = df.stat.corr("BEN","O_3")
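df.stat.corr returns the Pearson coefficient as a Double. If nulls in either column cause problems on a large DataFrame, one option (a sketch; the column names are taken from the question) is to drop them first and go through ml.stat.Correlation:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

// Assemble the two columns into a vector column, dropping rows with nulls first.
val assembled = new VectorAssembler()
  .setInputCols(Array("BEN", "O_3"))
  .setOutputCol("features")
  .transform(df.na.drop(Seq("BEN", "O_3")))

val Row(m: org.apache.spark.ml.linalg.Matrix) = Correlation.corr(assembled, "features").head
val corr = m(0, 1) // the off-diagonal entry is the pairwise correlation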
From a k-means model in org.apache.spark.ml.clustering.KMeans, the result is Array[org.apache.spark.ml.linalg.Vector].
I want to convert this into a dataframe ...
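Because ml's Vector has a registered UDT, wrapping each center in a Tuple1 is enough for createDataFrame to derive a schema. A sketch, assuming model is the fitted KMeansModel and spark is the active session:
import org.apache.spark.ml.linalg.Vector

val centers: Array[Vector] = model.clusterCenters
// Tuple1 gives createDataFrame a Product type it can reflect a schema from.
val centersDF = spark.createDataFrame(centers.map(Tuple1.apply)).toDF("center")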
I have the two DFs below.
NumberDF (created using a Hive load)
Logic to populate:
For Field1 I need to pick sch_id where CAT='PAY' and SUB_CAT='client'.
For Field2 I need to pick sch_id ...
In Spark SQL, there are two options for submitting SQL.
spark-sql: for each SQL statement, it kicks off a new Spark application.
Spark JDBC Server and Beeline: the JDBC Server is actually a long-running ...
I'm currently running the ANALYZE command for a particular table and can see the statistics being printed in the Spark console.
However, when I try to write the output to a DF, I cannot see the ...
Having a dataframe with the schema:
creation_date | invoice_date | creation_month | end_month | monthsdiff
Import error in IntelliJ.
Image: https://i.stack.imgur.com/MvT19.png
Spark 2.3 doesn't support Databricks Avro, so the steps to support Avro below Spark version 2.4 are ...
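For Spark <= 2.3 the usual route is the external Databricks package, since the built-in Avro source only arrived in Spark 2.4. A sketch; the package version shown is one known to work with Spark 2.x on Scala 2.11, and the paths are hypothetical.
// Launch with: spark-submit --packages com.databricks:spark-avro_2.11:4.0.0 ...
val df = spark.read
  .format("com.databricks.spark.avro")
  .load("/path/data.avro")   // hypothetical input path
df.write.format("com.databricks.spark.avro").save("/path/out") // hypothetical output path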
I have a DataFrame created from an RDD (using a CSV) by adding a schema using StructFields. Now I need to add an extra column in which the data should be the names of the existing columns in which we have the value '...
I am developing a Spark-Scala application, in which I am planning to use Template Design Pattern.
Here is the proposed design.
ProjectTemplate.scala => This is a trait containing functions such as ...
I created two RDDs, rdd1 and rdd2. rdd1 had 10 partitions; rdd2 had 5 partitions. Neither RDD has a partitioner specified. Then I executed the code below.
val joined = rdd1.join(rdd2)
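With no partitioner on either parent and spark.default.parallelism unset, the join falls back to the default partitioner, which hash-partitions using the larger parent's partition count. A small sketch to check:
val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b")), 10) // 10 partitions
val rdd2 = sc.parallelize(Seq((1, "x"), (3, "y")), 5)  // 5 partitions

val joined = rdd1.join(rdd2)
// Expect 10 here (max of the parents), unless spark.default.parallelism is set,
// in which case that value is used instead.
println(joined.getNumPartitions)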
I am trying to run two Spark actions as below, and I expect them to run in parallel, as they both use different pools. Does scheduling using pools mean that different independent actions will run ...
I'm able to submit a Spark job through spark-submit; however, when I try to do the same programmatically using SparkLauncher, it gives me nothing (I don't even see a Spark job in the UI).
Below is the ...
I have a dataframe like this:
|acct|conds |benes |amount |
I am new to PySpark and I've been pulling my hair out trying to accomplish something I believe is fairly simple. I am trying to do an ETL process where a CSV file is converted to a Parquet file. The ...
I want to add the graphframes library. Normally this library is added by (for example):
pyspark --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11
and then you should get something like:
val rdd = df.rdd.map(line => Row.fromSeq(
:: scala.xml.XML.loadString("<?xml version='1.0' ...
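The snippet is cut off, but the overall pattern it suggests, parsing an XML string per row and rebuilding a Row, looks roughly like the sketch below; the payload column name and the value element are assumptions.
import org.apache.spark.sql.Row

val rdd = df.rdd.map { row =>
  // Parse the XML held in a (hypothetical) string column named "payload".
  val xml = scala.xml.XML.loadString(row.getAs[String]("payload"))
  // Append one extracted field to the original row's values.
  Row.fromSeq(row.toSeq :+ (xml \\ "value").text)
}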
I have big data and I want to use MapReduce on it, but I can't find anything for this task. (Language: Scala)
The data for this process is:
The project I currently work with has some POJO files that are used by Spark as below:
JavaRDD<MyPojo> = ...
However, I also ...
I am submitting a Spark job to an EMR cluster and I want to see the Spark Web UI, which gives information about the configuration and status of the master node and also the worker nodes.
I understand that take(n) will return n elements of an RDD, but how does Spark decide which partitions to take those elements from and which elements should be chosen?
Does it maintain indexes ...
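take(n) does not maintain a global index; it scans partitions in order, starting with partition 0, and only launches follow-up jobs on later partitions (scaling up its estimate) if the first scan did not yield n elements. A small sketch:
val rdd = sc.parallelize(1 to 100, 4) // partition 0 holds 1..25

// Served entirely from partition 0, since it already has 5 elements.
println(rdd.take(5).mkString(",")) // 1,2,3,4,5

// Needs more than one partition, so Spark runs additional jobs on later ones.
println(rdd.take(30).length) // 30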
We are occasionally receiving this error in Amazon Elastic Map Reduce using Apache Spark:
19/03/21 16:25:39 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id ...
We ran into an issue with mapGroupsWithState stateful processing.
Requirement: ingest all the data, group by key, and apply a timeout for that state. Then, after the timeout expires, we want to ...
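A sketch of the processing-time-timeout pattern with mapGroupsWithState; Event and Agg are hypothetical case classes, eventStream is an assumed streaming Dataset[Event], and the 10-minute timeout is an assumption.
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(key: String, value: Long)                  // hypothetical input
case class Agg(key: String, total: Long, expired: Boolean)  // hypothetical state/output

def update(key: String, events: Iterator[Event], state: GroupState[Agg]): Agg = {
  if (state.hasTimedOut) {                        // invoked only after the timeout fires
    val finalAgg = state.get.copy(expired = true) // emit the closed state
    state.remove()                                // and drop it
    finalAgg
  } else {
    val total = state.getOption.map(_.total).getOrElse(0L) + events.map(_.value).sum
    val agg = Agg(key, total, expired = false)
    state.update(agg)
    state.setTimeoutDuration("10 minutes")        // reset on every batch with data
    agg
  }
}

import spark.implicits._
val updates = eventStream.groupByKey(_.key)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(update _)
Note that an expired timeout is only actually delivered when another micro-batch runs, which may be part of the behavior being observed.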
We have a large Apache Spark application running in Amazon EMR. I am trying to get rid of all the WARN messages in the logfile.
When our application starts, we make a ZIP file of the program's Python ...
We are running a large Spark application on Amazon Elastic MapReduce. I've been working hard to remove all WARN messages from the logfile. This is one of two remaining:
19/03/21 14:08:09 WARN Client: ...