Questions tagged [imputation]

Missing data imputation is the process of replacing missing data with substituted, 'best guess', values. Because missing data can create problems for analyzing data and can lead to missing-data bias, imputation is seen as a way to avoid the problems associated with listwise deletion (ignoring all observations with any missing values).
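
As a rough illustration of the contrast drawn above, the sketch below compares listwise deletion with simple mean imputation. It is a minimal example using only the Python standard library; the toy `rows` data and the helper names `listwise_delete` and `mean_impute` are assumptions made up for this illustration, and mean imputation is only one of many possible "best guess" strategies.

```python
from statistics import mean

# Hypothetical toy data: each row is (age, income); None marks a missing value.
rows = [
    (25, 50000),
    (32, None),
    (None, 61000),
    (41, 72000),
]


def listwise_delete(rows):
    """Drop every row that contains at least one missing value."""
    return [row for row in rows if None not in row]


def mean_impute(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    columns = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [
        tuple(col_means[j] if value is None else value for j, value in enumerate(row))
        for row in rows
    ]


complete_cases = listwise_delete(rows)  # keeps only fully observed rows
imputed = mean_impute(rows)             # keeps every row, filling the gaps
```

Listwise deletion discards half of this toy dataset, while mean imputation keeps every row at the cost of understating the variability of the filled-in columns.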

0
votes
0 answers
28 views

Pipeline Imputation fails on 1st pass

First, the sample data I am working on: df = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)), columns = ["col_{}".format(x) for x in range(10)], index = range(0, vsize * 3, 3)) df_2 = ...
0
votes
0 answers
7 views

Is it right to impute Train and Test set? [on hold]

So I'm experimenting with a dataset, and I had a couple of columns with high cardinality, so I had to perform mean target encoding (given that my dataset had more than 50000 observations), but ...
0
votes
0 answers
14 views

Extrapolate missing data for grouped by Pandas dataframe

I have a dataset with multiple countries and years. Each country-year combo is a row in the data. Across the columns, there are multiple variables. Some of them have the last few years of data missing ...
3
votes
1 answer
55 views

variable fillna() in each column

For starters, here is some artificial data fitting my problem: df = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)), columns = ["col_{}".format(x) for x in range(10)], ...
1
vote
1 answer
38 views

Pandas & fillna based on groups

I have an interesting problem, which I have fixed on a surface level, but I would like to enhance and improve my implementation. I have a DataFrame, which holds a dataset for later Machine Learning. ...
1
vote
1 answer
33 views

NA imputation based on group by and adjacent variable

Data Set df <- data.frame(ID = c(55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 66, 66, 66, 66, 66, 66, 66, 66, 66, 66), counter = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
0
votes
2 answers
31 views

How to use parallel computing for missRanger in imputation of missing values?

I am imputing missing values by missRanger and it takes too long as I have 1000 variables. I tried to use parallel computing, but it does not make the process faster. Here is the code library(...
-1
votes
1 answer
43 views

Replacing NA values with a specific average

I have a data.frame with columns and rows. How could I replace NA values so that each would be the average of the first value before and after that cell in that column? For example: 1. 1 2 3 2. 4 ...
0
votes
2 answers
29 views

Descriptive data with mice/miceadds

I have used mice/miceadds to carry out multiple imputation. I am interested in getting a number of descriptive stats on a "pooled dataset". Question: 1) I want to know the % of values that are above a ...
-2
votes
1 answer
31 views

Missing value imputation in Python

By doing df.groupby('acc_count', as_index=False)['avg_spd'].median() I got acc_count avg_spd 0 20.94 1 24.42 2 26.035 3 ...
0
votes
0 answers
36 views

How to install knnimpute package on anaconda

I want to install the knnimpute package using Anaconda. I tried to run conda install -c kgullikson knnimpute and the output is: Solving environment: failed UnsatisfiableError: The following ...
0
votes
0 answers
17 views

Restarting R After Imputation

I am using the MICE package to do a multiple imputation on my data. I have 7 variables so the imputation took about 5 hours on my little laptop. This morning, I went back to work but needed to restart ...
1
vote
1 answer
28 views

How to impute means into specific observations in a column?

I have an assignment at the moment including a table of data that includes information about observations of species of animals being measured on different occasions. In the 'weight' column of my data ...
0
votes
0 answers
24 views

Bagged tree imputation method not working for big data?

I have a huge dataset with 1000 variables and 200 subjects. Before starting to run the machine learning algorithm, I have to impute missing values. I use the bagged tree imputation method. The problem is ...
3
votes
1 answer
37 views

Inserting missing rows with imputed values in Python

Problem: How can you insert rows for missing YEARS, with imputed annual SALES? Progress: The following code computes the sales differences. However, it is for one year, using the explicit iloc ...
0
votes
1 answer
41 views

Impute different types of variables with MICE

I am trying to perform imputation on a dataset which has 69 columns and over 50000 rows. My dataset has different types of variables: columns that only present binary variables (0,1) categorical ...
1
vote
0 answers
45 views

How to impute data with exclusive binary variables in R?

I have a dataset with 69 columns and over 50000 rows which is structured like this: Some of the columns can only take 0 or 1 values (binary), for example:'isFemale', 'isChild', etc. Some other ...
0
votes
1 answer
28 views

Steps to perform correct data analysis

I have a dataset with 69 columns and 50000 rows. My dataset only contains binary variables and numerical variables. Moreover, some of the binary variables have some missing values (about 5%). I know ...
1
vote
0 answers
35 views

R: Ordered logistic regression with multiple imputation data (amelia package)

I am analyzing data from European Social Survey. Due to quite a bit of missing data I have used the amelia package for imputation. The dependent value is ordinal with 4 values, and I had therefore ...
0
votes
1 answer
20 views

Replacing Nulls for one Variable based off another

I have a dataset consisting of measured variables and categorical variables based on these measurements, i.e. X1 is a measured variable and Y1 will be either 0 or 1 based on the measurement in X1. ...
0
votes
1 answer
70 views

How do I correctly impute these NaN values with modes of another column?

I am learning how to handle missing values in a dataset. I have a table with ~1 million entries. I'm trying to deal with a small number of missing values. My data concerns a bicycle-share system and ...
1
vote
1 answer
54 views

How to replace missing values with group mode in Pandas?

I follow the method in this post to replace missing values with the group mode, but encounter the "IndexError: index out of bounds". df['SIC'] = df.groupby('CIK').SIC.apply(lambda x: x.fillna(x....
2
votes
2 answers
89 views

How to deal with NaN values where imputation doesn't make sense? (for PCA)

I am having a hard time figuring out how to deal with NaN values where data imputation doesn't make sense. I am trying to do text/document clustering and there are some missing values that need to ...
1
vote
1 answer
184 views

Grid search CV with sklearn own estimator in python

I am trying to build my own estimator (regressor) and use it for imputation (KnnImputation). I'm having a problem using the grid search "GridSearchCV". Any ideas what is the problem? My Code: class ...
0
votes
0 answers
25 views

Problem With Imputing DataFrame's Columns With MissForest

So I'm trying to impute the columns of a DataFrame, but I get this error. (This is an imputation for one specific column.) from missingpy import MissForest imputer = MissForest() ...
2
votes
2 answers
89 views

Impute missing data with mean by group

I have a categorical variable with three levels (A, B, and C). I also have a continuous variable with some missing values on it. I would like to replace the NA values with the mean of its group. ...
0
votes
1 answer
28 views

grid search on own estimator with a continuous target in python

I wrote a KNN imputation implementation and I wanted the StratifiedKFold to check what K and what distance matrix to use. I got an error: it seems it doesn't recognize my estimator as a regressor (...
0
votes
0 answers
21 views

Why do I get extremely large ranges using the imputeData() function in mclust package in R?

original.dat <- read_sav(file ="N219.sav") View(original.dat) imp.dat <- imputeData(original.dat[,-(129:165)]) View(imp.dat) I still get an imputed dataset, it's just that the ranges for an ...
0
votes
2 answers
41 views

Imputation of Missing Values by Categorical Mean?

I have a dataset with several columns, one of which is missing chunks of data that is needed. The column with missing data, df$Variable, is always attributed to a specific person, df$Name. Is there a ...
1
vote
0 answers
20 views

knn imputation of all missing values

I have a large data set with a lot of missing values. Some variables up to 30%. Deletion is not an option. What would be the best way to impute? For KNN, when I run df_KNN = pd.DataFrame(data=KNN(k=...
0
votes
0 answers
24 views

Using Pipeline to Avoid Data Leakage both for X and y

I followed the example at this link very closely: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html but used a different dataset (found here: https://...
1
vote
1 answer
61 views

Multiple imputation in R (mice) - How do I test imputation runs?

I work with a data set of 171 observations of 55 variables with 35 variables having NA's that I want to impute with the mice function: imp_Data <- mice(Data,m=5,maxit=50,meth='pmm',seed=500) ...
0
votes
0 answers
36 views

Import Imputed Data from SAS to R

I am working on a project where I get the imputed data from a colleague who uses SAS and I want to analyze it in R. The problem is that I import it into R as a dataframe using: final<-read....
2
votes
1 answer
36 views

How to write a function that imputes missing numeric and character values?

I have the following sample data: ID GLUC TGL HDL LDL HRT MAMM SMOKE A 88 NA 32 99 Y NA never B NA 150 60 NA NA no never C 110 NA NA 120 N NA NA D NA 200 65 165 ...
0
votes
1 answer
49 views

Fill missing value with mean of another variable based on categories in R [duplicate]

I want to replace NA values in val2 in each row with the mean of val corresponding to that ID column. Any easy (tidyverse) way to do this? Also, I want to know how to replace it by mean(na.rm=TRUE) ...
1
vote
2 answers
104 views

Generate larger synthetic dataset based on a smaller dataset in Python

I have a dataset with 21000 rows (data samples) and 102 columns (features). I would like to have a larger synthetic dataset generated based on the current dataset, say with 100000 rows, so I can use ...
0
votes
0 answers
36 views

Replace Impute Resample values in a dataframe column(s) on a condition in Python

I have time series sensor data. Each column (SENSOR1, SENSOR2, …) has sensor reading values and also ‘Not in Service’ as 'Serv', ‘Fail’, ‘Config’. These need to be replaced with something meaningful. ...
0
votes
1 answer
23 views

per group assign to every start time the latest end time and transport mode that belongs to the highest ID in R

I have a data manipulation problem for which I can solve both imputations individually but not both simultaneously. I have a dataset of tracks which is grouped by ID (different persons), each track has ...
0
votes
1 answer
42 views

Imputing missing observation

I am analysing a dataset with over 450k rows; about 100k rows in one of the columns I am looking at (pa1min_) have NA values, due to non-responses and other random factors. This column deals with ...
0
votes
1 answer
13 views

R language Amelia specify prefix of output files

This R statement uses the Amelia package to create output data files containing imputed data: ds.im <- amelia(ds, m=5, p2s=2) The names of the 5 output files are: output1.csv to output5.csv. In ...
0
votes
1 answer
63 views

Pandas, replace NaNs with values from MultiIndex DataFrame

Problem: I have a dataframe with some NaNs that I am trying to fill intelligently based on values from another dataframe. I have not found an efficient way to do this but I suspect there is a way ...
1
vote
2 answers
57 views

Impute missing values in timeseries via bsts

I work with a minutely timeseries with about 20% missing data (in gaps of varying lengths). AFAIK Bayesian methods can handle missing data elegantly and I would like to try to fit a Bayesian timeseries model ...
0
votes
0 answers
132 views

NA not permitted in predictors. missForest

I am using missForest in order to impute missing data. I have the data as a data frame and when I put it into the missForest function I get the error: Error in randomForest.default(x = obsX, y = ...
0
votes
0 answers
36 views

SAS proc mianalyze EDF

We have a cluster randomized trial with a small number of clusters. The primary endpoint is measured at follow-up and we have missing data. We proposed to conduct a linear mixed model including ...
3
votes
2 answers
56 views

Forward fill column with an index-based limit

I want to forward fill a column and I want to specify a limit, but I want the limit to be based on the index---not a simple number of rows like limit allows. For example, say I have the dataframe ...
-1
votes
1 answer
34 views

Fine and Gray model in R with imputed datasets

I have a long (vertically stacked) dataset containing 10 imputations (variable "imputation" identifies imputation number). The imputation was done in SAS but I would like to calculate some c-...
0
votes
1 answer
188 views

Multiple imputation in R using “missForest” on categorical variables

I have a survey dataset with NAs in several columns. Therefore, I decided to perform multiple imputation using the "missForest" package to impute the missing values. This was not a problem; however, I ...
-1
votes
1 answer
46 views

Using imputation models created from amelia or mice in R for new data

Suppose I run one of the missing variable imputation R packages, amelia or mice (or similar), on a large data frame -- let's say 100000 rows and 50 columns -- to get imputations for one particular ...
1
vote
1 answer
179 views

Is there an R function that performs LASSO regression on multiple imputed datasets and pools results together?

I have a dataset with 283 observations of 60 variables. My outcome variable is dichotomous (Diagnosis) and can be either of two diseases. I am comparing two types of diseases that often show much ...
2
votes
0 answers
295 views

How to fit and combine submodels into a single stanfit object?

I would appreciate any help to do this: for each P fit the model for each column of weights. do the step 1 for all observations in the dataset to get p * w submodels, where p is the number of ...
