# Questions tagged [imputation]

Missing data imputation is the process of replacing missing data with substituted 'best guess' values. Because missing data can complicate analysis and introduce missing-data bias, imputation is seen as a way to avoid the problems associated with listwise deletion (dropping every observation that has any missing value).
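As a rough illustration of the contrast described above, the following sketch compares listwise deletion with simple mean imputation in pandas. The column names and values are made up for the example, and mean imputation is only one of many possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the four rows contain a missing value.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50.0, 62.0, np.nan, 48.0],
})

# Listwise deletion: drop every row that has at least one missing value.
deleted = df.dropna()

# Mean imputation: replace each missing value with its column mean.
imputed = df.fillna(df.mean())

print(len(deleted), len(imputed))
```

Listwise deletion keeps only the complete rows, while imputation keeps every row at the cost of introducing estimated values.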

**0**

votes

**0** answers

28 views

### Pipeline imputation fails on 1st pass

First, the sample data I am working on:
df = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)), columns =
["col_{}".format(x) for x in range(10)], index = range(0, vsize * 3, 3))
df_2 = ...

**0**

votes

**0** answers

7 views

### Is it right to impute Train and Test set? [on hold]

So I'm experimenting with a dataset, and I had a couple of columns with high cardinality, so I had to perform mean target encoding (given that my dataset had more than 50000 observations), but ...

**0**

votes

**0** answers

14 views

### Extrapolate missing data for grouped by Pandas dataframe

I have a dataset with multiple countries and years. Each country-year combo is a row in the data. Across the columns, there are multiple variables. Some of them have the last few years of data missing ...

**3**

votes

**1** answer

55 views

### variable fillna() in each column

For starters, here is some artificial data fitting my problem:
df = pd.DataFrame(np.random.randint(0, 100, size=(vsize, 10)),
columns = ["col_{}".format(x) for x in range(10)],
...

**1**

vote

**1** answer

38 views

### Pandas & fillna based on groups

I have an interesting problem, which I have fixed on a surface level, but I would like to enhance and improve my implementation.
I have a DataFrame, which holds a dataset for later Machine Learning. ...

**1**

vote

**1** answer

33 views

### NA imputation based on group by and adjacent variable

Data Set
df <- data.frame(ID = c(55, 55, 55, 55, 55, 55, 55, 55, 55, 55,
66, 66, 66, 66, 66, 66, 66, 66, 66, 66),
counter = c(0, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...

**0**

votes

**2** answers

31 views

### How to use parallel computing for missRanger in imputation of missing values?

I am imputing missing values by missRanger and it takes too long as I have 1000 variables. I tried to use parallel computing, but it does not make the process faster. Here is the code
library(...

**-1**

votes

**1** answer

43 views

### replacing NA values with specific average

I have a data.frame with columns and rows. How could I replace NA values so that each becomes the average of the first values before and after that cell in its column?
for example:
1. 1 2 3
2. 4 ...

**0**

votes

**2** answers

29 views

### Descriptive data with mice/miceadds

I have used mice/miceadds to carry out multiple imputation. I am interested in getting a number of descriptive stats on a "pooled dataset"
Question:
1) I want to know the % of values that are above a ...

**-2**

votes

**1** answer

31 views

### Missing value imputation in Python

By doing df.groupby('acc_count', as_index=False)['avg_spd'].median()
I got
acc_count avg_spd
0 20.94
1 24.42
2 26.035
3 ...

**0**

votes

**0** answers

36 views

### How to install knnimpute package on anaconda

I want to install the knnimpute package using Anaconda.
I tried to run
conda install -c kgullikson knnimpute
and the output is:
Solving environment: failed
UnsatisfiableError: The following ...

**0**

votes

**0** answers

17 views

### Restarting R After Imputation

I am using the MICE package to do a multiple imputation on my data. I have 7 variables so the imputation took about 5 hours on my little laptop. This morning, I went back to work but needed to restart ...

**1**

vote

**1** answer

28 views

### How to impute means into specific observations in a column?

I have an assignment at the moment including a table of data that includes information about observations of species of animals being measured on different occasions. In the 'weight' column of my data ...

**0**

votes

**0** answers

24 views

### Bagged tree imputation method not working for big data?

I have a huge dataset with 1000 variables and 200 subjects. Before starting to run the machine learning algorithm, I have to impute missing values. I use bagged tree imputation method. The problem is ...

**3**

votes

**1** answer

37 views

### Inserting missing rows with imputed values in Python

Problem
How can you insert rows for missing YEARS, with imputed annual SALES?
Progress
The following code computes the sales differences. However, it is for one year, using the explicit iloc ...

**0**

votes

**1** answer

41 views

### Impute different types of variables with MICE

I am trying to perform imputation on a dataset which has 69 columns and over 50000 rows. My dataset has different types of variables:
columns that only present binary variables (0,1)
categorical ...

**1**

vote

**0** answers

45 views

### How to impute data with exclusive binary variables in R?

I have a dataset with 69 columns and over 50000 rows which is structured like this:
Some of the columns can only take 0 or 1 values (binary), for example:'isFemale', 'isChild', etc.
Some other ...

**0**

votes

**1** answer

28 views

### Steps to perform correct data analysis

I have a dataset with 69 columns and 50000 rows.
My dataset only contains binary variables and numerical variables. Moreover, some of the binary variables have some missing values (about 5%).
I know ...

**1**

vote

**0** answers

35 views

### R: Ordered logistic regression with multiple imputation data (amelia package)

I am analyzing data from European Social Survey. Due to quite a bit of missing data I have used the amelia package for imputation. The dependent value is ordinal with 4 values, and I had therefore ...

**0**

votes

**1** answer

20 views

### Replacing Nulls for one Variable based off another

I have a dataset consisting of measured variables and categorical variables based off these measurements, i.e., X1 is a measured variable and Y1 will either be 0 or 1 based off the measurement in X1.
...

**0**

votes

**1** answer

70 views

### How do I correctly impute these NaN values with modes of another column?

I am learning how to handle missing values in a dataset. I have a table with ~1 million entries. I'm trying to deal with a small number of missing values.
My data concerns a bicycle-share system and ...

**1**

vote

**1** answer

54 views

### How to replace missing values with group mode in Pandas?

I follow the method in this post to replace missing values with the group mode, but encounter the "IndexError: index out of bounds".
df['SIC'] = df.groupby('CIK').SIC.apply(lambda x: x.fillna(x....
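One common cause of an out-of-bounds error in this pattern is a group whose values are all missing, so that its mode is empty. The sketch below guards against that case. It is an illustration under that assumption, not the asker's actual code, and the helper name `fill_with_group_mode` is hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_group_mode(df, group_col, value_col):
    """Fill NaNs in value_col with the mode of each group in group_col."""
    def _fill(s):
        modes = s.mode()  # NaNs are dropped; empty if the whole group is missing
        # Leave the group untouched when there is no mode to fill with.
        return s.fillna(modes.iloc[0]) if not modes.empty else s

    out = df.copy()
    out[value_col] = out.groupby(group_col)[value_col].transform(_fill)
    return out

# Toy data: group 2 has no observed SIC value at all.
example = pd.DataFrame({
    "CIK": [1, 1, 1, 2, 2],
    "SIC": [10.0, 10.0, np.nan, np.nan, np.nan],
})
filled = fill_with_group_mode(example, "CIK", "SIC")
print(filled)
```

Groups without any observed value stay missing and would need a separate fallback, such as a global mode.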

**2**

votes

**2** answers

89 views

### How to deal with NaN values where imputation doesn't make sense? (for PCA)

I am having a hard time figuring out how to deal with NaN variables where data imputation doesn't make sense. I am trying to do text/document clustering and there are some missing values that need to ...

**1**

vote

**1** answer

184 views

### Grid search CV with sklearn own estimator in python

I am trying to build my own estimator (regressor) and use it for imputation (KnnImputation). I'm having a problem using the grid search "GridSearchCV".
Any ideas what is the problem?
My Code:
class ...

**0**

votes

**0** answers

25 views

### Problem With Imputing DataFrame's Columns With MissForest

So I'm trying to impute the columns of a DataFrame, but I get this error.
(This is an imputation for one specific column.)
from missingpy import MissForest
imputer = MissForest()
...

**2**

votes

**2** answers

89 views

### Impute missing data with mean by group

I have a categorical variable with three levels (A, B, and C).
I also have a continuous variable with some missing values on it.
I would like to replace the NA values with the mean of its group. ...
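A minimal pandas sketch of group-mean imputation, assuming a categorical column named `group` and a continuous column named `value` (both names and the data are hypothetical), could look like this:

```python
import numpy as np
import pandas as pd

# Toy data: one categorical variable with levels A, B, C and a
# continuous variable with some missing values.
df_groups = pd.DataFrame({
    "group": ["A", "A", "B", "B", "C", "C"],
    "value": [1.0, np.nan, 10.0, 20.0, np.nan, 5.0],
})

# Per-row mean of the row's own group (NaNs are ignored by "mean").
group_means = df_groups.groupby("group")["value"].transform("mean")

# Fill only the missing entries with their group mean.
df_groups["value_imputed"] = df_groups["value"].fillna(group_means)
print(df_groups)
```

Because `transform` returns a result aligned with the original rows, the fill touches only the missing entries.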

**0**

votes

**1** answer

28 views

### grid search on own estimator with a continuous target in python

I wrote a KNN imputation implementation and I wanted the StratifiedKFold to check what K and what distance matrix to use.
I got an error: it seems it doesn't recognize my estimator as a regressor (...

**0**

votes

**0** answers

21 views

### Why do I get extremely large ranges using the imputeData() function in mclust package in R?

original.dat <- read_sav(file ="N219.sav")
View(original.dat)
imp.dat <- imputeData(original.dat[,-(129:165)])
View(imp.dat)
I still get an imputed dataset, it's just that the ranges for an ...

**0**

votes

**2** answers

41 views

### Imputation of Missing Values by Categorical Mean?

I have a dataset with several columns, one of which is missing chunks of data that are needed.
The column with missing data, df$Variable, is always attributed to a specific person, df$Name. Is there a ...

**1**

vote

**0** answers

20 views

### knn imputation of all missing values

I have a large data set with a lot of missing values. Some variables up to 30%. Deletion is not an option. What would be the best way to impute?
For KNN, when I run
df_KNN = pd.DataFrame(data=KNN(k=...

**0**

votes

**0** answers

24 views

### Using Pipeline to Avoid Data Leakage both for X and y

I followed the example at this link very closely: https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
but used a different dataset (found here: https://...

**1**

vote

**1** answer

61 views

### Multiple imputation in R (mice) - How do I test imputation runs?

I work with a data set of 171 observations of 55 variables with 35 variables having NA's that I want to impute with the mice function:
imp_Data <- mice(Data,m=5,maxit=50,meth='pmm',seed=500)
...

**0**

votes

**0** answers

36 views

### Import Imputed Data from SAS to R

I am working on a project where I get the imputed data from a colleague who uses SAS and I want to analyze it in R. The problem is that I import it into R as a dataframe using:
final<-read....

**2**

votes

**1** answer

36 views

### How to write a function that imputes missing numeric and character values?

I have the following sample data:
ID GLUC TGL HDL LDL HRT MAMM SMOKE
A 88 NA 32 99 Y NA never
B NA 150 60 NA NA no never
C 110 NA NA 120 N NA NA
D NA 200 65 165 ...

**0**

votes

**1** answer

49 views

### Fill missing value with mean of another variable based on categories in R [duplicate]

I want to replace NA values in val2 in each row with the mean of val corresponding to that ID column. Any easy (tidyverse) way to do this?
Also, I want to know how to replace it by mean(na.rm=TRUE) ...

**1**

vote

**2** answers

104 views

### Generate larger synthetic dataset based on a smaller dataset in Python

I have a dataset with 21000 rows (data samples) and 102 columns (features). I would like to have a larger synthetic dataset generated based on the current dataset, say with 100000 rows, so I can use ...

**0**

votes

**0** answers

36 views

### Replace Impute Resample values in a dataframe column(s) on a condition in Python

I have time series sensor data. Each column (SENSOR1, SENSOR2, …) holds sensor readings and also the markers 'Not in Service' (as 'Serv'), 'Fail', and 'Config'. These need to be replaced with something meaningful.
...

**0**

votes

**1** answer

23 views

### per group assign to every start time the latest end time and transport mode that belongs to the highest ID in R

I have a data manipulation problem for which I can solve both imputations individually but not both simultaneously. I have a dataset of tracks which is grouped by ID (different persons), each track has ...

**0**

votes

**1** answer

42 views

### Imputing missing observation

I am analysing a dataset with over 450k rows. About 100k rows in one of the columns I am looking at (pa1min_) have NA values, due to non-responses and other random factors. This column deals with ...

**0**

votes

**1** answer

13 views

### R language Amelia specify prefix of output files

This R statement uses the Amelia package to create output data files containing imputed data:
ds.im <- amelia(ds, m=5, p2s=2)
The names of the 5 output files are: output1.csv to output5.csv
In ...

**0**

votes

**1** answer

63 views

### Pandas, replace NaNs with values from MultiIndex DataFrame

Problem
I have a dataframe with some NaNs that I am trying to fill intelligently based off values from another dataframe. I have not found an efficient way to do this but I suspect there is a way ...

**1**

vote

**2** answers

57 views

### Impute missing values in timeseries via bsts

I work with a minutely timeseries with about 20% missing data (in varying lengths).
AFAIK bayesian methods can handle missing data elegantly and I would like to try to fit a bayesian timeseries model ...

**0**

votes

**0** answers

132 views

### NA not permitted in predictors. missForest

I am using missForest in order to impute missing data. I have the data as a data frame and when I put it into the missForest function I get the error:
Error in randomForest.default(x = obsX, y = ...

**0**

votes

**0** answers

36 views

### SAS proc mianalyze EDF

We have a cluster randomized trial with a small number of clusters. The primary endpoint is measured at follow-up and we have missing data. We proposed to conduct a linear mixed model including ...

**3**

votes

**2** answers

56 views

### Forward fill column with an index-based limit

I want to forward fill a column and I want to specify a limit, but I want the limit to be based on the index, not a simple number of rows like limit allows.
For example, say I have the dataframe ...
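One way to sketch an index-based limit, assuming a unique, sorted numeric index and a hypothetical helper named `ffill_within`, is to forward fill normally and then blank out any filled value whose distance from the last valid index label exceeds the allowed gap:

```python
import numpy as np
import pandas as pd

def ffill_within(s, max_gap):
    """Forward fill s, but only up to max_gap index units past the last valid value."""
    idx = s.index.to_series()
    # Index label of the most recent non-missing observation for each row.
    last_valid = idx.where(s.notna()).ffill()
    gap = idx - last_valid
    # Keep filled values only where the index distance is within the limit.
    return s.ffill().where(gap <= max_gap)

# Toy series with unevenly spaced index labels.
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan], index=[0, 1, 5, 6, 20])
result = ffill_within(s, max_gap=2)
print(result)
```

Here the value at index 1 is filled because it lies within 2 units of index 0, while the values at indices 5 and 20 stay missing.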

**-1**

votes

**1** answer

34 views

### Fine and Gray model in R with imputed datasets

I have a long (vertically stacked) dataset containing 10 imputations (variable "imputation" identifies imputation number). The imputation was done in SAS but I would like to calculate some c-...

**0**

votes

**1** answer

188 views

### Multiple imputation in R using "missForest" on categorical variables

I have a survey dataset with NAs in several columns. Therefore, I decided to perform multiple imputation using the "missForest" package to impute the missing values. This was not a problem, however I ...

**-1**

votes

**1** answer

46 views

### Using imputation models created from amelia or mice in R for new data

Suppose I run one of the missing variable imputation R packages, amelia or mice (or similar), on a large data frame -- let's say 100000 rows and 50 columns -- to get imputations for one particular ...

**1**

vote

**1** answer

179 views

### Is there an R function that performs LASSO regression on multiple imputed datasets and pools results together?

I have a dataset with 283 observations of 60 variables. My outcome variable is dichotomous (Diagnosis) and can be either of two diseases. I am comparing two types of diseases that often show much ...

**2**

votes

**0** answers

295 views

### How to fit and combine submodels into a single stanfit object?

I would appreciate any help to do this:
for each P fit the model for each column of weights.
do the step 1 for all observations in the dataset to get p * w submodels, where p is the number of ...