插补缺失数据,同时强制相关系数保持不变

Mangs

29人浏览 · 2022-09-07 19:29:04

Mangs · 2022-09-07 19:29:04 发布

问题:插补缺失数据,同时强制相关系数保持不变

考虑以下(excel)数据集:

m   |   r
----|------
2.0 | 3.3
0.8 |   
    | 4.0
1.3 |   
2.1 | 5.2
    | 2.3
    | 1.9
2.5 | 
1.2 | 3.0
2.0 | 2.6

我的目标是使用以下条件填写缺失值:

将上述两列之间的成对相关性表示为 R(约 0.68)。将相关性 ** 表示为 R* ** 后** 已填充空单元格。填写表格,使 (R - R*)^2 u003d 0。也就是说,我想保持数据的相关结构不变。

到目前为止,我已经使用 Matlab 完成了它:

clear all;

m = xlsread('data.xlsx','A2:A11') ;
r = xlsread('data.xlsx','B2:B11') ;

rho = corr(m,r,'rows','pairwise');

x0 = [1,1,1,1,1,1];
lb = [0,0,0,0,0,0];
f = @(x)my_correl(x,rho);

SOL = fmincon(f,x0,[],[],[],[],lb)

其中函数my_correl是:

function X = my_correl(x,rho)

sum_m = (11.9 + x(1) + x(2) + x(3));
sum_r = (22.3 + x(1) + x(2) + x(3));
avg_m = (11.9 + x(1) + x(2) + x(3))/8;
avg_r = (22.3 + x(4) + x(5) + x(6))/8;
rho_num = 8*(26.32 + 4*x(1) + 2.3*x(2) + 1.9*x(3) + 0.8*x(4) + 1.3*x(5) + 2.5*x(6)) - sum_m*sum_r;
rho_den = sqrt(8*(22.43 + (4*x(1))^2 + (2.3*x(2))^2 + (1.9*x(3))^2) - sum_m^2)*sqrt(8*(78.6 + (0.8*x(4))^2 + (1.3*x(5))^ + (2.5*x(6))^2) - sum_r^2);

X = (rho - rho_num/rho_den)^2;

end

此函数手动计算相关性,其中每个缺失数据都是一个变量x(i)。

问题:我的实际数据集有超过 20,000 个观察值。 我无法手动创建该 rho 公式。

如何填写我的数据集?

注意 1:我愿意使用 Python、Julia 或 R 等替代语言。Matlab 这只是我的默认语言。

注 2:答题者将获得 100 分奖励。承诺从现在开始。

解答

这就是我将如何处理它,并提供了 R 中的实现:

对缺失数据点的插补没有唯一的解决方案,使得完整(插补)数据的成对相关性等于不完整数据的成对相关性。因此,为了找到一个“好的”解决方案而不仅仅是“任何”解决方案,我们可以引入一个额外的标准,即完整的估算数据也应该与原始数据共享相同的线性回归。这使我们采用了一种相当简单的方法。

计算原始数据的线性回归模型。
找到恰好位于这条回归线上的缺失值的估算值。
为这条回归线周围的估算值生成随机散布的残差
缩放插补残差以强制完整插补数据的相关性等于原始数据的相关性

R中这样的解决方案:

library(data.table)
set.seed(123)

rho = cor(dt$m,dt$r,'pairwise')

# calculate linear regression of original data
fit1 = lm(r ~ m, data=dt)
fit2 = lm(m ~ r, data=dt)
# extract the standard errors of regression intercept (in each m & r direction)
# and multiply s.e. by sqrt(n) to get standard deviation 
sd1 = summary(fit1)$coefficients[1,2] * sqrt(dt[!is.na(r), .N])
sd2 = summary(fit2)$coefficients[1,2] * sqrt(dt[!is.na(m), .N])

# find where data points with missing values lie on the regression line
dt[is.na(r), r.imp := coefficients(fit1)[1] + coefficients(fit1)[2] * m] 
dt[is.na(m), m.imp := coefficients(fit2)[1] + coefficients(fit2)[2] * r]

# generate randomised residuals for the missing data, using the s.d. calculated above
dt[is.na(r), r.ran := rnorm(.N, sd=sd1)] 
dt[is.na(m), m.ran := rnorm(.N, sd=sd2)] 

# function that scales the residuals by a factor x, then calculates how close correlation of imputed data is to that of original data
obj = function(x, dt, rho) {
  dt[, r.comp := r][, m.comp := m]
  dt[is.na(r), r.comp := r.imp + r.ran*x] 
  dt[is.na(m), m.comp := m.imp + m.ran*x] 
  rho2 = cor(dt$m.comp, dt$r.comp,'pairwise')
  (rho-rho2)^2
}

# find the value of x that minimises the discrepencay of imputed versus original correlation
fit = optimize(obj, c(-5,5), dt, rho)

x=fit$minimum
dt[, r.comp := r][, m.comp := m]
dt[is.na(r), r.comp := r.imp + r.ran*x] 
dt[is.na(m), m.comp := m.imp + m.ran*x] 
rho2 = cor(dt$m.comp, dt$r.comp,'pairwise')
(rho-rho2)^2  # check that rho2 is approximately equal to rho

作为最后的检查,计算完整估算数据的线性回归并绘制图表以显示回归线与原始数据相同。请注意,下图是针对如下所示的大型数据集,以演示在大型数据上使用此方法。

fit.comp = lm(r.comp ~ m.comp, data=dt)
plot(dt$m.comp, dt$r.comp)
points(dt$m, dt$r, col="red")
abline(fit1, col="green")
abline(fit.comp, col="blue")
mtext(paste(" Rho =", round(rho,5)), at=-1)
mtext(paste(" Rho2 =", round(rho2, 5)), at=6)

在此处输入图像描述

数据

来自 OP 示例的原始玩具数据:

dt=structure(list(m = c(2, 0.8, NA, 1.3, 2.1, NA, NA, 2.5, 1.2, 2), 
                  r = c(3.3, NA, 4, NA, 5.2, 2.3, 1.9, NA, 3, 2.6)), 
             .Names = c("m", "r"), row.names = c(NA, -10L), 
             class = c("data.table", "data.frame"))

一个更大的数据集来展示大数据

dt = data.table(m=rnorm(1e5, 3, 2))[, r:=1.5 + 1.1*m + rnorm(1e5,0,2)]
dt[sample(.N, 3e4), m:=NA]
dt[sample(which(!is.na(m)), 3e4), r := NA]

Python

Python社区为您提供最前沿的新闻资讯和知识内容

更多推荐

求助！为什么用InsCode部署会出现无限重定向？

Python

如何重塑熊猫。系列

问题:如何重塑熊猫。系列在我看来,它就像 pandas.Series 中的一个错误。 a = pd.Series([1,2,3,4]) b = a.reshape(2,2) b b 有类型 Series 但无法显示,最后一条语句给出异常,非常冗长,最后一行是“TypeError: %d format: a number is required, not numpy.ndarray”。 b.sha

Python

在哪里可以找到有关 Keras 中默认权重初始化器的文档? [复制]

问题:在哪里可以找到有关 Keras 中默认权重初始化器的文档? [复制] 我刚刚在这里](https://keras.io/initializers/)中阅读了有关[中的 Keras 权重初始化器的信息。在文档中,只介绍了不同的初始化程序。如: model.add(Dense(64, kernel_initializer='random_normal')) 当我没有指定kernel_initia