The Ethics of Data Collection

So you’re ready to collect some data and start modeling, but how can you be sure that your data has been ethically sourced?

CW: I talk about mental health and suicide prevention in a section below.

The Current Data Protection Landscape

The Health Insurance Portability and Accountability Act, known as HIPAA, was passed in 1996 to protect the sensitive, identifying personal health data that arises from medical treatment. The goal was a strict "need to know" standard for sharing medical data unless the patient signed a consent form for a particular use. There were some exceptions in the interest of the "common good", including gunshot and stab wounds, crime-related injuries, possible abuse cases, and infectious diseases.

A later supplement, the Omnibus Final Rule of 2013, updated HIPAA to add heavier financial penalties for organizations violating the law, grant patients the right to access their electronic records, and bring genetic data under HIPAA's protection. As Dr. Weisse notes [4], while complete control over access to personal medical records is the "holy grail of privacy rights advocates", our current systems of medical administration and insurance make this impossible.

While these laws are necessary and work in theory, in practice they have led to great confusion on both sides of the patient-physician boundary. Additionally, like much of the legislation governing emerging technologies (see facial recognition, or Siri always listening), they are woefully inadequate for covering technologies not yet established or even imagined.

The European Union's recent legislation, the General Data Protection Regulation (GDPR), goes much further in protecting personal data. There has been much discussion over the efficacy of the law, but there is no doubt that it is one of the most stringent data protection laws in the world. Unlike HIPAA and other US data protection laws, the GDPR requires organizations to use the highest possible privacy settings by default and limits data processing to six lawful bases, including consent, vital interests, and legal obligation.

Furthermore, no data can be collected until explicit consent for that purpose has been given, and that consent can be retracted at any time. This means that a single Terms of Service agreement cannot give a company free rein over a user's data indefinitely. Organizations that violate the GDPR face heavy fines: up to 20 million euros or 4% of the previous year's total revenue, whichever is higher. As an example, British Airways was fined 183 million pounds after poor security led to a skimming attack targeting 500,000 of its users.
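
To make this purpose-scoped consent model concrete, here is a minimal sketch of how a data pipeline could gate processing on per-purpose, revocable consent. The `ConsentRegistry` class and the purpose strings are hypothetical illustrations for this article, not a real compliance library.

```python
from datetime import datetime, timezone

# Hypothetical sketch of purpose-scoped, revocable consent tracking.
# It only illustrates the GDPR-style model: one explicit consent per
# purpose, retractable at any time.
class ConsentRegistry:
    def __init__(self):
        # Maps user_id -> {purpose: timestamp when consent was given}
        self._grants = {}

    def grant(self, user_id: str, purpose: str) -> None:
        """Record explicit consent for one specific purpose."""
        self._grants.setdefault(user_id, {})[purpose] = datetime.now(timezone.utc)

    def retract(self, user_id: str, purpose: str) -> None:
        """Consent can be withdrawn at any time."""
        self._grants.get(user_id, {}).pop(purpose, None)

    def allows(self, user_id: str, purpose: str) -> bool:
        """Check consent for this exact purpose before any processing."""
        return purpose in self._grants.get(user_id, {})


registry = ConsentRegistry()
registry.grant("user-42", "ad-targeting")

# A grant for one purpose does not authorize a different, new purpose:
assert not registry.allows("user-42", "mental-health-modeling")

registry.retract("user-42", "ad-targeting")
assert not registry.allows("user-42", "ad-targeting")
```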

Where These Measures Fall Short

Facebook's Suicide Algorithm

In 2017, after a series of live-streamed suicides, Facebook began scraping users' social media content without consent in order to build a suicide prevention tool. Beyond the non-consensual collection, one would think that assessments of mental health, depression, and suicidal ideation would be classified as sensitive health information, right? Well, under HIPAA, because Facebook is not a healthcare organization, it is not subject to the field's regulations.

This is a clear, yet at the time understandable, miss. When HIPAA was written, it seemed reasonable that only healthcare organizations would have access to these personal health identifiers (PHIs). With the advent of sophisticated artificial intelligence and tech giants with effectively endless resources, private, non-healthcare organizations are now attempting to innovate in the medical field without direct oversight.

What makes the implications concrete are the 3,500 cases of Facebook contacting law enforcement after its system flagged a user as suicidal. In one case, law enforcement even sent the user's personal information to the New York Times, a clear breach of privacy.

The European Union's GDPR effectively banned Facebook's collection methods, as explicit permission is required from users in order to collect mental health information. While Facebook's program does have the potential for good, the next steps for its ethical, effective use remain ambiguous.

23andMe's Genetic Data

Another regulatory blind spot is the popular genetic and ancestry testing company 23andMe (again not subject to HIPAA) and its selling of users' genetic information to pharmaceutical companies. There are potential risks of insurance companies using users' genetic data to identify pre-existing conditions before any symptoms emerge. This practice has been outlawed in some situations, specifically for health insurance, but not for life or disability insurance.

Some ethically ambiguous situations have already emerged from this practice. One example is Huntington's Disease, a late-onset brain disorder controlled by a single defective gene. The Huntington's Disease Society of America has an entire guide on choosing whether to get genetically tested, because while it is technically illegal for insurance companies to utilize this information, there is always a potential risk that it could be misused.

The Future and You

As technology continues to stride forward, we will inevitably hear more stories of regulation misses. It is vital that governments remain up-to-date with the implications of emerging innovations, and how to protect citizens’ data privacy in a world increasingly devoid of it.

As a data scientist, you must be cognizant of how your data is collected and utilized. Here’s a great set of questions to ask of yourself and your model.

Here’s a shortlist:

  1. Consent: users must give explicit consent for each and every new use of their personal data. This is a legal requirement in some jurisdictions, but a good practice in all cases.

  2. Transparency: especially in cases with concrete repercussions, can you explain how your model and data pipeline arrive at a decision? (A first step toward this is sketched after the list.)

  3. Accountability: evaluate the potential harm of a model and work to limit said harm. What is the potential for the model to be misinterpreted, both in good and bad faith?

  4. Anonymity: how will a user's identifying information be protected throughout all stages of the data science process? Who, at any point, has access to this data? Does identifying data even need to be in the dataset? If not, remove it (see the sketch after this list).

  5. Bias: what steps have been taken to understand the potential bias in a dataset? Could even missing values be a proxy for bias? See redlining. (A missingness audit is sketched below.)
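
On transparency (item 2), here is a minimal sketch of one common starting point: inspecting which features actually drive a model's decisions. It uses scikit-learn's permutation importance on a toy model; the data and feature names are invented for illustration, and real explainability work goes well beyond this.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy data: three invented features ("age", "visits", "income").
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

# Permutation importance: how much does shuffling each feature degrade
# the model's score? A coarse first answer to "why this decision?"
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, score in zip(["age", "visits", "income"], result.importances_mean):
    print(f"{name}: {score:.3f}")
```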
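
On anonymity (item 4), a minimal sketch of stripping or pseudonymizing identifying columns before analysis, assuming a pandas DataFrame with hypothetical column names. Note that real de-identification (e.g., HIPAA's Safe Harbor standard) involves far more than dropping a few columns, and salted hashes are still personal data under the GDPR.

```python
import hashlib
import pandas as pd

# Hypothetical dataset mixing identifiers with analysis data.
df = pd.DataFrame({
    "name": ["Ada", "Ben"],
    "email": ["ada@example.com", "ben@example.com"],
    "user_id": ["u1", "u2"],
    "diagnosis": ["HD", "none"],
})

# If identifying data isn't needed for the analysis, remove it outright.
df = df.drop(columns=["name", "email"])

# If a join key must survive, replace it with a salted hash
# (pseudonymization, not anonymization; handle accordingly).
SALT = "rotate-and-store-this-secret-separately"
df["user_id"] = df["user_id"].map(
    lambda uid: hashlib.sha256((SALT + uid).encode()).hexdigest()[:16]
)
print(df)
```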
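
And on bias (item 5), a minimal sketch of checking whether missing values concentrate in particular groups, which can silently encode the same patterns redlining did. The columns here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical loan dataset: does missing income data cluster by zip code?
df = pd.DataFrame({
    "zip_code": ["10001", "10001", "60601", "60601", "60601"],
    "income": [52000.0, np.nan, np.nan, np.nan, 48000.0],
})

# Missingness rate per group: a large disparity suggests the *absence*
# of data is itself correlated with group membership, i.e., a bias proxy.
missing_by_group = df["income"].isna().groupby(df["zip_code"]).mean()
print(missing_by_group)
# If rates differ sharply, investigate collection practices before modeling.
```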

Sources

[1] T. Truyen, W. Luo, D. Phung, S. Gupta, S. Rana, et al., A framework for feature extraction from hospital medical data with applications in risk prediction (2014), BMC Bioinformatics 15: 425–434.

[2] D. Wade, Ethics of collecting and using healthcare data: Primary responsibility lies with the organisations involved, not ethical review committees (2007), The BMJ 334: 1330–1331.

[3] S. Mann, J. Savulescu, and B. Sahakian, Facilitating the ethical use of health data for the benefit of society: Electronic health records, consent and the duty of easy rescue (2016), Philosophical Transactions of the Royal Society 374: 1–17.

[4] A. Weisse, HIPAA: a flawed piece of legislation (2014), Baylor University Medical Center Proceedings 27 (2): 163–165.

Originally published at https://towardsdatascience.com/the-ethics-of-data-collection-9573dc0ae240
