English Training

Notes on English papers read in preparation for the postgraduate recommendation (保研) interview.


🔬 A fork() in the road

The received wisdom suggests that Unix's unusual combination of fork() and exec() for process creation was an inspired design. In this paper, we argue that fork was a clever hack for machines and programs of the 1970s that has long outlived its usefulness and is now a liability. We catalog the ways in which fork is a terrible abstraction for the modern programmer to use, describe how it compromises OS implementations, and propose alternatives.

As the designers and implementers of operating systems, we should acknowledge that fork's continued existence as a first-class OS primitive holds back systems research, and deprecate it. As educators, we should teach fork as a historical artifact, and not the first process creation mechanism students encounter.

When the designers of Unix needed a mechanism to create processes, they added a peculiar new system call: fork(). As every undergraduate now learns, fork creates a new process identical to its parent (the caller of fork), with the exception of the system call's return value. The Unix idiom of fork() followed by exec() to execute a different program in the child is now well understood, but still stands in stark contrast to process creation in systems developed independently of Unix [e.g., 1, 30, 33, 54].
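The idiom in code, as a minimal sketch (using Python's thin wrappers over the POSIX calls for brevity): fork, then exec a different program in the child, then wait in the parent.

```python
import os
import sys

pid = os.fork()                      # child sees 0; parent sees the child's pid
if pid == 0:
    # Child: an identical copy of the parent, except for fork()'s return value.
    # Replace this process image with a different program.
    try:
        os.execvp("ls", ["ls", "-l"])
    except OSError:
        sys.exit(127)                # exec failed; never fall back into parent code
else:
    # Parent: wait for the child to terminate.
    _, status = os.waitpid(pid, 0)
    print("child exited with status", status)
```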

50 years later, fork remains the default process creation API on POSIX: Atlidakis et al. [8] found 1304 Ubuntu packages (7.2% of the total) calling fork, compared to only 41 uses of the more modern posix_spawn(). Fork is used by almost every Unix shell, major web and database servers (e.g., Apache, PostgreSQL, and Oracle), Google Chrome, the Redis key-value store, and even Node.js. The received wisdom appears to hold that fork is a good design. Every OS textbook we reviewed [4, 7, 9, 35, 75, 78] covered fork in uncritical or positive terms, often noting its "simplicity" compared to alternatives. Students today are taught that "the fork system call is one of Unix's great ideas" [46] and "there are lots of ways to design APIs for process creation; however, the combination of fork() and exec() are simple and immensely powerful . . . the Unix designers simply got it right" [7].
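For comparison, posix_spawn() combines process creation and program loading into a single call, so there is no window in which the child runs as a copy of the parent. A minimal sketch of the same "run ls -l" task as the fork/exec example above, using Python's os.posix_spawn wrapper (assuming a POSIX platform and Python 3.8+):

```python
import os

# posix_spawn creates a new process running the given program directly,
# rather than duplicating the parent and then calling exec in the copy.
pid = os.posix_spawn("/bin/ls", ["ls", "-l"], os.environ)
_, status = os.waitpid(pid, 0)
print("child exited with status", status)
```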

Our goal is to set the record straight. Fork is an anachronism: a relic from another era that is out of place in modern systems where it has a pernicious and detrimental impact. As a community, our familiarity with fork can blind us to its faults (§4). Generally acknowledged problems with fork include that it is not thread-safe, it is inefficient and unscalable, and it introduces security concerns. Beyond these limitations, fork has lost its classic simplicity; it today impacts all the other operating system abstractions with which it was once orthogonal. Moreover, a fundamental challenge with fork is that, since it conflates the process and the address space in which it runs, fork is hostile to user-mode implementation of OS functionality, breaking everything from buffered IO to kernel-bypass networking. Perhaps most problematically, fork doesn't compose—every layer of a system from the kernel to the smallest user-mode library must support it.
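One concrete illustration of the buffered-IO problem (a toy sketch, not from the paper): when standard output is redirected to a file it is block-buffered, so bytes queued before fork() are copied into the child along with the rest of the address space and end up written twice.

```python
import os
import sys

# Run with output redirected, e.g.:  python3 demo.py > out.txt
# The output file will contain the line twice: once flushed by the child at
# exit and once by the parent, because fork() duplicated the user-space buffer.
sys.stdout.write("queued before fork\n")   # still sitting in the stdio buffer

if os.fork() == 0:
    sys.exit(0)        # child's interpreter shutdown flushes its copy of the buffer
os.wait()              # parent exits afterwards and flushes the same bytes again
```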

We illustrate the havoc fork wreaks on OS implementations using our experiences with prior research systems (§5). Fork limits the ability of OS researchers and developers to innovate because any new abstraction must be special-cased for it. Systems that support fork and exec efficiently are forced to duplicate per-process state lazily. This encourages the centralisation of state, a major problem for systems not structured using monolithic kernels. On the other hand, research systems that avoid implementing fork are unable to run the enormous body of software that uses it.

We end with a discussion of alternatives (§6) and a call to action (§7): fork should be removed as a first-class primitive of our systems, and replaced with good-enough emulation for legacy applications. It is not enough to add new primitives to the OS—fork must be removed from the kernel.

📖 Core Vocabulary
The received wisdom (noun phrase): 公认的观点;传统观念 (conventional or widely accepted opinion or belief)
inspired (adj.): 有灵感的;巧妙的 (showing creativity or having a brilliant idea)
outlived (v.): 比...活得久;过时了 (to continue to exist longer than something else)
liability (n.): 负担;累赘 (something that causes problems or disadvantages)
catalog (v.): 编目;列举 (to make a systematic list of items)
holds back (phrasal v.): 阻碍;妨碍 (to prevent progress or development)
deprecate (v.): 反对;不赞成 (to disapprove of or express criticism)
artifact (n.): 人工制品;遗迹 (an object made by humans, typically of historical interest)
peculiar (adj.): 奇特的;独特的 (strange or odd; distinctive)
with the exception of (prep. phrase): 除了...之外 (apart from; excluding)
idiom (n.): 惯用法;习语 (a characteristic mode of expression)
stand in stark contrast (verb phrase): 形成鲜明对比 (to be very different from something else)
uncritical (adj.): 不加批判的;盲从的 (not expressing criticism or careful judgment)
simplicity (n.): 简单性;简洁 (the quality of being easy to understand or do)
immensely (adv.): 极其;非常 (to a great extent; extremely)
set the record straight (idiom): 澄清事实;纠正误解 (to correct misinformation or misconceptions)
anachronism (n.): 时代错误;过时的事物 (something that is out of its proper time)
relic (n.): 遗物;残留物 (an object surviving from an earlier time)
pernicious (adj.): 有害的;恶性的 (having a harmful effect in a gradual way)
detrimental (adj.): 有害的;不利的 (tending to cause harm)
familiarity (n.): 熟悉;亲近 (close acquaintance with or knowledge of something)
orthogonal (adj.): 正交的;独立的 (independent; not affecting each other)
conflate (v.): 合并;混合 (to combine two or more things into one)
hostile (adj.): 敌对的;不利的 (unfriendly; opposed)
functionality (n.): 功能性;实用性 (the quality of being suited to serve a purpose well)
havoc (n.): 大破坏;混乱 (widespread destruction or chaos)
wreak (v.): 造成;引起 (to cause something bad to happen)
special-case (v.): 特殊处理 (to handle as an exception or special situation)
centralisation (n.): 集中化 (the process of concentrating control in a central authority)
monolithic (adj.): 单体的;整体的 (formed of a single large block; unified)
legacy (adj.): 遗留的;传统的 (relating to old or outdated computer systems)

🔍 Extracting Training Data from Large Language Models

It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.

We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences is included in just one document in the training data.

We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. Worryingly, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.

Language models (LMs)—statistical models which assign a probability to a sequence of words—are fundamental to many natural language processing tasks. Modern neural-network-based LMs use very large model architectures (e.g., 175 billion parameters) and train on massive datasets (e.g., nearly a terabyte of English text). This scaling increases the ability of LMs to generate fluent natural language, and also allows them to be applied to a plethora of other tasks, even without updating their parameters.
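Concretely, such an LM factorizes the probability of a token sequence left to right, and generating text means repeatedly sampling the next token from the conditional distribution:

```latex
\Pr(w_1, w_2, \dots, w_n) \;=\; \prod_{i=1}^{n} \Pr\left(w_i \mid w_1, \dots, w_{i-1}\right)
```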

At the same time, machine learning models are notorious for exposing information about their (potentially private) training data—both in general and in the specific case of language models. For instance, for certain models it is known that adversaries can apply membership inference attacks to predict whether or not any particular example was in the training data.
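As a toy illustration of the idea (not the paper's method), the simplest membership-inference rule thresholds the model's loss on a candidate example, with the threshold calibrated on data known not to be in the training set:

```python
import numpy as np

def calibrate_threshold(nonmember_losses, target_fpr=0.05):
    # Pick a loss cutoff so that only ~target_fpr of known non-members fall below it.
    return float(np.quantile(nonmember_losses, target_fpr))

def predict_member(example_loss, threshold):
    # Guess "was in the training data" when the model is unusually confident.
    return example_loss < threshold
```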

Such privacy leakage is typically associated with overfitting—when a model's training error is significantly lower than its test error—because overfitting often indicates that a model has memorized examples from its training set. Indeed, overfitting is a sufficient condition for privacy leakage and many attacks work by exploiting overfitting.

The association between overfitting and memorization has erroneously led many to assume that state-of-the-art LMs will not leak information about their training data. Because these models are often trained on massive de-duplicated datasets only for a single epoch, they exhibit little to no overfitting. Accordingly, the prevailing wisdom has been that "the degree of copying with respect to any given work is likely to be, at most, de minimis" and that models do not significantly memorize any particular training example.

Contributions. In this work, we demonstrate that large language models memorize and leak individual training examples. In particular, we propose a simple and efficient method for extracting verbatim sequences from a language model's training set using only black-box query access. Our key insight is that, although training examples do not have noticeably lower losses than test examples on average, certain worst-case training examples are indeed memorized.

In our attack, we first generate a large, diverse set of high-likelihood samples from the model, using one of three general-purpose sampling strategies. We then sort each sample using one of six different metrics that estimate the likelihood of each sample using a separate reference model (e.g., another LM), and rank highest the samples with an abnormally high likelihood ratio between the two models.
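A minimal sketch of one such metric (not the authors' code; the model names here are illustrative) using the Hugging Face transformers API: score each generated candidate by its perplexity under the large target model divided by its perplexity under a smaller reference model, and inspect the candidates the target model is disproportionately confident about.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = GPT2TokenizerFast.from_pretrained("gpt2")
target = GPT2LMHeadModel.from_pretrained("gpt2-xl").to(device).eval()   # model under attack
reference = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()   # smaller reference LM

@torch.no_grad()
def perplexity(model, text):
    ids = tok(text, return_tensors="pt").input_ids.to(device)
    loss = model(ids, labels=ids).loss          # mean per-token cross-entropy
    return torch.exp(loss).item()

def rank_candidates(samples):
    # A low target/reference perplexity ratio means the target model is unusually
    # sure about the sample relative to the reference model, which is the
    # signature of a memorized training sequence.
    return sorted(samples, key=lambda s: perplexity(target, s) / perplexity(reference, s))
```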

Our attacks directly apply to any language model, including those trained on sensitive and non-public data. We use the GPT-2 model released by OpenAI as a representative language model in our experiments. We choose to attack GPT-2 to minimize real-world harm—the GPT-2 model and original training data source are already public.

To make our results quantitative, we give a testable definition of memorization. We then generate 1,800 candidate memorized samples, 100 under each of the 3 × 6 attack configurations, and find that over 600 of them are verbatim samples from the GPT-2 training data (confirmed in collaboration with the creators of GPT-2). In the best attack configuration, 67% of candidate samples are verbatim training examples. Our most obviously sensitive attack extracts the full name, physical address, email address, phone number, and fax number of an individual (see Figure 1). We comprehensively analyze our attack, including studying how model size and string frequency affect memorization, as well as how different attack configurations change the types of extracted data.

We conclude by discussing numerous practical strategies to mitigate privacy leakage. For example, differentially-private training is theoretically well-founded and guaranteed to produce private models if applied at an appropriate record level, but it can result in longer training times and typically degrades utility. We also make recommendations, such as carefully de-duplicating documents, that empirically will help to mitigate memorization but cannot prevent all attacks.
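A toy sketch (not from the paper) of the simplest form of that de-duplication recommendation, dropping exact duplicates by normalized content hash; production pipelines typically also remove near-duplicates, e.g. with MinHash over n-grams.

```python
import hashlib

def dedupe_exact(documents):
    # Keep the first occurrence of each document, comparing a hash of the
    # whitespace- and case-normalized text.
    seen, kept = set(), []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```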

📖 Core Vocabulary
scrapes (n.): 网络爬取的数据 (data collected by automated web crawling)
verbatim (adv.): 逐字地;原文地 (in exactly the same words; word for word)
IRC (n.): 网络聊天协议 (Internet Relay Chat protocol)
UUIDs (n.): 通用唯一标识符 (Universally Unique Identifiers)
drawing lessons (verb phrase): 总结经验;吸取教训 (learning from experience or analysis)
fluent (adj.): 流畅的;自然的 (smooth and natural in expression)
a plethora of (phrase): 大量的;过多的 (a large or excessive amount of)
notorious (adj.): 臭名昭著的;以...著称的 (famous for something bad)
Indeed (adv.): 确实;实际上 (used to emphasize a statement)
sufficient condition (noun phrase): 充分条件 (a condition that guarantees a result)
de-duplicated (adj.): 去重的 (with duplicate content removed)
with respect to (prep. phrase): 关于;就...而言 (concerning or regarding)
high-likelihood samples (noun phrase): 高概率样本 (samples with high probability from the model)
quantitative (adj.): 定量的;数量的 (measured by quantity rather than quality)
differentially-private training (noun phrase): 差分隐私训练 (training method that protects individual privacy)
well-founded (adj.): 有根据的;基础扎实的 (based on good reasons or evidence)
empirically (adv.): 经验上;实证地 (based on observation or experience)