Unsupervised Alignment of Privacy Policies using Hidden Markov Models

Rohan Ramanath  Fei Liu  Norman Sadeh  Noah A. Smith
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA

To support empirical study of online privacy policies, as well as tools for users with privacy concerns, we consider the problem of aligning sections of a thousand policy documents, based on the issues they address. We apply an unsupervised HMM; in two new (and reusable) evaluations, we find the approach more effective than clustering and topic models.



1 Introduction

Privacy policy documents are verbose, often esoteric legal documents that many people encounter as clients of companies that provide services on the web. McDonald and Cranor (2008) showed that, if users were to read the privacy policies of every website they access during the course of a year, they would end up spending a substantial amount of their time doing just that and would often still not be able to answer basic questions about what these policies really say. Unsurprisingly, many people do not read them [9].

Such policies therefore offer an excellent opportunity for NLP tools that summarize or extract key information that (i) helps users understand the implications of agreeing to these policies and (ii) helps legal analysts understand the contents of these policies and make recommendations on how they can be improved or made more clear. Past applications of NLP have sought to parse privacy policies into machine-readable representations [5] or extract sub-policies from larger documents [14]. Machine learning has been applied to assess certain attributes of policies [7, 1, 6, 15].

This paper instead analyzes policies in aggregate, seeking to align sections of policies. This task is motivated by an expectation that many policies will address similar issues,11Personal communication, Joel Reidenberg. such as collection of a user’s contact, location, health, and financial information, sharing with third parties, and deletion of data. This expectation is supported by recommendation by privacy experts [10] and policymakers [9]; in the financial services sector, the Gramm-Leach-Bliley Act requires these institutions to address a specific set of issues. Aligning policy sections is a first step toward our aforementioned summarization and extraction goals.

We present the following contributions:

  • A new corpus of over 1,000 privacy policies gathered from widely used websites, manually segmented into subtitled sections by crowdworkers (§2).

  • An unsupervised approach to aligning the policy sections based on the issues they discuss. For example, sections that discuss “user data on the company’s server” should be grouped together. The approach is inspired by the application of hidden Markov models to sequence alignment in computational biology (Durbin et al., 1998; §3).

  • Two reusable evaluation benchmarks for the resulting alignment of policy sections (§4). We demonstrate that our approach outperforms naïve methods (§5).

Our corpus and benchmarks are available at http://usableprivacy.org/data.

2 Data Collection

We collected 1,010 unique privacy policy documents from the top websites ranked by Alexa.com.22http://www.alexa.com These policies were collected during a period of six weeks during December 2013 and January 2014. They are a snapshot of privacy policies of mainstream websites covering fifteen of Alexa.com’s seventeen categories (Table 1).33The “Adult” category was excluded; the “World” category was excluded since it contains mainly popular websites in different languages, and we opted to focus on policies in English in this first stage of research, though mulitlingual policy analysis presents interesting challenges for future work.

Business Computers Games Health
Home News Recreation Shopping
Arts Kids and Teens Reference Regional
Science Society Sports
Table 1: Fifteen website categories provided by Alexa.com. We collect privacy policies from the top 100 websites in each.

Finding a website’s policy is not trivial. Though many well-regulated commercial websites provide a “privacy” link on their homepages, not all do. We found university websites to be exceptionally unlikely to provide such a link. Even once the policy’s URL is identified, extracting the text presents the usual challenges associated with scraping documents from the web. Since every site is different in its placement of the document (e.g., buried deep within the website, distributed across several pages, or mingled together with Terms of Service) and format (e.g., HTML, PDF, etc.), and since we wish to preserve as much document structure as possible (e.g., section labels), full automation was not a viable solution.

We therefore crowdsourced the privacy policy document collection using Amazon Mechanical Turk. For each website, we created a HIT in which a worker was asked to copy and paste the following privacy policy-related information into text boxes: (i) privacy policy URL; (ii) last updated date (or effective date) of the current privacy policy; (iii) privacy policy full text; and (iv) the section subtitles in the top-most layer of the privacy policy. To identify the privacy policy URL, workers were encouraged to go to the website and search for the privacy link. Alternatively, they could form a search query using the website name and “privacy policy” (e.g., “Amazon.com privacy policy”) and search in the returned results for the most appropriate privacy policy URL. Given the privacy policy full text and the section subtitles, we partition the full privacy document into different sections, delimited by the section subtitles. A privacy policy is then converted into XML.

Each HIT was completed by three workers, paid $0.05, for a total cost of $380 (including Amazon’s surcharge).

Websites with Unique privacy Unique privacy Ave. sections Ave. tokens
Category privacy URL policies policies w/ date per policy per policy
Arts 94 80 72 11.1 (± 3.8) 2894 (± 1815)
Business 100 95 75 10.1 (± 4.9) 2531 (± 1562)
Computers 100 78 62 10.7 (± 4.9) 2535 (± 1763)
Games 92 80 51 10.2 (± 4.9) 2662 (± 2267)
Health 92 86 57 10.0 (± 4.4) 2325 (± 1891)
Home 100 84 68 11.5 (± 3.8) 2493 (± 1405)
Kids and Teens 96 86 62 10.3 (± 4.5) 2683 (± 1979)
News 96 91 68 10.7 (± 3.9) 2588 (± 2493)
Recreation 98 97 67 11.9 (± 4.5) 2678 (± 1421)
Reference 84 86 55 9.9 (± 4.1) 2002 (± 1454)
Regional 98 91 72 11.2 (± 4.2) 2557 (± 1359)
Science 71 75 49 9.2 (± 4.1) 1705 (± 1136)
Shopping 100 99 84 12.0 (± 4.1) 2683 (± 1154)
Society 96 94 65 10.2 (± 4.6) 2505 (± 1587)
Sports 96 62 38 10.9 (± 4.0) 2222 (± 1241)
Average 94.2 85.6 63.0 10.7 (± 4.3) 2471 (± 1635)
Table 2: Statistics of each website category, including (i) the number of websites with an identified privacy policy link; (ii) number of unique privacy policies in each category (note that in rare cases, multiple unique privacy policies were identified for the same website, e.g., a website that contains links to both new and old versions of its privacy policy); (iii) number of websites with an identified privacy modification date; (iv) average number of sections per policy; (v) average number of tokens per policy.

3 Approach

Given the corpus of privacy policies described in §2, we designed a model to efficiently infer an alignment of policy sections. While we expect that different kinds of websites will likely address different privacy issues, we believe that many policies will discuss roughly the same set of issues. Aligning the policies is a first step in a larger effort to (i) automatically analyze policies to make them less opaque to users and (ii) support legal experts who wish to characterize the state of privacy online and make recommendations [7, 1, 6].

We are inspired by multiple sequence alignment methods in computational biology [8] and by Barzilay and Lee (2004), who described a hidden Markov model (HMM) for document content where each state corresponds to a distinct topic and generates sentences relevant to that topic according to a language model. We estimate an HMM-like model on our corpus, exploiting similarity across privacy policies to the extent it is evident in the data. In our formulation, each hidden state corresponds to an issue or topic, characterized by a distribution over words and bigrams appearing in privacy policy sections addressing that issue. The transition distribution captures tendencies of privacy policy authors to organize these sections in similar orders, though with some variation.

The generative story for our model is as follows. Let 𝒮 denote the set of hidden states.

  1. 1.

    Choose a start state y1 from 𝒮 according to the start-state distribution.

  2. 2.

    For t=1,2,, until yt is the stopping state:

    1. (a)

      Sample the tth section of the document by drawing a bag of terms, 𝒐t, according to the emission multinomial distribution for state yt. Note the difference from traditional HMMs, in which a single observation symbol is drawn at each time step. 𝒐t is generated by repeatedly sampling from a distribution over terms that includes all unigrams and bigrams except those that occur in fewer than 5% of the documents and in more than 98% of the documents. This filtering rule was designed to eliminate uninformative stopwords as well as company-specific terms (e.g., the name of the company).44The emission distributions are not a proper language models (e.g., a bigram may be generated by as many as three draws from the emission distribution: once for each unigram it contains and once for the bigram).

    2. (b)

      Sample the next state, yt+1, according to the transition distribution over 𝒮.

This model can nearly be understood as a hidden semi-Markov model [3], though we treat the section lengths as observable. Indeed, our model does not even generate these lengths, since doing so would force the states to “explain” the length of each section, not just its content. The likelihood function for the model is shown in Figure 1.

Figure 1: The likelihood function for the alignment model (one privacy policy). yt is the hidden state for the tth section, 𝒐t is the bag of unigram and bigram terms observed in that section, and t is the size of the bag. Start-state, emission, and transition distributions are denoted respectively by π, η, and γ. yn+1 is the silent stopping state.

The parameters of the model are almost identical to those of a classic HMM (start state distribution, emission distributions, and transition distributions), except that emissions are characterized by multinomial rather than a categorical distributions. These are learned using Expectation-Maximization, with a forward-backward algorithm to calculate marginals (E-step) and smoothed maximum likelihood estimation for the M-step [13]. After learning, the most probable assignment of a policy’s sections to states can be recovered using a variant of the Viterbi algorithm.

We consider three HMM variants. “Vanilla” allows all transitions. The other two posit an ordering on the states 𝒮={s1,s2,,sK}, and restrict the set of transitions that are possible, imposing bias on the learner. “All Forward” only allows sk to transition to {sk,sk+1,,sK}. “Strict Forward” only allows sk to transition to sk or sk+1.

4 Evaluation

Developing a gold-standard alignment of privacy policies would either require an interface that allows each annotator to interact with the entire corpus of previously aligned documents while reading the one she is annotating, or the definition (and likely iterative refinement) of a set of categories for manually labeling policy sections. These were too costly for us to consider, so we instead propose two generic methods to evaluate models for sequence alignment of a collection of documents with generally similar content. Though our model (particularly the restricted variants) treats the problem as one of alignment, our evaluations consider groupings of policy sections. In the sequel, a grouping on a set X is defined as a collection of subsets XiX; these may overlap (i.e., there might be xXiXj) and need not be exhaustive (i.e., there might be xXiXi).

4.1 Evaluation by Human QA

This study was carried out as part of a larger collaboration with legal scholars who study privacy. In that work, we have formulated a set of nine multiple choice questions about a single policy that ask about collection of contact, location, health, and financial information, sharing of each with third parties, and deletion of data.55The questions are available in an online appendix at http://usableprivacy.org/data. The questions were inspired primarily by the substantive interest of these domain experts—not by this particular algorithmic study.

For thirty policies, we obtained answers from each of six domain experts who were not involved in designing the questions. For the purposes of this study, the experts’ answers are not important. In addition to answering each question for each policy, we also asked each expert to copy and paste the text of the policy that contains the answer. Experts were allowed to select as many sections for each question as they saw fit, since answering some questions may require synthesizing information from different sections.

For each of the nine questions, we take the union of all policy sections that contain text selected by any annotator as support for her answer. This results in nine groups of policy sections, which we call answer-sets denoted A1,,A9. Our method allows these to overlap (63% of the sections in any Ai occurred in more than one Ai), and they are not exhaustive (since many sections of the policies were not deemed to contain answers to any of the nine questions by any expert).

Together, these can be used as a gold standard grouping of policy sections, against which we can compare our system’s output. To do this, we define the set of section pairs that are grouped together in answer sets, G=|{a,bAia,b}|, and a similar set of pairs H from a model’s grouping. From these sets, we calculate estimates of precision (|GH|/|H|) and recall (|GH|/|G|).

One shortcoming of this approach, for which the second evaluation seeks to compensate, is that a very small, and likely biased, subset of the policy sections is considered.

4.2 Evaluation by Direct Judgment

We created a separate gold standard of judgments of pairs of privacy policy sections. The data selected for judgment was a sample of pairs stratified by a simple measure of text similarity. We derived unigram tfidf vectors for each section in each of 50 randomly sampled policies per category. We then binned pairs of sections by cosine similarity (into four bins bounded by 0.25, 0.5, and 0.75). We sampled 994 section pairs uniformly across the 15 categories’ four bins each.

Crowdsourcing was used to determine, for each pair, whether the two sections should be grouped together. A HIT consisted of a pair of policy sections and a multiple choice question, “After reading the two sections given below, would you say that they broadly discuss the same topic?” The possible answers were:

  1. 1.

    Yes, both the sections essentially convey the same message in a privacy policy.

  2. 2.

    Although, the sections do not convey the same message, the broadly discuss the same topic. (For ease of understanding, some examples of content on “the same topic” were included.)

  3. 3.

    No, the sections discuss two different topics.

The first two options were considered a “yes” for the majority voting and for defining a gold standard. Every section-pair was annotated by at least three annotators (as many as 15, increased until an absolute majority was reached). Turkers with an acceptance rate greater than 95% with an experience of at least 100 HITs were allowed and paid $0.03 per annotation. The total cost including some initial trials was $130. 535 out of the 994 pairs were annotated to be similar in topic. An example is shown in Figure 2.

As in §4.1, we calculate precision and recall on pairs. This does not penalize the model for grouping together a “no” pair; we chose it nonetheless because it is interpretable.

Section 5 of classmates.com:
[46 words] … You may also be required to use a password to access certain pages on the Services where certain types of your personal information can be changed or deleted. … [113 words]
Section 2 of 192.com:
[50 words] … This Policy sets out the means by which You can have Your Personal Information removed from the Service. 192.com is also committed to keeping Personal Information of users of the Service secure and only to use it for the purposes set out in this Policy and as agreed by You. … [24 words]
Figure 2: Selections from sections that discuss the issue of “deletion of personal information” and were labeled as discussing the same issue by crowdworkers. Both naïve grouping and LDA put them in two different groups, but the Strict Forward variant of our model correctly groups them together.

5 Experiment

Precision Recall F1
Mean S.D. Mean S.D. Mean S.D.
Clust. 0.63 0.30 0.40
LDA 0.56 0.03 0.20 0.05 0.29 0.06
Vanilla 0.62 0.04 0.41 0.04 0.49 0.03
All F. 0.63 0.03 0.47 0.12 0.53 0.06
Strict F. 0.62 0.05 0.46 0.18 0.51 0.07
Clust. 0.62 0.23 0.34
LDA 0.57 0.03 0.18 0.01 0.28 0.02
Vanilla 0.57 0.01 0.30 0.03 0.39 0.02
All F. 0.58 0.02 0.32 0.06 0.41 0.04
Strict F. 0.58 0.03 0.32 0.14 0.40 0.08
Table 3: Evaluation by human QA (above) and direct judgment (below), aggregated across ten independent runs where appropriate (see text). Vanilla, All F(orward), and Strict F(orward) are three variants of our HMM.

In this section, we evaluate the three HMM variants described in §3, and two baselines, using the methods in §4. All of the methods require the specification of the number of groups or hidden states, which we fix to ten, the average number of sections per policy.


Our first baseline is a greedy divisive clustering algorithm66As implemented in CLUTO, http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview to partition the policy sections into ten clusters. In this method, the desired K-way clustering solution is computed by performing a sequence of bisections. The implementation uses unigram features and cosine similarity. Our second baseline is latent Dirichlet allocation (LDA; Blei et al., 2003), with ten topics and online variational Bayes for inference [11].77As implemented in gensim [16]. To more closely match our models, LDA is given access to the same unigram and bigram tokens.


Table 3 shows the results. For LDA and the HMM variants (which use random initialization), we report mean and standard deviation across ten independent runs. All three variants of the HMM improve over the baselines on both tasks, in terms of F1. In the human QA evaluation, this is mostly due to recall improvements (i.e., more pairs of sections relevant to the same policy question were grouped together).

The three variants of the model performed similarly on average, though Strict Forward had very high variance. Its maximum performance across ten runs was very high (67% and 53% F1 on the two tasks), suggesting the potential benefits of good initialization or model selection.

6 Conclusion

We considered the task of aligning sections of a collection of roughly similarly-structured legal documents, based on the issues they address. We introduced an unsupervised model for this task along with two new (and reusable) evaluations. Our experiments show the approach to be more effective than clustering and topic models. The corpus and evaluation data have been made available at http://usableprivacy.org/data . In future work, policy section alignments will be used in automated analysis to extract useful information for users and privacy scholars.


The authors gratefully acknowledge helpful comments from Lorrie Cranor, Joel Reidenberg, Florian Schaub, and several anonymous reviewers. This research was supported by NSF grant SaTC-1330596.


  • [1] W. Ammar, S. Wilson, N. Sadeh and N. A. Smith(2012) Automatic categorization of privacy policies: A pilot study. Technical report Technical Report CMU-LTI-12-019, Carnegie Mellon University. Cited by: 1, 3.
  • [2] R. Barzilay and L. Lee(2004) Catching the drift: probabilistic content models, with applications to generation and summarization. Cited by: 3.
  • [3] L. E. Baum and T. Petrie(1966) Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics 37, pp. 1554–1563. Cited by: 3.
  • [4] D. M. Blei, A. Y. Ng and M. I. Jordan(2003) Latent Dirichlet allocation. the Journal of machine Learning research 3, pp. 993–1022. Cited by: 5.
  • [5] C. A. Brodie, C. Karat and J. Karat(2006) An empirical study of natural language parsing of privacy policy rules using the SPARCLE policy workbench. Cited by: 1.
  • [6] E. Costante, J. Hartog and M. Petković(2013) What websites know about you. in R. Pietro, J. Herranz, E. Damiani and R. State (Eds.), Data Privacy Management and Autonomous Spontaneous Security, Lecture Notes in Computer Science, Vol. 7731, pp. 146–159. External Links: ISBN 978-3-642-35889-0, Document, Link Cited by: 1, 3.
  • [7] E. Costante, Y. Sun, M. Petković and J. den Hartog(2012) A machine learning solution to assess privacy policy completeness. Cited by: 1, 3.
  • [8] R. Durbin, S. R. Eddy, A. Krogh and G. Mitchison(1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press. Cited by: 1, 3.
  • [9] Federal Trade Commission(2012) Protecting consumer privacy in an era of rapid change: recommendations for businesses and policymakers. External Links: Link Cited by: 1, 1.
  • [10] R. Gellman(2014) Fair information practices: a basic history (v. 2.11). Note: Available at \urlhttp://www.bobgellman.com/rg-docs/rg-FIPShistory.pdf Cited by: 1.
  • [11] M. D. Hoffman, D. M. Blei and F. R. Bach(2010) Online learning for latent Dirichlet allocation. Cited by: 5.
  • [12] A. M. McDonald and L. F. Cranor(2008) The cost of reading privacy policies. I/S: A Journal of Law and Policy for the Information Society 4 (3). Cited by: 1.
  • [13] L. Rabiner(1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (2), pp. 257–286. Cited by: 3.
  • [14] X. Xiao, A. Paradkar, S. Thummalapenta and T. Xie(2012) Automated extraction of security policies from natural-language software documents. Cited by: 1.
  • [15] S. Zimmeck and S. M. Bellovin(2013) Machine learning for privacy policy. Cited by: 1.
  • [16] R. Řehůřek and P. Sojka(2010) Software framework for topic modelling with large corpora. Cited by: 5.