This repository was archived by the owner on Dec 22, 2021. It is now read-only.
32 changes: 32 additions & 0 deletions analyses/2009_03_noahwalugembe__Internet-Jones-evolution.txt
@@ -0,0 +1,32 @@
Verification / evolution of "Internet Jones" paper #26
Buy
Contributor

Suggested change
- Buy
+ By

Walugembe Francis Noah
noahwalugembe@gmail.com
Contributor

It's not necessary to put your contact information in the analysis. But I think it's okay if you do (ping @mlopatka to confirm)

Contributor

I have no strong preference either way.

The license applied to this particular work is inherited from the (parent) overscripted repo. However, if Walugembe Francis Noah would like to ensure visible attribution of this specific PR in this form, I see no problem with the inclusion of contact information here.

That said, it is NOT a requirement of the Outreachy application process to associate contact information in the PR like this and it may be worth considering the public visibility of this information.


Introduction

Third-party web tracking is the practice by which entities (“trackers”) embedded in webpages re-identify users as they browse the web, collecting information about the websites that they visit. According to Lerner, Simpson, Kohno and Roesner (2016), web tracking is typically done for the purposes of website analytics, targeted advertising, and other forms of personalization (e.g., social media content). In this work I am evaluating the contribution of "Internet Jones" paper #26 starting with its insight on TrackingExcavator and a longitudinal measurement study of third-party cookie-based web tracking on the Wayback Machine. I will also show how the third-party web tracking ecosystem has evolved since its beginnings, according to the "Internet Jones" paper.
Contributor

This is a great introduction and I appreciate you clearly laying out background and the goal of your analysis.

In this work I am evaluating the contribution of "Internet Jones" paper #26 starting with...

One small thing: #26 refers to the issue not the paper - joining the two together like this doesn't make sense. You could omit the #26 so it reads In this work I am evaluating the contribution of "Internet Jones" paper starting with...

One bigger thing: To evaluate the Internet Jones paper was not the intention of #26. The goal of #26 was to apply applicable methodologies from the Internet Jones paper to the OverScripted dataset to compare the results we see with the results that were presented in the Internet Jones paper.


TrackingExcavator

The Wayback Machine contains archives of full webpages, including JavaScript, stylesheets, and embedded resources, dating back to 1996. To leverage this archive, Lerner, Simpson, Kohno and Roesner (2016) designed and implemented a retrospective tracking detection and analysis platform called TrackingExcavator, which allowed them to conduct a longitudinal study of third-party tracking from 1996 to the present (2016). TrackingExcavator logs in-browser behaviors related to web tracking, including: third-party requests, cookies attached to requests, cookies programmatically set by JavaScript, and the use of other relevant JavaScript APIs (e.g., HTML5 LocalStorage and APIs used in browser fingerprinting, such as enumerating installed plugins). TrackingExcavator also runs on both live and archived versions of websites.
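
To make the notion of a "third-party request" concrete, here is a minimal Python sketch. It is not TrackingExcavator's code: the function names `naive_site` and `is_third_party` are hypothetical, and the "last two hostname labels" rule is a simplifying assumption that is wrong for suffixes such as `co.uk` (a real analysis would likely use the Public Suffix List).

```python
from urllib.parse import urlsplit


def naive_site(url):
    """Return a rough 'site' for a URL: the last two labels of its hostname.

    Assumption for illustration only: this ignores multi-label public
    suffixes (e.g. 'co.uk'), so it is a heuristic, not a correct eTLD+1.
    """
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    return ".".join(labels[-2:])


def is_third_party(page_url, request_url):
    """True when the request goes to a different site than the page itself."""
    return naive_site(page_url) != naive_site(request_url)


def third_party_requests(page_url, request_urls):
    """Keep only the requests that are third-party relative to the page."""
    return [u for u in request_urls if is_third_party(page_url, u)]
```

With this heuristic, a request from `http://www.example.com/` to `http://cdn.example.com/a.js` counts as first-party, while a request to `http://tracker.net/pixel.gif` counts as third-party.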

Wayback Machine

According to Lerner, Simpson, Kohno and Roesner (2016), the Wayback Machine provides a unique and comprehensive source of historical web data. However, it was not created for the purpose of studying third-party web tracking and is thus imperfect for that use, but they stated that Nevertheless, the only way to study web tracking prior to explicit measurements targeting it is to leverage materials previously archived for other purposes, which is true because it is better to start from something than to reinvent from scratch. At this point I am going to mention some of the failures identified by Lerner, Simpson, Kohno and Roesner (2016).
Contributor

When you are quoting work it is very important to use quotation marks and citations to make it clear that you have used the original authors' words. Alternatively you can re-write evidence / claims / conclusions in your own words. Here is the passage from the original paper that is too close, in my eyes, for you to claim as your own words.

To be fair, you do say they state that Nevertheless... but this needs to be they state that "nevertheless...

Additionally, your use of "they state that" may imply that earlier parts of the sentence are your own words.

Page 7, Sec 4

The Wayback Machine provides a unique and comprehensive source of historical web data. However, it was not created for the purpose of studying third-party web tracking and is thus imperfect for that use. Nevertheless, the only way to study web tracking prior to explicit measurements targeting it is to leverage materials previously archived for other purposes.

The researchers realized that the Wayback Machine may fail to archive resources for any number of reasons. For example, the domain serving a certain resource may have been unavailable at the time of the archive, or changes in the Wayback Machine’s crawler may result in different archiving behaviors over time. As shown in Table 2, missing archives are rare. Although the Wayback Machine’s archived pages execute the corresponding archived JavaScript within the browser when TrackingExcavator visits them, the Wayback Machine does not execute JavaScript during its archival crawls of the web. Instead, it attempts to statically extract URLs from HTML and JavaScript to find additional sites to archive. It then modifies the archived JavaScript, rewriting the URLs in the included script to point to the archived copy of the resource. This process may fail, particularly for dynamically generated URLs. As a result, when TrackingExcavator visits archived pages, dynamically generated URLs not properly redirected to their archived versions will cause the page to attempt to make a request to the live web, i.e., “escape” the archive. TrackingExcavator blocks such escapes (see Section 3). As a result, the script never runs on the archived site, never sets a cookie or leaks it, and thus TrackingExcavator does not witness the associated tracking behavior. Also, embedded resources in a webpage archived by the Wayback Machine may occasionally have a timestamp far from the timestamp of the top-level page. Any of the above failures can lead to cascading failures, in that non-archived responses or blocked requests will result in the omission of any subsequent requests or cookie setting events that would have resulted from the success of the original request. The “wake” of a single failure cannot be measured within an archival dataset, because events following that failure are simply missing. To study the effect of these cascading failures, we must compare an archival run to a live run from the same time; we do so in the next subsection.
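
Two of the failure modes above, archive "escapes" and embedded resources whose timestamp is far from the top-level page, can be illustrated with a short Python sketch. This is not how TrackingExcavator is implemented. It assumes archived URLs have the form `https://web.archive.org/web/<14-digit timestamp>/<original URL>`, optionally with a two-letter modifier such as `js_` after the timestamp; the helper names `parse_wayback_url`, `is_archive_escape` and `timestamp_gap_days` are hypothetical.

```python
import re
from datetime import datetime
from urllib.parse import urlsplit

ARCHIVE_HOST = "web.archive.org"
# Assumed path layout: /web/<YYYYMMDDhhmmss>[<two-letter modifier>_]/<original URL>
_WAYBACK_PATH = re.compile(r"^/web/(\d{14})(?:[a-z]{2}_)?/(.+)$")


def parse_wayback_url(url):
    """Return (capture_time, original_url) for an archived URL, else None."""
    parts = urlsplit(url)
    if parts.hostname != ARCHIVE_HOST:
        return None
    match = _WAYBACK_PATH.match(parts.path)
    if match is None:
        return None
    capture_time = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    original = match.group(2)
    if parts.query:
        # urlsplit moved the original URL's query string out of the path.
        original += "?" + parts.query
    return capture_time, original


def is_archive_escape(request_url):
    """True when a request would leave the archive and hit the live web."""
    return urlsplit(request_url).hostname != ARCHIVE_HOST


def timestamp_gap_days(page_url, resource_url):
    """Whole days between the captures of a page and an embedded resource.

    Returns None when either URL is not a recognisable archived URL.
    """
    page = parse_wayback_url(page_url)
    resource = parse_wayback_url(resource_url)
    if page is None or resource is None:
        return None
    return abs(page[0] - resource[0]).days
```

A crawler following this idea would drop any request for which `is_archive_escape` is true, and could flag resources whose `timestamp_gap_days` exceeds some chosen threshold.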

Contributor

Again, when you are quoting work it is very important to use quotation marks and citations to make it clear that you have used the original authors' words. Alternatively you can re-write conclusions in your own words. Here are passages from the original paper that are too close, in my eyes, for you to claim as your own words:

Page 8, Sec 4.1

The Wayback Machine may fail to archive resources for any number of reasons. For example, the domain serving a certain resource may have been unavailable at the time of the archive, or changes in the Wayback Machine’s crawler may result in different archiving behaviors over time. As shown in Table 2, missing archives are rare.

Though the Wayback Machine’s archived pages execute the corresponding archived JavaScript within the browser when TrackingExcavator visits them, the Wayback Machine does not execute JavaScript during its archival crawls of the web. Instead, it attempts to statically extract URLs from HTML and JavaScript to find additional sites to archive. It then modifies the archived JavaScript, rewriting the URLs in the included script to point to the archived copy of the resource. This process may fail, particularly for dynamically generated URLs. As a result, when TrackingExcavator visits archived pages, dynamically generated URLs not properly redirected to their archived versions will cause the page to attempt to make a request to the live web, i.e., “escape” the archive. TrackingExcavator blocks such escapes (see Section 3). As a result, the script never runs on the archived site, never sets a cookie or leaks it, and thus TrackingExcavator does not witness the associated tracking behavior.

As others have documented [10], embedded resources in a webpage archived by the Wayback Machine may occasionally have a timestamp far from the timestamp of the top-level page.

Any of the above failures can lead to cascading failures, in that non-archived responses or blocked requests will result in the omission of any subsequent requests or cookie setting events that would have resulted from the success of the original request. The “wake” of a single failure cannot be measured within an archival dataset, because events following that failure are simply missing. To study the effect of these cascading failures, we must compare an archival run to a live run from the same time; we do so in the next subsection.

Longitudinal measurement study

After evaluating the Wayback Machine’s view into the past and developing best practices for using its data, the researchers used TrackingExcavator to conduct a longitudinal study of the third-party web tracking ecosystem from 1996-2016. They explored how this ecosystem has changed over time, including the prevalence of different web tracking behaviors, the identities and scope of popular trackers, and the complexity of relationships within the ecosystem. Among their findings, they identified the earliest tracker in the dataset in 1996 and observed the rise and fall of important players in the ecosystem (e.g., the rise of Google Analytics to appear on over a third of all popular websites). They also found that websites contact an increasing number of third parties over time (about 5% of the 500 most popular sites contacted at least 5 separate third parties in early 2000s, whereas nearly 40% do so in 2016) and that the top trackers can track users across an increasing percentage of the web’s most popular sites. They also found that tracking behaviors changed over time, e.g., that third-party popups peaked in the mid-2000s and that the fraction of trackers that rely on referrals from other trackers has recently risen.
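
The two kinds of figures mentioned above (the share of sites contacting at least 5 third parties, and how much of the web a single tracker can see) could be computed along the following lines. This is only a sketch under assumptions: it is not the paper's code, the function names are hypothetical, and the input is assumed to be a mapping from each site to the set of third-party domains it contacted.

```python
def fraction_with_at_least(third_parties_by_site, k=5):
    """Fraction of sites that contacted at least k distinct third parties."""
    if not third_parties_by_site:
        return 0.0
    hits = sum(
        1 for parties in third_parties_by_site.values() if len(set(parties)) >= k
    )
    return hits / len(third_parties_by_site)


def tracker_coverage(third_parties_by_site, tracker):
    """Fraction of sites on which the given third-party domain appears."""
    if not third_parties_by_site:
        return 0.0
    hits = sum(1 for parties in third_parties_by_site.values() if tracker in parties)
    return hits / len(third_parties_by_site)
```

Applied to the same input for each crawl year, these two functions would give the kind of year-by-year series that the longitudinal comparison relies on.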
Contributor

Again passages from the paper:

Page 3, Sec 1

After evaluating the Wayback Machine’s view into the past and developing best practices for using its data, we use TrackingExcavator to conduct a longitudinal study of the third-party web tracking ecosystem from 1996-2016 (Sections 5). We explore how this ecosystem has changed over time, including the prevalence of different web tracking behaviors, the identities and scope of popular trackers, and the complexity of relationships within the ecosystem. Among our findings, we identify the earliest tracker in our dataset in 1996 and observe the rise and fall of important players in the ecosystem (e.g., the rise of Google Analytics to appear on over a third of all popular websites). We find that websites contact an increasing number of third parties over time (about 5% of the 500 most popular sites contacted at least 5 separate third parties in early 2000s, whereas nearly 40% do so in 2016) and that the top trackers can track users across an increasing percentage of the web’s most popular sites. We also find that tracking behaviors changed over time, e.g., that third-party popups peaked in the mid-2000s and that the fraction of trackers that rely on referrals from other trackers has recently risen.

When taking such direct quotes, it's not appropriate to replace "we" with "they" or "the researchers" as it builds the impression that these words are your own.


Conclusion

Taken together, the "Internet Jones" paper #26 research findings show that third-party web tracking is a rapidly growing practice in an increasingly complex ecosystem, suggesting that users’ and policymakers’ concerns about privacy require sustained, and perhaps increasing, attention. The "Internet Jones" paper #26 research results also provide hitherto unavailable historical context for today’s technical and policy discussions. It is also stated in the "Internet Jones" paper #26 research that the Wayback Machine provides a unique and comprehensive source of historical web data. However, it was not created for the purpose of studying third-party web tracking and is thus imperfect for that use.
Contributor

Page 3, Sec 1

Taken together, our findings show that third-party web tracking is a rapidly growing practice in an increasingly complex ecosystem — suggesting that users’ and policy-makers’ concerns about privacy require sustained, and perhaps increasing, attention. Our results provide hitherto unavailable historical context for today’s technical and policy discussions.

Page 7, Sec 4

The Wayback Machine provides a unique and comprehensive source of historical web data. However, it was not created for the purpose of studying third-party web tracking and is thus imperfect for that use.

Again, in this case, you phrase it It is also stated in the Internet Jones" paper #26 research that Wayback Machine..... This is getting closer to attribution, but it's not clear where the quote starts and ends. You could say The Internet Jones paper notes that "the Wayback Machine provides a unique and comprehensive source of historical web data. However, it was not created for the purpose of studying third-party web-tracking and is thus imperfect for that use."

A further improvement on this would be to use the convention of referring to the authors rather than the paper e.g. Lerner et al. note that "the Wayback Machine....


Reference

Lerner, A., Simpson, A. K., Kohno, T., and Roesner, F. (2016). Internet Jones and the Raiders of the Lost Trackers: An Archaeological Study of Web Tracking from 1996 to 2016. University of Washington. Retrieved from https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/lerner