Lecture 13 Introduction to Web¶
Basic Items¶
(1) Web (World Wide Web): A collection of data and services
(2) The web is not the Internet
- The Internet describes how data is transported between servers and browsers
- We will study the Internet later in the networking unit
(3) Elements of the Web
- URLs: uniquely identify a piece of data on the web
- HTTP: the standard for how web browsers communicate with web servers
- Data on a webpage can contain:
  - HTML: A markup language for creating webpages
  - CSS: A style sheet language for defining the appearance of webpages
  - JavaScript: A programming language for running code in the web browser
URLs¶
URL (Uniform Resource Locator): A string that uniquely identifies one piece of data on the web
A URL contains:
- Scheme
- Domain
- Location
- Path
- Query
- Fragment
We will introduce them one by one :)
Scheme¶
Scheme: the protocol used to retrieve the data
- Located just before the double slashes
- Defines how to retrieve the data over the Internet (namely, which Internet protocol to use)
- Common schemes: http (unencrypted) or https (secure, encrypted)
For example, in a URL that begins with https://, the scheme is https, so the data is retrieved using the HTTPS protocol.
Domain¶
Domain: the name of the web server
- Located after the double slashes, but before the next single slash
- Defines which web server to contact
- Written as several phrases separated by dots
For example, in https://toon.cs161.org/xorcist/avian.html, the domain is toon.cs161.org.
Location¶
Location: The domain with some additional information, identifying where the resource lives (server-level)
The location can carry two kinds of additional information:
- Username: evanbot@cs161.org
  - Identifies one specific user on the web server
  - Rarely seen
- Port: toon.cs161.org:4000
  - Identifies one specific application on the web server
  - We will see ports again in the networking unit
For example, in a URL that starts with https://toon.cs161.org:4000/, the location is toon.cs161.org:4000.
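A minimal sketch of how a location breaks down into username, domain, and port, using Python's standard urllib.parse module. The URL is a made-up combination of the username and port examples above, not one taken from the lecture.

```python
from urllib.parse import urlsplit

# Hypothetical URL combining the username and port examples above.
parts = urlsplit("https://evanbot@toon.cs161.org:4000/")

print(parts.netloc)    # the full location: username, domain, and port
print(parts.username)  # the user on the web server
print(parts.hostname)  # the domain
print(parts.port)      # the port, as an integer
```

When a URL has no username or port, the corresponding attribute is None.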
Path¶
Path: which file on the server to fetch (file-level)
- Located after the first single slash
- Defines which file on the web server to fetch
  - Think of the web server as having its own filesystem
  - The path represents a file path on the web server's filesystem
- Examples
  - https://toon.cs161.org/xorcist/avian.html : Look in the xorcist folder for avian.html
  - https://toon.cs161.org/ : Return the root directory /
Query¶
Query: arguments passed to the web server
- Providing a query is optional
- Located after a question mark
- Supplies arguments to the web server for processing
  - Think of the web server as offering a function at a given path
  - To access this function, a user makes a request to the path, with some arguments in the query
  - The web server runs the function with the user's arguments and returns the result to the user
- Form: Arguments are supplied as name=value pairs, separated with ampersands (&)
For example, suppose the query character=evan&size=big is sent to a path that offers a draw function. The request then means: run draw with character equal to evan and size equal to big.
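The name=value form can be sketched with Python's standard urllib.parse module. The /draw path in the URL below is an assumption made for illustration; only the query string itself comes from the notes.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URL: the /draw path is assumed for illustration.
url = "https://toon.cs161.org/draw?character=evan&size=big"

query = urlsplit(url).query  # everything after the question mark
args = parse_qs(query)       # maps each name to a list of its values

print(query)
print(args)
```

Each name maps to a list because the same name may appear more than once in a query.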
Fragment¶
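Fragment: a local navigation marker. It does not interact with the server; instead it tells the browser where to find specific content within the page, or passes arguments to JavaScript code.
- Providing a fragment is optional
- Located after a hash sign (#)
- Not sent to the web server! Only used by the web browser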
- Common usage: Tells the web browser to scroll to a part of a webpage
- Usage: Supplies content to code in the web browser (JavaScript) without sending the content to the server
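A small sketch of separating the fragment from the rest of a URL, which is roughly what a browser does before contacting the server. It uses urldefrag from Python's standard library; the fragment name "top" is made up for illustration.

```python
from urllib.parse import urldefrag

# Hypothetical URL; the fragment "top" is invented for this example.
url = "https://toon.cs161.org/xorcist/avian.html#top"

# urldefrag returns the URL without its fragment, plus the fragment itself.
url_without_fragment, fragment = urldefrag(url)

print(url_without_fragment)  # the part used to make the request
print(fragment)              # stays in the browser
```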
URL Escaping
URLs have special characters (?, #, /)
What if we want to use a special character in the URL?
Solution: URL encoding (also called percent-encoding)
- Notation: Percent sign (%) followed by the two-digit hexadecimal value of the character
- Example:
  - %20 = ' ' (space)
  - %23 = '#' (hash sign)
  - %32 = '2' (printable characters can also be encoded)
URL encoding raises a security issue: the same URL can be written in many different encoded forms, which makes scanning for malicious URLs harder
We will talk about this later
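A quick way to check encodings is Python's standard urllib.parse module. The sketch below encodes two special characters and decodes two encoded strings; the last line hints at why scanning is harder, since an encoded path decodes to the same text as the plain one.

```python
from urllib.parse import quote, unquote

space = quote(" ")             # encode a space
hash_sign = quote("#")         # encode a hash sign
digit = unquote("%32")         # decode an encoded printable character
folder = unquote("%78orcist")  # one encoded letter, same folder name

print(space, hash_sign, digit, folder)
```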
Summary of URL¶
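As a recap, here is a sketch that splits one URL into the parts covered in this section, using Python's standard urllib.parse module. The URL is hypothetical, assembled from the examples in these notes; the /draw path and the fragment "top" are assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical URL assembled from the examples in this section.
url = "https://toon.cs161.org:4000/draw?character=evan&size=big#top"
parts = urlsplit(url)

summary = {
    "scheme": parts.scheme,      # before the double slashes
    "domain": parts.hostname,    # which web server to contact
    "location": parts.netloc,    # domain plus username/port, if any
    "path": parts.path,          # which file or function on the server
    "query": parts.query,        # arguments after the question mark
    "fragment": parts.fragment,  # after the hash sign, browser only
}

for name, value in summary.items():
    print(f"{name:9} {value}")
```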
HTTP¶
- HTTP (Hypertext Transfer Protocol): A protocol used to request and retrieve data from a web server
- HTTPS: A secure version of HTTP
- HTTP is a request-response model
- Web Browser sends a request to a Web Server
- Web Server processes the request and sends a response back to the Web Browser
Components of HTTP Request¶
- URL Path (possibly with query parameters)
- Method:
  - GET: "get" info from the server, don't change server-side state
  - POST: "post" info to the server, update server-side state
- Data:
  - GET requests do not contain any data
  - POST requests can contain data
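To make the three components concrete, here is a simplified sketch of what an HTTP/1.1 request looks like as text. The helper build_request is hypothetical, and the Host, Content-Type, and Content-Length headers are details beyond what these notes cover; real browsers send many more headers.

```python
def build_request(method, host, target, body=""):
    """Return the text of a simplified HTTP/1.1 request."""
    lines = [f"{method} {target} HTTP/1.1", f"Host: {host}"]
    if body:
        lines.append("Content-Type: application/x-www-form-urlencoded")
        lines.append(f"Content-Length: {len(body.encode())}")
    # A blank line separates the headers from the (possibly empty) data.
    return "\r\n".join(lines) + "\r\n\r\n" + body

# GET: the arguments travel in the query, and there is no data.
get_request = build_request(
    "GET", "toon.cs161.org", "/draw?character=evan&size=big")

# POST: the same arguments travel as data after the headers.
post_request = build_request(
    "POST", "toon.cs161.org", "/draw", "character=evan&size=big")

print(get_request)
print(post_request)
```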
Components of HTTP Response¶
- Status Code: indicating what happened with the request
- 200: OK
- 403: Access Forbidden
- 404: Page Not Found
- Data:
- Can be a webpage, an image, a PDF, ...
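A response begins with a status line carrying the status code. The sketch below parses such a line and looks up the standard reason phrase with Python's http.HTTPStatus; the helper parse_status_line and the sample lines are made up for illustration.

```python
from http import HTTPStatus

def parse_status_line(line):
    """Split a status line such as 'HTTP/1.1 404 Not Found'."""
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason

# Sample status lines, invented for this example.
for line in ["HTTP/1.1 200 OK",
             "HTTP/1.1 403 Forbidden",
             "HTTP/1.1 404 Not Found"]:
    version, code, reason = parse_status_line(line)
    # HTTPStatus maps a numeric code to its standard phrase.
    print(code, HTTPStatus(code).phrase)
```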
Parts of a Webpage¶
- HTML: A markup language to create structured documents
- CSS: A style sheet language
- JavaScript: A programming language for running code in the web browser
  - Client-side: runs in the browser, not on the server
  - Can manipulate HTML and CSS, which makes webpages more interactive
How to render a webpage
- Browser receives HTML, CSS and JavaScript from Server
- HTML and CSS are parsed into a DOM
- JavaScript is interpreted and executed, possibly modifying the DOM
- The painter uses the DOM to draw the webpage
DOM
DOM (Document Object Model):
- Cross-platform model for representing and interacting with objects in HTML
- Cross-platform and language-independent interface that treats an XML or HTML document as a tree structure
- Each node in this tree has a tag, attributes, and child nodes
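The tree structure can be sketched with xml.dom.minidom from Python's standard library. This is only an approximation: minidom is a strict XML parser, while browsers use a far more lenient HTML parser, and the tiny page below is made up for illustration.

```python
from xml.dom.minidom import parseString

# A tiny, well-formed page invented for this example.
page = '<html><body><h1 id="title">Hello</h1><p>World</p></body></html>'
dom = parseString(page)

def show(node, depth=0):
    """Print each element's tag and attributes, indented by tree depth."""
    if node.nodeType == node.ELEMENT_NODE:
        attributes = dict(node.attributes.items())
        print("  " * depth + node.tagName, attributes)
    for child in node.childNodes:
        show(child, depth + 1)

show(dom.documentElement)
```

Text such as "Hello" is stored as child text nodes; the sketch visits them but prints only element nodes.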
Risks on the Web¶
- Risk #1: Web servers should be protected from unauthorized access
  - Protection: Server-side security
- Risk #2: A malicious website should not be able to damage our computer
  - Protection: Sandboxing
    - JS is not allowed to access files on our computer
    - Review: Privilege Separation / Least Privilege
- Risk #3: A malicious website should not be able to tamper with our interactions with other websites
  - Protection: Same-Origin Policy: The web browser prevents a webpage from accessing data from other, unrelated websites
Sandboxing
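In web development, sandboxing is a security mechanism that restricts what running code is allowed to do, in order to prevent malicious or unauthorized operations. It is typically used to protect the browser or application from potential security threats by ensuring that code can only run in a controlled, isolated environment.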
Same-Origin Policy¶
Same-Origin Policy: A rule that prevents one website from tampering with another website
- Trait: Enforced by the web browser
- Principle: Two webpages have the same origin if and only if the protocol, domain, and port of their URLs all match exactly
  - If no port is specified, the default is 80 for HTTP and 443 for HTTPS
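The matching rule can be sketched as a comparison of (protocol, domain, port) triples. The helpers origin and same_origin are hypothetical names, and this is a simplified illustration rather than how any particular browser implements the check.

```python
from urllib.parse import urlsplit

# Default ports used when the URL does not specify one.
DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url):
    """Return the (protocol, domain, port) triple for a URL."""
    parts = urlsplit(url)
    port = parts.port
    if port is None:
        port = DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)

def same_origin(url_a, url_b):
    """Two URLs share an origin only if all three parts match exactly."""
    return origin(url_a) == origin(url_b)

# Same origin: the explicit port 443 equals the HTTPS default.
print(same_origin("https://toon.cs161.org/a", "https://toon.cs161.org:443/b"))
# Different protocol, different domain, different port.
print(same_origin("http://toon.cs161.org/", "https://toon.cs161.org/"))
print(same_origin("https://toon.cs161.org/", "https://cs161.org/"))
print(same_origin("http://toon.cs161.org/", "http://toon.cs161.org:4000/"))
```

Note that the path, query, and fragment play no part in the comparison.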