Data Mining and Machine Learning

L1. Information Retrieval
(Focus on Text Retrieval) Definition: finding, within a large collection of text, the information most relevant to a query.

## Chapter 1

Moore’s Law: Technology performance doubles and prices halve every 18 months.
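
As a rough illustration of the doubling claim, the sketch below computes the growth factor after a given number of months. It assumes an exact 18-month doubling period, and the function name `moore_factor` is hypothetical, not something from the lecture.

```python
def moore_factor(months, doubling_period=18):
    """Growth factor after `months`, assuming one doubling every `doubling_period` months."""
    return 2 ** (months / doubling_period)

# Under this assumption, performance grows 4x in 3 years,
# while the price falls to 1/4 of its starting value.
print(moore_factor(36))      # 4.0
print(1 / moore_factor(36))  # 0.25
```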
Disk capacity: 10^6 bytes = 1 MB, 10^9 bytes = 1 GB, 10^12 bytes = 1 TB.
What can you store on 1 TB?

1 TB = 10^12 bytes.

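The Python session below works this out for a stream of 16000*16 bits per second (presumably 16 kHz, 16-bit audio, although the notes do not say so): 1 TB holds roughly 362 days of it.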

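``````
Python 3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:37:50) [MSC v.1916 64 bit (AMD64)] on win32
>>> a = 16000 * 16                 # bits per second (assumed: 16,000 samples/s x 16 bits)
>>> b = 10**12 * 8                 # bits in 1 TB (10^12 bytes x 8)
>>> days = b / (a * 60 * 60 * 24)  # divide by bits per day
>>> print(days)
361.68981481481484
``````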

2. Why do we store these big corpora? It is a huge expense.

What are the problems?
This is a problem in semantics.

For example:
I saw the man on the hill with the telescope.
(What is the real meaning?)

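Database retrieval has the following characteristics:
First. Data characteristics: the data have attributes, structure, and relationships between items, and queries resemble how a person would ask.
Second. Structured, strict queries.
Third. Data must be updated promptly.
Fourth. A specific query yields a specific result.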
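
The "strict query, specific result" behaviour described above can be sketched with Python's built-in `sqlite3` module. The `books` table, its columns, and its rows are hypothetical and exist only for this illustration.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [
        ("Book A", "Smith", 2001),
        ("Book B", "Jones", 2010),
        ("Book C", "Smith", 2015),
    ],
)

# A strict, structured query: every row either satisfies it or does not.
rows = conn.execute(
    "SELECT title FROM books WHERE author = ? AND year > ? ORDER BY title",
    ("Smith", 2005),
).fetchall()
print(rows)  # [('Book C',)]
conn.close()
```

There is no notion of a "partly relevant" row here, which is the main contrast with text retrieval, where documents are ranked by how well they match.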

Mining: Digging deep into the earth to find hidden, valuable materials.

Data Mining: Analysis of large data corpora, that is, corpora which are too large for human inspection.

Information Retrieval Components

``````
Documents: Identify words which are ‘important’ for discriminating between documents, and how important they are.
Index: Specifies the relationships between these ‘keywords’ and the documents.
The query
Matching: Measuring the similarity between the query and each document.
Retrieved documents
Assessment and Relevance Feedback
``````
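
To make the document, index, query, and matching components concrete, here is a minimal sketch. The toy corpus, the `tokenize` and `retrieve` helpers, and the overlap-count similarity are all assumptions chosen for brevity, not the method taught in the lecture. Assessment and relevance feedback are left out.

```python
from collections import defaultdict

# Hypothetical toy corpus.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "stock prices fell sharply",
}

def tokenize(text):
    """Split text into lowercase words (a deliberately naive tokenizer)."""
    return text.lower().split()

# Index: map each word to the set of documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in tokenize(text):
        index[word].add(doc_id)

def retrieve(query):
    """Rank documents by the number of distinct query words they contain."""
    scores = defaultdict(int)
    for word in set(tokenize(query)):
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

print(retrieve("cat on the mat"))  # [('d1', 4), ('d2', 2)]
```

Note that "the" counts as much as "cat" in this sketch, which is exactly why a real system also needs a measure of how important each word is.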

Words (keywords; some words are more important than others)
Sentences (Grammar/Syntax): words arranged in a particular order, which helps understanding and disambiguation.
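
One common way to quantify "some words are more important" is inverse document frequency (IDF), which these notes do not name. The sketch below is an assumption about how such a weight could be computed; the corpus and the `idf` helper are hypothetical.

```python
import math

# Hypothetical toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock prices fell",
]

def idf(word, documents):
    """Inverse document frequency: log(N / number of documents containing the word)."""
    n_containing = sum(1 for d in documents if word in d.lower().split())
    if n_containing == 0:
        return 0.0  # convention chosen here for words that never occur
    return math.log(len(documents) / n_containing)

for w in ("the", "cat", "stock"):
    print(w, round(idf(w, docs), 3))
# the 0.0
# cat 0.405
# stock 1.099
```

A word that appears in every document ("the") gets weight 0 because it cannot discriminate between documents, while a rarer word ("stock") gets a higher weight.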