

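A curated list of resources on feature engineering techniques for machine learning, covering methods and tools for numeric, textual, image, categorical, time series, and other data types.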



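Introduction to the Awesome Feature Engineering Project

Project Overview

Awesome Feature Engineering is a curated list of resources on feature engineering techniques for machine learning. The project is maintained by Andrei Khobnia and is released under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

The list gives machine learning practitioners a single reference for feature engineering, organized by data type, with methods and tools for each.

Main Content Categories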

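1. Numeric Data

Data transformations:
- Box-Cox transformation: scipy.stats.boxcox
- Log transformation: np.log(x + const)
Automated feature engineering:
- Featuretools: a library for automated feature engineering
Feature interactions:
- sklearn.preprocessing.PolynomialFeatures: generates polynomial features
- Division (ratio features)
- Other interaction features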
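
The numeric techniques listed above can be sketched in a few lines. This is a minimal illustration, not code from the list itself: the sample values, the offset `const = 1.0`, and the two-column matrix `X` are made up for the example.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical, strictly positive sample values (Box-Cox requires x > 0).
x = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 200.0])

# Box-Cox: returns the transformed data and the fitted lambda.
x_boxcox, fitted_lambda = stats.boxcox(x)

# Log transform with a constant offset, so that zeros would not break it.
const = 1.0
x_log = np.log(x + const)

# Degree-2 polynomial and interaction features for two columns a and b.
X = np.array([[2.0, 3.0], [4.0, 5.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: a, b, a^2, a*b, b^2

# Ratio (division) feature, guarding against division by zero.
a, b = X[:, 0], X[:, 1]
ratio = np.divide(a, b, out=np.zeros_like(a), where=b != 0)
```

Box-Cox only accepts strictly positive input, which is one reason the log transform is usually written with an added constant.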

2. Textual Data

Bag-of-words model:
- Bag-of-words model
- A Gentle Introduction to the Bag-of-Words Model
- sklearn.feature_extraction.text.CountVectorizer
- sklearn.feature_extraction.DictVectorizer
- sklearn.feature_extraction.FeatureHasher
Word embeddings:
Feature extraction techniques:
- ClearTK - Feature Extraction Tutorial
- Regular expressions
- Part-of-speech tagging
- NLTK Categorizing and Tagging Words
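
As a rough sketch of how the three scikit-learn vectorizers in the bag-of-words entries differ, the example below runs each on toy input. The documents, the `records` dictionaries, and the choice of `n_features=8` are illustrative assumptions.

```python
from sklearn.feature_extraction import DictVectorizer, FeatureHasher
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents.
docs = ["the cat sat on the mat", "the dog sat"]

# CountVectorizer: learns a vocabulary and counts token occurrences.
count_vec = CountVectorizer()
bow = count_vec.fit_transform(docs)
# Vocabulary terms ordered by their column index.
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

# DictVectorizer: turns feature dictionaries into a matrix;
# string values are one-hot encoded, numeric values are kept as-is.
records = [{"pos": "NOUN", "length": 3}, {"pos": "VERB", "length": 3}]
dict_vec = DictVectorizer(sparse=False)
dict_features = dict_vec.fit_transform(records)

# FeatureHasher: maps tokens to a fixed number of columns without
# storing a vocabulary (collisions are possible with few columns).
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([doc.split() for doc in docs])
```

`CountVectorizer` and `DictVectorizer` keep a mapping from features to columns, so results stay interpretable. `FeatureHasher` trades that away for a fixed memory footprint.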

3. Image Data

Traditional feature extraction:
Deep learning feature extraction:
- Keras pre-trained models feature extraction
- Using Keras' Pre-trained Models for Feature Extraction in Image Clustering

4. Categorical Data

One-hot encoding:
- Why One-Hot Encode Data in Machine Learning?
- How to One Hot Encode Sequence Data in Python
- sklearn.preprocessing.OneHotEncoder
- Keras - to_categorical
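
One-hot encoding with `sklearn.preprocessing.OneHotEncoder` might look like the sketch below. The `colors` column and the unseen value `"purple"` are made-up inputs, and `handle_unknown="ignore"` is one possible policy for unseen categories, not a recommendation from the list.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column with three distinct values.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
one_hot = encoder.fit_transform(colors).toarray()

# Categories are sorted, so the columns are: blue, green, red.
categories = encoder.categories_[0].tolist()

# A category that was not present during fitting.
unseen = encoder.transform([["purple"]]).toarray()
```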
Target encoding:
Feature hashing:

5. Time Series Data

Automatic feature extraction:
- Automatic extraction of relevant features from time series
- Basic Feature Engineering With Time Series Data in Python
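
Basic time series feature engineering usually means date-part, lag, and rolling-window features. The sketch below builds them by hand with plain pandas and is not the automatic extraction approach from the first resource. The daily series and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical daily series; 2024-01-01 is a Monday.
index = pd.date_range("2024-01-01", periods=6, freq="D")
series = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 18.0], index=index)

features = pd.DataFrame({"value": series})

# Date-part feature: Monday=0 ... Sunday=6.
features["day_of_week"] = features.index.dayofweek

# Lag feature: the previous day's value.
features["lag_1"] = features["value"].shift(1)

# Rolling mean of the three previous values; shifting first keeps
# the current value out of its own feature (avoids leakage).
features["rolling_mean_3"] = features["value"].shift(1).rolling(window=3).mean()

# Rows without enough history contain NaN and are dropped here.
features = features.dropna()
```

Shifting before computing the rolling window means each row only uses information available before its own timestamp.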

6. Geospatial Data

Covers feature engineering techniques for location-based data.

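Project Highlights

Comprehensive: covers the main data types used in machine learning and the feature engineering techniques for each
Practical: points to specific libraries and code implementations
Open: published under an open license, with community contributions welcome
Authoritative: links to authoritative documentation, tutorials, and academic resources
Actionable: names specific Python libraries and function calls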

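Who It Is For

The project is especially useful for:

Machine learning engineers
Data scientists
Feature engineering researchers
Machine learning beginners
Practitioners who want to improve model performance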

Contributing

The project encourages community contributions. New resources and improvements to existing content can be submitted as pull requests.

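Summary

The Awesome Feature Engineering project is a comprehensive, practical resource collection for feature engineering in machine learning and a useful reference for learning and applying these techniques. Its systematic categorization and extensive links help practitioners quickly find feature engineering methods suited to a given data type.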