Data Science Archive 45

Data Science Archive

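Course materials for Yandex's NLP course. Yandex is a strong Russian company and is also behind CatBoost and ClickHouse.
link: https://github.com/yandexdataschool/nlp_course
The parent organization is worth a look too: https://github.com/yandexdataschool
It appears to host their open data science courses, so it is worth following.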

649 views, 小熊猫, edited 06:13

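A GBM experiment: a benchmark of a pure Python + numba JIT implementation against LightGBM, a GBT implementation optimized with efficient histogram binning. I tried it, and the code on the master branch now seems to be nearly on par with LightGBM; the project is updated fairly actively.
code: https://github.com/ogrisel/pygbm
About numba JIT: http://numba.pydata.org/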
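
To make "histogram binning" concrete, here is a minimal pure-Python sketch of histogram-based split finding for a single feature. It is not code from pygbm or LightGBM; the function names, the rank-based bin edges, and the squared-error gain with unit hessians are all assumptions chosen for illustration.

```python
from bisect import bisect_right


def make_bin_edges(values, n_bins):
    """Pick up to n_bins - 1 thresholds at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    edges = []
    for k in range(1, n_bins):
        edge = ordered[k * len(ordered) // n_bins]
        if not edges or edge > edges[-1]:  # skip duplicate thresholds
            edges.append(edge)
    return edges


def best_split(values, gradients, n_bins=4):
    """Return (threshold, gain) of the best split "value < threshold".

    Assumes squared-error loss with unit hessians, so the gain is
    G_left^2 / n_left + G_right^2 / n_right - G_total^2 / n_total.
    """
    edges = make_bin_edges(values, n_bins)
    n = len(edges) + 1
    grad_sum = [0.0] * n
    count = [0] * n
    # One pass over the data fills the histogram.
    for v, g in zip(values, gradients):
        b = bisect_right(edges, v)  # index of the bin that holds v
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(grad_sum), len(values)
    best = (None, 0.0)
    left_g, left_n = 0.0, 0
    # Scanning the bins (not the samples) evaluates every candidate split.
    for b in range(n - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n - total_g ** 2 / total_n
        if gain > best[1]:
            best = (edges[b], gain)
    return best


values = [1, 2, 3, 4, 10, 11, 12, 13]
gradients = [-1, -1, -1, -1, 1, 1, 1, 1]
threshold, gain = best_split(values, gradients)
print(threshold, gain)
```

The point of the histogram is that the split scan costs one step per bin rather than one per sample, which is where most of the speedup in this family of implementations comes from.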

635 views, 小熊猫, edited 06:20

A popular-science article introducing the Wasserstein distance; it explains a deep topic in simple terms and is very well written. link: http://www.mindcodec.com/an-intuitive-guide-to-optimal-transport-for-machine-learning/
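
As a small worked example of the idea, the sketch below computes the Wasserstein-1 distance in the simplest setting: two one-dimensional samples of equal size with uniform weights, where the optimal transport plan just matches sorted values. The function name is hypothetical and the code is not taken from the article.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D samples with uniform weights.

    In one dimension the optimal plan pairs the i-th smallest point of xs
    with the i-th smallest point of ys, so the distance is the mean gap.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


print(wasserstein_1d([0, 1, 3], [5, 6, 8]))
```

Shifting every point by the same amount moves the distance by exactly that amount, which is the "cost of moving earth" intuition in its plainest form.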

658 views, 小熊猫, edited 07:36

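An introductory reinforcement learning course. From a quick look the quality is decent and it is fairly systematic; the code covers the details of the basic RL algorithms, and there are accompanying videos with a tolerable accent.
slides: http://pages.isir.upmc.fr/~sigaud/teach/english.html
code: https://github.com/osigaud/rl_labs_notebooks
The videos are short, brief introductions of ten-odd minutes each.
video: https://www.youtube.com/watch?v=9gzL3QQzvQ4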
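
For a flavor of what "basic RL algorithms" look like in code, here is a minimal value iteration sketch on a toy chain MDP. It is not taken from the course notebooks; the environment (a row of states with a reward of 1 for reaching the last one) and all names are assumptions made for illustration.

```python
def value_iteration(n_states=5, gamma=0.9, tol=1e-9):
    """Value iteration on a deterministic chain of states 0..n_states-1.

    Actions move one step left or right (walls clamp the move).
    Entering the last state yields reward 1 and ends the episode.
    """
    values = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states - 1):  # the last state is terminal
            candidates = []
            for move in (-1, +1):
                nxt = min(max(s + move, 0), n_states - 1)
                reward = 1.0 if nxt == n_states - 1 else 0.0
                candidates.append(reward + gamma * values[nxt])
            best = max(candidates)  # Bellman optimality backup
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


print(value_iteration())
```

Each state's value ends up as gamma raised to the number of extra steps needed to reach the goal, which is an easy sanity check when experimenting with the discount factor.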

704 views, 小熊猫, 08:00

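An introduction to QTE/ATE (quantile and average treatment effects) and Local ATE from Uber Engineering, with a good deal of data science thinking from a product perspective.
link: https://eng.uber.com/analyzing-experiment-outcomes/
I also came across an introduction to Local ATE on Zhihu: https://www.zhihu.com/question/32199571/answer/55792738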
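
The difference between ATE and QTE can be sketched with naive estimators, assuming a randomized experiment so that plain differences between groups are meaningful. This is not the method from the Uber article; the function names and the toy data are made up for illustration.

```python
from statistics import mean, quantiles


def average_treatment_effect(control, treatment):
    """Naive ATE estimate: difference of group means (treatment minus control)."""
    return mean(treatment) - mean(control)


def quantile_treatment_effect(control, treatment, n=4):
    """Naive QTE estimate: difference of the n-quantile cut points of each group."""
    qc = quantiles(control, n=n, method="inclusive")
    qt = quantiles(treatment, n=n, method="inclusive")
    return [t - c for c, t in zip(qc, qt)]


# Toy data: the treatment only changes the single largest outcome.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
treatment = [1.0, 2.0, 3.0, 4.0, 15.0]

print(average_treatment_effect(control, treatment))
print(quantile_treatment_effect(control, treatment))
```

On this toy data the mean shifts by 2 while all three quartiles stay put, which is the kind of situation where looking at quantiles tells a different product story than the average alone.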

735 views, 小熊猫, edited 08:27

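An ML extension package that pairs nicely with scikit-learn. I have used it before; its main strengths are ensembling and wrappers for common application-level tasks, since scikit-learn itself carries quite a few rarely used methods.
link: http://rasbt.github.io/mlxtend/
The author teaches in the statistics department at the University of Wisconsin-Madison and also wrote the book "Python Machine Learning".
book: https://www.amazon.com/Python-Machine-Learning-Sebastian-Raschka/dp/1783555130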

790 views, 小熊猫, 17:00

An example of doing EDA in R (slides titled "Data Science? Make it Spatial"); the author is from UChicago. https://angela-li.github.io/slides/2018-11-08/dc-r-presentation#1

702 views, 小熊猫, 17:09

flexdashboard, an R package for building interactive dashboards and visualizations in RStudio. Worth a try if you use RStudio; with Jupyter it seems less necessary. https://blog.rstudio.com/2016/05/17/flexdashboard-easy-interactive-dashboards-for-r/

706 views, 小熊猫, edited 17:17

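A tool stack for deploying and operating ML systems in production: model storage, data pipelines, ETL, feature engineering, and various performance optimizations. It collects many tools that are practical from an engineering standpoint.
link: https://github.com/EthicalML/awesome-machine-learning-operations
The author also gave a fairly short talk at EuroSciPy 2018: https://axsauze.github.io/scalable-data-science/#/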

706 views, 小熊猫, 21:41

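cuDF: a GPU DataFrame library with a pandas-like API, from rapids.ai. I thought NVIDIA had a similar project, but I searched for a while just now and could not find it. (RAPIDS is itself an open-source initiative launched by NVIDIA.)
link: https://github.com/rapidsai/cudf
The team has other good projects as well, such as cuML, cuGraph, and visualization tools. They may be aiming to build a GPU data science ecosystem, so it is worth following.
Team homepage: https://rapids.ai/
Team projects: https://github.com/RAPIDSai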

706 views, 小熊猫, 22:04

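The XNLI dataset: the same kind of task as the earlier MultiNLI (MNLI) but covering more languages, although the additional languages were produced by translation. It also appears among the examples in the official implementation of Google's pre-trained BERT project. https://code.fb.com/ai-research/xlni/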

699 views, 小熊猫, 22:08

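A markdown project that tracks progress across NLP subfields. Its definition of progress is sound: everything is tied to a specific public dataset and the corresponding metrics, which makes it a good fit for someone just entering a field. I skimmed text classification and summarization, and the coverage is fairly systematic. Unfortunately, it does not mention the tricks that are specific to (and taken for granted in) each field.
link: https://github.com/sebastianruder/NLP-progress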

704 views, 小熊猫, 22:14

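An unsupervised statistical machine translation system from EMNLP 2018, with a BLEU score of 26.2 on WMT14, which is quite good. I do not know the translation field well: I have some hands-on experience with NMT, but understand almost nothing of traditional statistical MT.
Looking at the project's requirements, I spotted Moses, which seems to be an important tool from the earlier, traditional SMT era? (I last saw it in a project translating Classical Chinese into modern Chinese.)
code: https://github.com/artetxem/monoses
link: https://arxiv.org/abs/1809.01272
Moses: http://www.statmt.org/moses/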

718 views, 小熊猫, 22:23

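An example of feature engineering with featuretools (ft). The tool is decent, and I used it in a recent Kaggle competition. In an unfamiliar domain with categorical data, using ft to extract a batch of higher-order combination features first and then running a baseline works well.
However, the tool has quite a few tricky parameters, and its run time is also fairly high.
link: https://medium.com/@rrfd/simple-automatic-feature-engineering-using-featuretools-in-python-for-classification-b1308040e183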

734 views, 小熊猫, edited 04:52

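A short article that quickly reviews statistical concepts; the examples are good and it is well written. It covers the Bayesian and frequentist schools, the null hypothesis, Type I and Type II errors, and the p-value.
link: https://towardsdatascience.com/statistics-for-people-in-a-hurry-a9613c0ed0b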
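
As a concrete instance of the p-value concept, here is a minimal sketch of an exact two-sided binomial test for coin flips. The example and the function name are assumptions for illustration and are not taken from the article.

```python
from math import comb


def binomial_p_value(successes, trials, p_null=0.5):
    """Exact two-sided p-value for a binomial outcome under the null hypothesis.

    Sums the null probabilities of every outcome that is no more likely
    than the one observed.
    """
    def pmf(k):
        return comb(trials, k) * p_null ** k * (1 - p_null) ** (trials - k)

    observed = pmf(successes)
    # The small relative tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))


# 9 heads in 10 flips of a supposedly fair coin.
print(binomial_p_value(9, 10))
```

With 9 heads in 10 flips, the outcomes at least as extreme are 0, 1, 9, and 10 heads, so the p-value is 22/1024; rejecting a true null hypothesis at that threshold would be a Type I error.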

767 views, 小熊猫, edited 06:34

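Sebastian Raschka has finally finished part 4 of his blog series "Model evaluation, model selection, and algorithm selection in machine learning". It covers in great detail the aspects that need to be considered when evaluating models, and it requires some statistics background.
The first three installments were written two years ago, and I learned a lot from them at the time. Teachers with a strong statistics background look at models and algorithms from a somewhat different angle. Highly recommended.
link: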
1. https://sebastianraschka.com/blog/2016/model-evaluation-selection-part1.html
2. https://sebastianraschka.com/blog/2016/model-evaluation-selection-part2.html
3. https://sebastianraschka.com/blog/2016/model-evaluation-selection-part3.html
4. https://sebastianraschka.com/blog/2018/model-evaluation-selection-part4.html
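
One of the standard evaluation techniques in this area is k-fold cross-validation. The sketch below only shows how the index splits can be generated in plain Python; it is not code from the blog series, and the function name and the fixed seed are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Indices are shuffled once, then dealt into k folds of near-equal size.
    Each fold serves as the test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(10, 3):
    print(len(train), len(test))
```

Averaging a model's score over the k test folds gives a less noisy performance estimate than a single holdout split, at the cost of training the model k times.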

A single-PDF version of Model Evaluation parts 1-4 is available on arXiv: https://arxiv.org/abs/1811.12808

793 views, 小熊猫, edited 06:48

A Chrome extension that opens a GitHub-hosted notebook in Colab with one click: https://chrome.google.com/webstore/detail/open-in-colab/iogfkhleblhcpcekbiedikdehleodpjo/related

761 views, 小熊猫, 14:04

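PCam: a histopathology image dataset. It is not large, so it can be used to run benchmarks on a single GPU. Texture-like images of this kind may behave somewhat differently from other classification tasks; the recent salt-identification competition on Kaggle is a useful reference.
link: http://basveeling.nl/posts/pcam/
github: https://github.com/basveeling/pcam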

779 views, 小熊猫, 14:09

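A Python script that automatically draws network architecture diagrams; besides the usual formats, it can even output pptx. It supports common layers such as convolution, deconvolution, max/average/global pooling, and dense.
link: https://github.com/yu4u/convnet-drawer
It is also a sister project of draw_convnet.
link: https://github.com/gwding/draw_convnet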

785 views, 小熊猫, edited 04:10

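PyCM: a tool for multi-class confusion matrix analysis. It may come in handy for evaluating results on specific classification problems, although the built-in scikit-learn function https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html has been mostly sufficient for me so far. From a quick look, PyCM supports a wider range of storage formats and more statistical measures.
link: http://www.shaghighi.ir/pycm/
github: https://github.com/sepandhaghighi/pycm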
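
For readers new to the idea, here is a minimal pure-Python sketch of a multi-class confusion matrix with per-class precision and recall. It uses neither PyCM nor scikit-learn; the function names and the toy labels are assumptions for illustration. Rows are actual classes and columns are predicted classes.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (labels, matrix) where matrix[i][j] counts actual=labels[i], predicted=labels[j]."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


def per_class_precision_recall(labels, matrix):
    """Return {label: (precision, recall)} computed from the confusion matrix."""
    stats = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)  # column sum
        actual_total = sum(matrix[i])                    # row sum
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        stats[label] = (precision, recall)
    return stats


actual = ["cat", "cat", "dog", "dog", "bird", "bird"]
predicted = ["cat", "dog", "dog", "dog", "bird", "cat"]
labels, matrix = confusion_matrix(actual, predicted)
print(labels)
print(matrix)
print(per_class_precision_recall(labels, matrix))
```

Dedicated libraries add many more statistics on top of this matrix, but the per-class numbers above are usually the first thing to inspect on an imbalanced multi-class problem.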

799 views, 小熊猫, 04:18