Natural Language Processing

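About me: a University of Manchester postgraduate, currently working at an automaker in Beijing, focused on AI including but not limited to NLP & CV, and on bringing it into industrial projects. For collaborations or competitions, contact [email protected]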

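This project collects day-to-day notes on NLP fundamentals: introductions, principles, interview notes, practical experience, frameworks, and applications. Contributions are welcome.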

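No.	Category	Project	Description	Link
1	Principles	N-gram	The probability of each word depends only on the preceding n - 1 words
2	Principles	word2vec	Word vectors	jmlr.csail.mit.edu/papers/volume3/bengio03a/bengio03a.pdf
3	Principles	NPLM	Neural probabilistic language model	bengio03a.dvi (mit.edu)
4	Principles	seq2seq	End-to-end neural network	Sequence to Sequence Learning with Neural Networks (arxiv.org)
5	Principles	attention	Attention mechanism	Attention Is All You Need (arxiv.org)
6	Model architecture	Transformer	The Transformer, the foundational architecture of large models	Attention Is All You Need (arxiv.org)
7	Model architecture	GPT	GPT-3	GPT-3: Its Nature, Scope, Limits, and Consequences
8	Model architecture	chatGPT	How to get from GPT to ChatGPT	Training language models to follow instructions with human feedback (arxiv.org)
9	Training framework	DeepSpeed	A one-stop framework from Microsoft for fast, large-scale training and inference; currently one of the most widely used training frameworks	microsoft/DeepSpeed: DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective. (github.com)
10	Application	Openai	OpenAI API	ChatGPT
11	Model architecture	BERT	Encoder-only (autoencoding) architecture; the most widely used language model in the early days	BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arxiv.org)
12	Fine-tuning	Lora	A widely used fine-tuning technique: freeze the base model and train added low-rank adapters to fit downstream tasks	LoRA: Low-Rank Adaptation of Large Language Models (arxiv.org)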
13	Fine-tuning	Ptuning_V1	Prepends several virtual [tokens] with updatable parameters to form a prompt template, which is fed in together with the input text	GPT Understands, Too (arxiv.org)
13	Fine-tuning	Ptuning_V2	Building on V1, trains a small prompt encoder (LSTM + MLP) to construct virtual tokens that carry no real semantic content	P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks (arxiv.org)
14	Inference framework	Xinference	A powerful, full-featured distributed inference framework	xorbitsai/inference: Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop. (github.com)
15	RAG	naive-RAG	Retrieval-augmented generation; currently the most mainstream production technique for reducing LLM hallucinations	Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (arxiv.org)
16	RAG	advanced-RAG	Ideas for enhanced retrieval
17	RAG	RAGAS	RAG performance evaluation framework	explodinggradients/ragas: Supercharge Your LLM Application Evaluations 🚀 (github.com)
18	Fine-tuning	QLora	Quantized version of LoRA fine-tuning	artidoro/qlora: QLoRA: Efficient Finetuning of Quantized LLMs (github.com)
19	Fine-tuning	Lora & QLora	LoRA & QLoRA fine-tuning tips	lightning.ai/pages/community/tutorial/lora-llm/
20	RAG framework	Dify	Deployment guide for a mainstream industrial RAG & agent framework	langgenius/dify: Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production. (github.com)
21	Model architecture	LLama1	SOTA open-source large model released by Meta	https://arxiv.org/pdf/2302.13971.pdf
22	Model architecture	LLama2	SOTA open-source large model released by Meta	2302.13971 (arxiv.org)
23	Model architecture	Mistral 7B	The first base model released by Mistral AI	2401.04088 (arxiv.org)
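24	Principles	layerNorm	Why NLP generally uses LayerNorm rather than BatchNorm
25	Fine-tuning	SFT-trick	Some practical tips for fine-tuning
26	Model architecture	Mixtral 8x7B	The first well-known large model with an MoE architecture	2401.04088 (arxiv.org)
27	Fine-tuning	Parameter	TrainingArguments parameter settings
28	Model architecture	LLama3	SOTA open-source large model released by Meta	meta-llama/llama3: The official Meta Llama 3 GitHub site
29	Application	gpt	Some handy GPT wrapper websites
30	Fine-tuning	trainer	transformers.Trainer parameter settings
31	RAG	advanced-RAG	Ideas for optimized retrieval
32	RAG	HippoRAG	HippoRAG combines large language models (LLMs), knowledge graphs, and the Personalized PageRank algorithm	HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models
33	RAG	FILCORAG	RAG enhanced by filtering the retrieved context	Learning to Filter Context for Retrieval-Augmented Generation (arxiv.org)
34	Fine-tuning	lora VS finetuning	Should you choose LoRA or full fine-tuning?	LoRA Learns Less and Forgets Less
35	Evaluation	MTEB	Leaderboard for evaluating embedding model performance	MTEB: Massive Text Embedding Benchmark (https://arxiv.org/abs/2210.07316)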
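36	RAG	HyKGERAG	Medical RAG from Peking University that incorporates knowledge graphs	HyKGE: A Hypothesis Knowledge Graph Enhanced Framework for Accurate and Reliable Medical LLMs Responses
37	RAG	RAFT	Combines retrieval of relevant documents with model fine-tuning to improve the model's reasoning within a specific domain	RAFT: Adapting Language Model to Domain Specific RAG (arxiv.org)
38	Model architecture	TimesFM	A decoder-only general-purpose foundation model designed for time-series forecasting	A decoder-only foundation model for time-series forecasting (arxiv.org)
39	Evaluation	CEval	A comprehensive evaluation suite for foundation models in a Chinese-language context	2305.08322
40	Prompt engineering	prompt	How to communicate with large models: prompt engineering	Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4
41	Principles	parameter	Interpreting large-model parameters
42	RAG	ClashEval	When retrieved content contains errors or harmful information, the model tends to side with the retrieved information rather than its own internal knowledge	ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence
43	RAG	VisualRAG	Vision + RAG pipeline
45	RAG	GraphRAG	Microsoft's open-source knowledge graph + RAG	microsoft/graphrag: A modular graph-based Retrieval-Augmented Generation (RAG) system
46	RAG	GraphRAG	GraphRAG quick start	microsoft/graphrag: A modular graph-based Retrieval-Augmented Generation (RAG) system
47	Fine-tuning	prompt-tuning	Differences between prompt tuning, instruction tuning, and chain-of-thought
48	Fine-tuning	DDP	Distributed training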
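49	Interview notes	LoRA interview notes	LoRA interview notes
50	Model architecture	OLMoE	The first open-source MoE large model	2409.02060
51	Interview notes	langchain interview notes	langchain interview notes
52	RAG	longCite	Helps large models locate citations in long texts	THUDM/LongCite: LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-context QA
53	Inference framework	Ollama	Large-model deployment framework	ollama/ollama: Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.
54	Training framework	Llama-factory	One-stop Chinese-language model training framework	hiyouga/LLaMA-Factory: Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
55	RAG framework	FastGPT	FastGPT is a knowledge-base Q&A system built on LLMs, offering out-of-the-box data processing, model invocation, and other capabilities	labring/FastGPT: FastGPT is a knowledge-based platform built on the LLMs, offers a comprehensive suite of out-of-the-box capabilities such as data processing, RAG retrieval, and visual AI workflow orchestration, letting you easily develop and deploy complex question-answering systems without the need for extensive setup or configuration.
56	Interview notes	RAG	RAG interview notes
57	Principles	RoPE	Rotary position embedding (RoPE) explained in detail
58	Principles	LLM	Building your own large model from 0 to 1	How to Train a Large Language Model from Scratch: A simple technical report (Zhihu)
59	Evaluation	evaluate	LLM evaluation guide	How to Train a Large Language Model from Scratch: A simple technical report (Zhihu)
60	RAG	text2vec	How to choose the chunk size and splitter	HuixiangDou/README_zh.md at main · InternLM/HuixiangDou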
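61	RAG	embedding	Fine-tuning embeddings
62	RAG	KAG	KAG aims to make full use of the strengths of knowledge graphs and vector retrieval, and to mutually enhance large language models and knowledge graphs in four respects, in order to address the challenges of RAG	OpenSPG/KAG: KAG is a knowledge-enhanced generation framework based on OpenSPG engine, which is used to build knowledge-enhanced rigorous decision-making and information retrieval knowledge services
63	Model architecture	Qwen2	Alibaba's Qwen2 series	QwenLM/Qwen2.5: Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud.
64	Model architecture	Qwen2 code	Qwen2 code walkthrough
65	Document parsing	MinerU	A free open-source solution for accurately parsing PDF documents
66	Prompt engineering	prompt	Prompt engineering collection
67	Prompt engineering	Promptim	LangChain releases Promptim, an automated prompt optimization tool: one-click optimization, multiplied efficiency	hinthornw/promptimizer: Prompt optimization scratch
68	Prompt engineering	De_Ai	One-click removal of AI-sounding tone
69	Inference framework	vllm	Distributed cluster deployment framework for large models	vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs
70	Langchain	Langchain_Summary	LangChain's approaches to summarizing long texts
71	RAG	Table_RAG
72	RAG	LazyGraphRAG	Microsoft's cost-effective next-generation GraphRAG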

Todo	RAG	Open Parse	Extracts text from PDF documents and automatically recognizes mixed text-and-table layouts	Filimoa/open-parse: Improved file parsing for LLM’s (github.com)

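Retrieval-augmented generation (RAG) has become one of the key techniques for giving large models access to large bodies of knowledge. However, handling semi-structured and unstructured data efficiently, especially tables inside documents, remains a major challenge for RAG systems.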

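To address this pain point, the article's author proposes a new approach to handling table data. The author first systematically reviews the core techniques for table handling in RAG systems, including table parsing and index structure design, and evaluates several existing open-source solutions. Building on that, the author presents their own contribution: using the Nougat tool to parse tables in documents accurately and efficiently, using a language model to summarize each table together with its caption, and finally building a new document summary index structure, with complete implementation code.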

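The advantages of this approach are that it parses tables effectively, fully accounts for the relationship between a table's summary and the table itself, and needs no multimodal LLM, which saves parsing cost. How the approach fares in further practical use remains to be seen.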
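
The document summary index idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the article author's implementation: Nougat is not actually called (the parsed table text is passed in directly), `summarize_table` is a hypothetical stand-in for the language-model summarization step, and a simple bag-of-words cosine similarity stands in for embedding-based retrieval. The point it shows is the structure: retrieval matches the query against the short summary, but the full table is what gets returned to the LLM.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class TableEntry:
    table_id: str
    caption: str
    table_text: str  # full parsed table (assumed to come from a parser such as Nougat)
    summary: str     # short summary, used only for matching queries


def summarize_table(caption: str, table_text: str) -> str:
    """Hypothetical stand-in for an LLM summarization call.

    It simply joins the caption with the table's header row."""
    header = table_text.strip().splitlines()[0]
    return f"{caption}. Columns: {header}"


def tokenize(text: str) -> list:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DocumentSummaryIndex:
    """Maps each table summary to its full table."""

    def __init__(self):
        self.entries = []

    def add_table(self, table_id: str, caption: str, table_text: str) -> None:
        summary = summarize_table(caption, table_text)
        self.entries.append(TableEntry(table_id, caption, table_text, summary))

    def retrieve(self, query: str, top_k: int = 1) -> list:
        # Score the query against summaries only, then return whole entries.
        q = Counter(tokenize(query))
        ranked = sorted(
            self.entries,
            key=lambda e: cosine(q, Counter(tokenize(e.summary))),
            reverse=True,
        )
        return ranked[:top_k]


index = DocumentSummaryIndex()
index.add_table(
    "t1",
    "Table 1: Model accuracy on benchmark datasets",
    "model | dataset | accuracy\nA | X | 0.91\nB | X | 0.88",
)
index.add_table(
    "t2",
    "Table 2: Training cost per model",
    "model | gpu_hours | cost_usd\nA | 120 | 300\nB | 90 | 225",
)

best = index.retrieve("Which model has the highest accuracy?")[0]
print(best.table_id)
print(best.table_text)  # the full table is what would be passed to the LLM
```

Keeping the summary and the table in one entry is what lets retrieval stay cheap (short text to match) while the answer step still sees every cell of the table.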

Victor94-king/NLP__ManVictor
