AI and Language: Basic topics to learn, use for NLP and train - Brain
This article presents essential information for language acquisition for both humans and computer software, particularly artificial intelligence (AI) and natural language processing (NLP) implementations. It defines key topics in a relevant order, outlines required information, and suggests actionable steps for self-learning and model training.
The primary focus is on leveraging data and patterns, and on identifying the most effective methods for learning a new language or for developing and training a model for AI or NLP purposes.
This resource can be applied to any language, and it can be improved continually by acting on implementation results and feedback.
Key principles at the starting level:
- The structure of language in a learning process:
- words
- phrases
- sentences
- paragraphs
- compositions
- High frequency words.
- Words in context.
- Intensive and extensive exposure to the language.
- Apply the AI and NLP principles, and use the available tools.
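The "words in context" principle above can be illustrated with a keyword-in-context (concordance) listing, a common NLP view that shows every occurrence of a word together with its neighbours. The following is a minimal sketch, not a prescribed tool: the function name `keyword_in_context`, the `window` parameter, and the sample sentence are all hypothetical, and the tokenizer is a simple regular expression that assumes English-like text.

```python
import re

def keyword_in_context(text, keyword, window=3):
    """Return (left, word, right) for each occurrence of `keyword`,
    with up to `window` words of context on each side."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits

# Made-up sample text showing one word used in two different senses.
sample = "The bank raised its rates. We sat on the river bank and watched the boats."
for left, word, right in keyword_in_context(sample, "bank"):
    print(f"{left:>25} [{word}] {right}")
```

Reading the two hits side by side shows how the same word takes a different meaning from its surroundings, which is the point of studying words in context rather than in isolation.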
Specific Actions and base information:
- Phonetic alphabet: IPA (it looks hard but is easy).
- Vowel (15) and consonant sounds.
- The music principle: "You can't hear what you don't know".
- The 200 most common words, with the right pronunciation.
- Knowing them lets you understand about 70% of the language.
- The 100 most frequent words.
- See the Oxford list: an overview and outline by topic.
- Sentence structure.
- A good grammar resource.
- Text patterns (the path to good speech and writing).
- Listen (shadowing, repeating) and write (read and listen in order to write; transcribe).
- Intensive reading of the language material helps comprehension a lot; stay focused on it.
- Extensive reading, listening, and transcribing throughout all the steps of the process, to reach an advanced level.
- Engage in conversations (either in person or online).
- AI use: software that checks your writing and spelling.
- Go directly to the advanced topics.
You can't learn to write well only by writing.
You can improve your speaking skills by speaking (fixing errors and picking up new strategies).
Learn words in context; avoid learning isolated words.
- Skimming (get a basic idea of the text, e.g. a magazine or newspaper)
- Scanning (quickly scan for a specific piece of information)
- In-depth reading (after skimming)
- Intensive (time-consuming, high focus, the best way to learn; e.g. intensive practice before an exam)
- Extensive (involves enjoyment)
For a Test:
- Grammar and Vocabulary (rules, tenses, parts of speech, sentence structure, and words in context).
- Reading Comprehension (public articles).
- Listening Skills (podcasts).
- Speaking Practice.
- Writing Skills (write essays).
- Practice tests.
Additional Resources:
- Lewis, Norman, Word Power Made Easy: Vocabulary Builder. Garden City, N.Y., Doubleday, 1949.
- Cambridge: Test Your English.
- Cambridge dictionary.
- IELTS test materials.
- Writing checklist.
- GRE study material.
-------------------------------------------------------------------------------------------------------------------------
Patterns - Training a Model or Training yourself
AI Patterns applied to life.
Statistical information about words is relevant.
- Dimension: how many words does the language have?
- Alphabet (symbols and sounds)
- Words
- Prioritization/Optimization:
- Word list by frequency
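One way to act on the prioritization idea above is to build a frequency list from a text and measure how much of that text the top words cover. The sketch below is illustrative only: `tokenize`, `coverage`, and the toy corpus are hypothetical names and data, and the regular-expression tokenizer assumes English-like text. Coverage measured on a real corpus will differ from this toy example.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def coverage(tokens, top_n):
    """Fraction of running words accounted for by the top_n most frequent words."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)

# Toy corpus (made up); replace it with any text you are studying.
corpus = "the cat sat on the mat and the dog sat on the rug"
tokens = tokenize(corpus)

print(Counter(tokens).most_common(3))  # word list by frequency
print(round(coverage(tokens, 3), 2))   # share of the text covered by the top 3 words
```

Running `coverage` with `top_n` set to 100 or 200 on a larger corpus gives a direct, corpus-specific estimate of how far a short high-frequency list takes a learner or a model vocabulary.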
Implementation dataset sources:
- NLP_project
- GRE high clustering similar words.
- kaggle_word_list_data
- English frequency word list
- https://storage.googleapis.com/books/ngrams/books/datasetsv3.html
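To turn a source such as the Google Books Ngram files into a frequency list, each line has to be parsed into a word and its counts. The sketch below assumes the version 3 layout of one n-gram followed by tab-separated `year,match_count,volume_count` records; that layout should be checked against the dataset page before relying on it. The function names and the sample line are hypothetical, and the numbers in the sample are invented.

```python
def parse_ngram_line(line):
    """Split one line into the n-gram and a list of
    (year, match_count, volume_count) tuples."""
    ngram, *records = line.rstrip("\n").split("\t")
    yearly = [tuple(int(value) for value in record.split(",")) for record in records]
    return ngram, yearly

def total_matches(yearly, since=0):
    """Sum the match counts for every year >= since."""
    return sum(matches for year, matches, _ in yearly if year >= since)

# Invented sample line in the assumed layout (not real dataset values).
sample = "language\t1999,120,30\t2000,150,35\t2001,90,28"
ngram, yearly = parse_ngram_line(sample)
print(ngram, total_matches(yearly, since=2000))
```

Summing only recent years, as `since` allows, is one possible design choice for keeping the resulting word list close to current usage.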
Data structure [PENDING - in writing process]
Implementation actions and tasks:
- Clustering and aggregation (group and select words from the knowledge <area> nearest to you)
- Basic and extended patterns (get the basic idea but reinforce with the corpus):
- Sentence patterns
- Paragraph and text
- Speech patterns
- Corpus and data (reinforcement)
- Text (great writers)
- Video and audio (podcasts, audiobooks, music)
- Dialogue and public speech
- Detect errors (track their frequency, with checklist rules to fix and mitigate them)
- Immersion and test (simultaneous and continuous)
- Place and real people (heavily used).
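The clustering and aggregation task listed above can be sketched very simply by grouping words that share a prefix. This is only a crude stand-in for real clustering by stem, root, or meaning, and it is an assumption of this example rather than the method used in the datasets listed earlier. The names `cluster_by_prefix` and `largest_clusters` and the word list are hypothetical.

```python
from collections import defaultdict

def cluster_by_prefix(words, prefix_len=4):
    """Group words whose first `prefix_len` letters match (a rough proxy for a shared root)."""
    groups = defaultdict(list)
    for word in words:
        groups[word[:prefix_len].lower()].append(word)
    return dict(groups)

def largest_clusters(groups, n=2):
    """Aggregate: return the n biggest clusters, largest first."""
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)[:n]

# Made-up vocabulary list for illustration.
words = ["benevolent", "benefactor", "beneficial", "malevolent", "malefactor", "malice"]
groups = cluster_by_prefix(words)
for prefix, members in largest_clusters(groups):
    print(prefix, members)
```

Studying a cluster as a unit lets one known word reinforce its neighbours, which is the motivation for grouping before selecting what to learn next.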
Train:
- Language corpus text: as much text as possible (write, transcribe, and read)
- Recorded audio and videos: process the sounds.
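As a small illustration of what training on corpus text can mean, the sketch below counts which word tends to follow which (a bigram model), one of the simplest pattern-learning techniques in NLP. It is a toy under stated assumptions, not a full language model: the function names and corpus are hypothetical, the tokenizer is a plain regular expression, and sentence boundaries are ignored.

```python
from collections import Counter, defaultdict
import re

def train_bigrams(text):
    """Count, for each word, the words that follow it in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    model = defaultdict(Counter)
    # Pair every token with its successor; sentence boundaries are ignored.
    for current, following in zip(tokens, tokens[1:]):
        model[current][following] += 1
    return model

def most_likely_next(model, word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    followers = model.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

# Made-up training text; more text gives more reliable counts.
corpus = "I like to read. I like to write. I like to listen and I want to speak."
model = train_bigrams(corpus)
print(most_likely_next(model, "like"))
print(most_likely_next(model, "I"))
```

The same counts double as a study aid: the most frequent followers of a word are the sentence patterns worth practising first.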
Learning and Brain Process
Relevant information:
- The Economist - Your brain from birth to death
- Brain synapses and Alzheimer's.
Database:
https://github.com/chrplr/openlexicon/blob/master/datasets-info/README.md
In Python:
import pandas as pd
# Load the Lexique 3.83 table (tab-separated) directly from its URL.
lex = pd.read_csv('http://www.lexique.org/databases/Lexique383/Lexique383.tsv', sep='\t')
lex.head()  # preview the first rows
In R:
library(readr)
# Load the same tab-separated table with readr.
lex <- read_tsv('http://www.lexique.org/databases/Lexique383/Lexique383.tsv')
head(lex)  # preview the first rows
French lexicon database:
https://github.com/chrplr/openlexicon
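Once a lexicon table like the one loaded above is available, a natural next step is to rank words by frequency and keep the top entries. The sketch below uses only the standard library and a tiny in-memory sample so that it runs without a download. The column names `ortho` (spelling) and `freqfilms2` (subtitle frequency) are assumed to match the Lexique383 file and should be verified against its documentation; the sample rows and their values are invented, and `top_words` is a hypothetical helper.

```python
import csv
import io

# Invented rows in a tab-separated layout; column names are assumptions.
SAMPLE_TSV = (
    "ortho\tfreqfilms2\n"
    "de\t25000.0\n"
    "chat\t60.5\n"
    "la\t14000.0\n"
    "anticonstitutionnellement\t0.01\n"
)

def top_words(tsv_file, n, word_col="ortho", freq_col="freqfilms2"):
    """Return the n most frequent words from an open tab-separated file."""
    rows = csv.DictReader(tsv_file, delimiter="\t")
    ranked = sorted(rows, key=lambda row: float(row[freq_col]), reverse=True)
    return [row[word_col] for row in ranked[:n]]

print(top_words(io.StringIO(SAMPLE_TSV), 2))
```

With a downloaded copy of the real file, the same function could be called on `open(path, encoding="utf-8", newline="")` instead of the in-memory sample.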
Language Models
https://ai.meta.com/blog/5-steps-to-getting-started-with-llama-2/